A Tale of Two Data Center Outages | Pulse Update

Mike Hicks: Hi everyone, and welcome back to the Internet Report's biweekly Pulse Update, where we keep our finger on the pulse of how the internet's holding up week over week.

And what an eventful two weeks it's been. Today we'll be discussing insights from recent data center-related incidents at Microsoft and Oracle, and exploring how IT teams can minimize downtime in situations like these. We're also going to cover other outages that happened everywhere from Twitter to Tesla.

And with so many interesting things to cover, we're piloting a new segment of the podcast that I'm calling “The Download,” my TL;DR summary of what you absolutely need to know about the internet this week, in two minutes or less, or even quicker if I speak faster.

As I mentioned, in the space of a week we saw two data center-related outages at Microsoft and Oracle that lasted for quite a while. These outages underscore the importance of quickly determining whether a safe and graceful failover and/or shutdown is possible in the event of a data center issue. So stay tuned for more on these outages later in the episode.

We also observed outages at Twitter, Atlassian, Fitbit, and Tesla. The Tesla app outage is one I definitely want to discuss further, as it has important implications for companies as more and more cars rely on apps and subscription services to power all kinds of features.

In Tesla's case, due to the outage that impacted the Tesla app, car owners were unable to lock or unlock their vehicles or find charging stations using the app. Thankfully, customers were still able to access their cars with their physical key cards. Who would've thought? But these incidents are still a reminder of how critical it is for companies to guard against backend issues like this as cars become increasingly reliant on apps and the internet.

We discuss this a bit further later in the episode, but I also want to chat more about the events at Twitter, Atlassian, and Fitbit.

Looking at the overall trends, one interesting data point we noticed was that U.S. outages represent a much smaller percentage of global outages than we saw this time last year. Overall outage numbers are increasing both globally and in the U.S., but U.S.-centric outages only accounted for 21% of all observed outages. This is down from 43% for the same period in 2022. It's going to be interesting to see if that trend continues through 2023.

And now let's dive further into the trends we're seeing and explore some of the recent outages observed. As always, I've included chapter links in the description box so you can skip ahead to the sections that are most interesting to you.

We’d also love you to hit like and subscribe, and always feel free to email us at internetreport@thousandeyes.com. We welcome your feedback and questions.

And to discuss all of this, I'd like to welcome back Brian Tobia, Lead Technical Marketing Engineer at ThousandEyes. Brian, it's great to have you back. How have you been, mate?

Brian Tobia: Thanks Mike. Doing well, good to be here again.

Mike: All right, then let's take a look at the numbers this week. What I really want us to talk about and focus on is the percentage of U.S.-centric outages: it's dropped from 43% to 21% if I compare it to the same period last year. So I'm actually looking at this two-week period and the same two-week period last year.

Now, here's my theory on this. What we're starting to see is that when we make a change to a network (and we've talked about this, you've said these change windows account for a lot of the outages we're seeing), the change isn't having so much of the cascading effect you talked about. So I'm able to isolate that change, as it were, and my blast radius, my impact zone, is far smaller.

So although the outage numbers are increasing, what I'm actually seeing from a U.S.-centric perspective is that they're also increasing in terms of numbers, but they're not having as great a global effect. My point is, if I take an interface down that's connected to another interface down here in Australia, then obviously we see two outages. But what I'm starting to believe is that we've been able to isolate, and further control if you will, the outages. Does that make sense or do you think I'm completely off the rails?

Brian: No, I think it totally makes sense, and I think another point is that companies are realizing these outages are happening, so we're deploying to more regions and more global areas. And by segmenting those changes, just like you're saying, to different areas, so that a single change may affect only one network interface or one availability zone, we're not seeing these on a global scale as much now. So I think it's really helping with the distribution of these, and like you were saying, knowing what the blast radius is and how much it's actually affecting. We're still seeing outages, but I think spreading things out, and again where things are deployed, is really having a big impact on that.

Mike: Absolutely. Yeah, and we're always going to see outages, as we've talked about.

Okay, so that's really interesting to see how that's gonna evolve over the rest of the year. But now let's discuss some of the outages from the past couple of weeks as we go Under The Hood.

All right, so we're gonna start with two data center outages, but before we move on to those, just want to say, if you hear some noise in the background, I've just been visited by a whole flock of cockatoos. The joys of living in the outback.

All right, so as I said, let's start with these two data center outages. While the data centers and the cloud workloads they host strive for always-on availability and maximum uptime, there are occasions when things are going to go wrong. It often starts with power availability. And when power is uneven or unavailable, what teams have to do is determine whether a safe or graceful failover and shutdown is possible. Obviously, if we're able to shut things down gracefully, then we reduce the risk of damaging data within our systems as well. We also have to be clear about what exactly is going to be shut down. If we take out a critical service or an aggregation point, obviously we're going to have major issues.

On February 7, Microsoft experienced a data center cooling event. This was in Southeast Asia, and it was caused by a mains brownout that led to several chillers, the units that provide cooling to the data center, shutting down. This reportedly caused difficulty for some customers, particularly in this region, Southeast Asia, accessing a wide variety of cloud services including Teams and many Azure services. And this disruption extended quite a long time; it lasted around 32 hours, with services intermittently coming on and off throughout.

So what we saw from this is a couple of things, really. Microsoft identified that the chillers tripped out due to a voltage dip. It only affected a single availability zone, and the power management system they had in place was designed to cope with these sorts of shutdowns. But in this particular instance there was a cascading effect, and a subset of the chillers went down. Without enough cooling, obviously the ambient temperature in a data center is going to rise, and we risk damaging the equipment. So this is what I'm saying: they had to make the choice to actually shut things down, Brian.

So when you actually look at this kind of system, we've talked about, or I've talked about, this aspect of having a control system. What are you going to do to make decisions? How would you propose we look at this to decide what we shut down, what stays, what goes, that type of thing?

Brian: Yeah, it's an interesting point. I think, one is I think it's important to test, right? So to understand what those shutdown procedures are and maybe have kind of a list and understand what things are easier to, you know, maybe bring down if necessary or what things are replicated or in different regions. So I think the availability and where the different pieces live and how easy those are to either bring back or to fail over to different infrastructure deployed in a different area. I think you kind of need to make that checklist to help understand and make informed decisions rather than just start shutting some applications or servers off because of a power or a cooling issue. We've seen these a lot so I think it really comes down to that.

Mike: Yeah, that's actually really interesting. So it's not just the ability to shut things down and knowing what we can take down, but also, as you said, what it takes to actually bring them back up again. How quickly can we restore those? And obviously part of this as well is that, again, these things happen. So we have plans in place, and once we experience something like this, you can actually go through and learn from it. Ah, we thought that was a critical piece; it wasn't. Or it took an hour to come back online.

Brian: Yeah.

Mike: So, Microsoft wasn't the only cloud provider to experience data center issues in February. On February 14th, a good day for all of us to remember, Oracle had a separate issue, this time with NetSuite. NetSuite is the Oracle-owned provider of ERP and CRM software.

This was reportedly due to a possible electrical fire at a third-party facility hosting NetSuite infrastructure. Actually in Boston, Massachusetts. So right in your zone, Brian.

Again, this led to a controlled shutdown and an evacuation of the site. Again, this went on for a lengthy period of time; services were off for almost 24 hours, according to reports we've seen.

A couple of points here. The first one, again, goes back to your point about understanding what we need to shut down. In this case, they had to take the whole data center, the whole facility, down. But the localized nature of it meant that the impact was only seen by the users relying on that service, or that region, I should say. Those are the people who were impacted the most when they tried to log on.

But it does appear that there was some ability to sort of switch services and to be able to switch to some sort of disaster recovery site. But once the power is restored, Brian, what about this situation of bringing the services back up? How can we be sure that we're not doing anything in terms of corrupting data or having these gaps as we're cutting them back across from a disaster recovery site?

Brian: Yeah, I think it's really important to have testing in place so you know what an application looks like in a good steady state. So as you bring things up, what does that look like? Are we seeing good data coming through? Have API connections, or API tests, in place to make sure that's looking right. And then have it documented too as you bring things up, because we've seen before that it's not just the outage that's the problem. Once network connectivity is restored, we can see additional impacts, whether it's a control plane not acting properly or, like you just mentioned, data not coming through or not being totally there. That's when we start to see impact. So it's important.
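
To make that idea concrete, here's a minimal sketch of what a post-restoration steady-state check could look like. It assumes a hypothetical /health endpoint returning a "last_sync" timestamp; the URL, field names, and thresholds are placeholders for illustration, not any specific vendor's API.

```python
# A minimal sketch of a post-restoration steady-state check.
# HEALTH_URL and the "last_sync" field are hypothetical placeholders.
import time
from datetime import datetime, timezone

import requests

HEALTH_URL = "https://example.internal/api/health"   # hypothetical endpoint
MAX_DATA_AGE_SECONDS = 300                           # steady-state expectation

def check_steady_state() -> bool:
    """Return True only when the API answers and its data looks fresh."""
    try:
        resp = requests.get(HEALTH_URL, timeout=10)
    except requests.RequestException as exc:
        print(f"API unreachable: {exc}")
        return False

    if resp.status_code != 200:
        print(f"Unexpected status: {resp.status_code}")
        return False

    body = resp.json()
    # Assumes an ISO-8601 timestamp with a timezone offset.
    last_sync = datetime.fromisoformat(body["last_sync"]).astimezone(timezone.utc)
    age = (datetime.now(timezone.utc) - last_sync).total_seconds()
    if age > MAX_DATA_AGE_SECONDS:
        print(f"Data is stale: last sync {age:.0f}s ago")
        return False

    return True

if __name__ == "__main__":
    # Poll until the service reports healthy, e.g. while cutting back from DR.
    while not check_steady_state():
        time.sleep(30)
    print("Service is back to a known-good steady state")
```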

Mike: All really good points that we've seen from there. In addition to these data center-related incidents at Microsoft and Oracle, we observed other outages over the past couple of weeks that had a long duration or large geographical impact and in some cases both.

We've been observing Twitter for some time given the recent major changes at the company, and although the infrastructure and architecture have proven to be resilient so far, the application itself has experienced some glitches. So on February 9, some Twitter users reported being erroneously greeted with warnings that they were over the daily limit for sending tweets and would be restricted from tweeting any more. The limit is around 2,400 a day. Users also reported encountering messages that they'd breached other limits, such as the number of people they could follow each day, and these errors kept coming back.

What we saw here was more of a functional outage than an availability or reachability outage; we could always get to the platform. But I think it's important to point out that any functional issue within the platform typically points to an application issue.

But how can we go about identifying that it's a particular function within the platform that's causing the problem, and establishing that it's not an issue specifically with connectivity, or something at the data center or infrastructure level?

Brian: Yeah, and I think we're going to look at that coming up in another example, but I think one of the important things is to test the full stack, all the way from the network up to the application itself.

So to your point, if the network looks good and we can make a connection, that doesn't necessarily mean the application's available. So we have to look at what the HTTP request looks like, or, in your case, you can even do something like a transaction test to make sure you're getting the behavior you're expecting: if I click on something, I'm expecting to see a certain status page or a certain response back. If you really need to understand what the app is doing, you can set up a test that interacts like a user, using something like browser synthetics, to actually pull that kind of data back and understand where the problem is. Is it the network, or is it all the way up at, like you mentioned, the over-the-limit warnings or some sort of application error that's not functioning properly?
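
As a rough illustration of that idea, the sketch below shows a browser-synthetic-style transaction test. It uses Playwright as one possible tool; the URL and element selectors are purely illustrative stand-ins, not the actual selectors of any real application.

```python
# A minimal sketch of a browser-synthetic transaction test, assuming
# Playwright is installed (pip install playwright && playwright install).
# The URL and selectors below are hypothetical placeholders.
from playwright.sync_api import sync_playwright

def run_transaction_check(url: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Step 1: load the page and confirm the server answered with a 2xx.
        response = page.goto(url, wait_until="load")
        assert response is not None and response.ok, (
            f"Page load failed: HTTP {response.status if response else 'no response'}"
        )

        # Step 2: interact like a user and assert the expected element appears.
        # "#compose-button" and "#editor" are illustrative selectors only.
        page.click("#compose-button")
        page.wait_for_selector("#editor", timeout=5_000)

        browser.close()
        print("Transaction check passed")

if __name__ == "__main__":
    run_transaction_check("https://example.com")
```

A check like this fails not only when the network or server is down, but also when the page loads and a specific function inside it stops working, which is exactly the kind of functional outage described above.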

Mike: Yeah, so that's good. So you're basically looking to test the functionality of it from a usability perspective—

Brian: Exactly.

Mike: —not just from a network perspective. That's really interesting. Now, the thing we have to remember here is that these breakages and glitches are likely to become more common.

We're living in this agile world where we're constructing more applications using APIs, different microservices, and third-party components. Any one change in there could have an effect. Which goes back to your earlier point, I guess, that you want to be able to test the full stack, but you also want to understand all the components within that delivery chain.

Brian: Exactly.

Mike: All right, so let's move on to the next outage. On February 15, at around 9:04 AM UTC, Atlassian appeared to experience an outage that impacted access to some of its services. We observed a duration of around 50 minutes, and it impacted all regions. So, Brian, let's take a more detailed look at this.

Brian: All right, yeah, so we're looking at an Internet Insights snapshot view here, looking at the Atlassian services as you had mentioned, starting around 9:05 UTC. And what this is highlighting for us is that we saw global availability drop to the different services. So within ThousandEyes, we test from all different locations. You can see on the left we have vantage points in the U.S., U.K., Canada, all across. And we can see these are all in red. Red is bad obviously here.

So if I click on the 21 over at Atlassian here, it will expand out, and you can actually see that we have 21 affected servers tied to that specific domain. And you can see here the HTTP response code error type, so it's actually telling us what we saw that was wrong. At a quick glance, using something like ThousandEyes, or our outage map as well, you can identify exactly what went wrong. In this case we saw, again, global reachability drop from the different vantage points to these services.

Mike: Yeah, and back to my patterns again, if we look at that timeline above, you're seeing an almost lights-on, lights-off type of scenario, which I think we'll get to.

Brian: Yeah, great point. So now we're looking at a more detailed view of the test itself to the JIRA application, which is one of the applications hosted on their infrastructure. And this is an example of a page load test.

So this goes back to our point before about full stack observability, right? You're looking at the full transaction. We profile everything from the initial DNS request all the way up to the actual page load. And you can see again that dip in performance. So we're looking at availability: the green obviously is good, so we're at a hundred percent, and then during that outage we identified, that's where it dips to 0%. And from a global standpoint, you can see on the map below all those different agents we're collecting from, and the data we were getting back obviously showed a failure here.

But what's really interesting, I think, is to look at the different phases of this. For instance, if I go over to our network view, which is called Path Visualization, we'll notice that the connection was actually fine. So the network in this case was perfectly fine, right? We've got all our agents on the left, you can see here, everything's green connecting to their services. And if you were just monitoring from a network standpoint, you would not have caught this. You would've said, yeah, everything looks good, I can connect to their servers from the network. Everything was good, but as we know, that's not the case here. So instead we can switch over to something like the HTTP view, and this is going to tell a much different story.

So if I click on table view, we're now going to see what we got back, and here are all those 503 errors, right? And this is where the problem was. Not on the network side; these are the actual HTTP errors we were getting for all those different agents trying to connect to JIRA.
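
The distinction Brian is drawing here, that the network path can be healthy while the application returns 503s, can be approximated with a simple layered check. The sketch below separates a TCP connect test from an HTTP status test; the hostname is a placeholder, and this is only an illustration of the idea, not the tooling described above.

```python
# A minimal sketch separating a network-level check (TCP connect) from an
# application-level check (HTTP status). The hostname is a placeholder.
import socket

import requests

HOST = "example.atlassian.net"   # placeholder target
URL = f"https://{HOST}/"

def network_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Layer 3/4 check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def application_healthy(url: str, timeout: float = 10.0) -> bool:
    """Layer 7 check: does the service answer with a non-5xx response?"""
    resp = requests.get(url, timeout=timeout)
    return resp.status_code < 500

if __name__ == "__main__":
    net_ok = network_reachable(HOST)
    app_ok = application_healthy(URL) if net_ok else False
    # During an incident like the one described, this would read:
    # network reachable: True, application healthy: False (503s).
    print(f"network reachable: {net_ok}, application healthy: {app_ok}")
```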

Mike: Yeah, and that's good. I think one of the other things here is, again, we've already shown the network path is there and it's available, but the fact we're getting a 503 is indicative of something on the server side, so we're getting something coming back from there. From the application perspective, it's a "service unavailable." We can assume it's making some sort of backend call because we're getting to the front door, which again, I think, underlines that point you were making earlier about testing the transaction stack right through. Not just knocking on the front door, but understanding what goes on right through.

There's been no explanation provided about this so far, and that's fine; there also wasn't a lot of chatter, so it doesn't seem to have been overly impactful. Given the time of day, it might have been that something was deliberately taken down, although it went on for a fairly long time. But the fact that we saw this sort of on-off pattern suggests it may well have been some kind of maintenance exercise.

But again, this underscores the importance of understanding that entire service delivery chain so that we can see the dependencies and connections. As we were talking about right back at the beginning with the numbers, it's about minimizing the footprint of where my blast radius is going to be.

Brian: Exactly.

Mike: Well thanks, Brian. So let's move on to the next outage. This one's kind of interesting. There's not too much to show, but a Tesla app outage in Europe garnered a lot of attention, with car owners unable to lock or unlock their vehicles or find charging stations with the app. A critical thing. Actually, I had an issue the other day with someone who will remain nameless (it's my daughter). The battery went out in the remote for the car door, so she couldn't unlock the car until I showed her how the actual key worked. So, you know, a similar type of thing.

So the issue affected people outside of North America; it was mainly a European thing. And like I said, there were ways to work around the outage itself. We could actually just put the key in the door, you know. We could remember where the charging stations were. But the reason I highlighted this is, again, to talk about the fact that we're now relying on the internet to deliver these capabilities. When there are backend issues like this, it becomes really important to understand what's going on in that area.

Brian: Yeah, and I think we've seen these sometimes with connected homes too. Doorbells and security cameras and even locks. I think we rely on them so much that you need to understand where these outages are and how they might affect you.

Mike: Exactly. Exactly. On that same theme, users of Fitbit also encountered syncing issues between their devices and the mobile app. Again, this extended over a 24-hour period. The issue again appears to have been some kind of backend or application issue, and the company has since said it has resolved the problem.

Users reported that they couldn't sync, they couldn't access third-party applications, integrations were failing, and they couldn't get onto the Fitbit forums. It didn't stop the actual watch functioning, all right. The watch still worked; you were still getting a step count, but you couldn't sync.

So, again, we're talking about a complex system, a system with multiple components and backends, where we're connecting over the internet for some of this syncing. But any one component could break that chain. Everything was functioning in isolation, but we couldn't put everything together. So again, this just highlights the interconnected world, the reliance on the internet, and in this case the application or the system itself. The Fitbit watch almost became unusable, or at least not usable in its normal state.

Brian: Yeah, and I think an interesting point there too is that we saw a recent outage around Google Maps as well, where it's actually embedded into other platforms. And I think it just highlights that it's not just your application, but also the APIs you depend on. With something like Google Maps, your page loads, but then a widget on the page fails. Or if you rely on the Google Maps API or a different API for providing routing services, whatever it is, and that goes down, then a component of your application has failed.

We've seen this time and time again now that we're so interdependent on different vendors. So that's why it's so important to not only monitor your application, but you know, those that you rely on as well.

Mike: Yeah, that's really important, and it's a nice segue to a closing theme around everything here. We've talked about understanding all these interconnected dependencies and how any one component can cause you problems. On the flip side, if you can identify where those components are, that also shows you the areas where you can optimize your delivery and make things better.

Brian: Exactly.

Mike: Well thanks, Brian. As always, it's been an absolute pleasure and as we said, we could rabbit on about this for hours and hours. But thanks very much.

Brian: Yeah, my pleasure. Thanks for having me on. I'll definitely be back soon.

Mike: That's our show. Don't forget to like, subscribe, and follow us on Twitter. As always, if you have any questions; feedback—either good, bad, or ugly; or guests you'd like to see featured on the show, just send us a note at internetreport@thousandeyes.com. So until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com