Lessons From the FAA, Fastly, & Microsoft Outages | Pulse Update

In this episode, we cover the latest internet trends and unpack important takeaways from the recent FAA, Fastly, and Microsoft outages.

Mike Hicks: Hello! This is The Internet Report's biweekly Pulse Update where we keep our finger on the pulse of how the internet's holding up week over week, exploring the latest outage numbers, highlighting a few interesting outages, trends, traits, and the general health of the internet.

Today we're going to be discussing the nuances and resulting outcomes of how networks interact with applications and vice versa, how to make changes in an environment that really runs 24 hours a day, and what happens when a planned change doesn't quite have the intended outcome.

And to do all of that, I'm joined this week by my good friend and colleague, Kemal Sanjta. How are you, mate? How have you been? I know you've had a busy week. It's been ages since I last spoke to you.

Kemal Sanjta: Hi, Mike. Good to be on The Internet Report's Pulse Update. Yeah, it's been busy, but busy is good, I would say. Besides that, I'm really looking forward to speaking about some of the observations we had and some of the lessons we learned from these new events that we're going to discuss today.

Mike: Before we get started, in terms of housekeeping, we'd love for you to hit "Like" and "Subscribe" so you too can keep your finger on the pulse of the internet each week. And please keep your feedback coming in; what we've received so far has been brilliant, and it really is appreciated. It helps us shape the show. You can reach out to us anytime at internetreport@thousandeyes.com, and we'll do our best to address your questions in future episodes.

All right, with that, let's get started and look at my favorite part of the week, the numbers.

So if we look at the figures, we've seen this increase as we come out of the January timeframe. Then we see a slight dip coming into the first week of this Pulse Update period: in the week of January 16, we see a very small decrease, about 3%. Again, this is kind of seasonal, so things look fairly even there.

But then, going into the next week, we see this huge increase. We jumped from 245 to 373 observed outages globally, which is roughly a 52% increase. And the pattern is matched domestically: looking at North America, the figures you can see there in black, we went from 57 to 102, so nearly doubling. Now, when I talk about these numbers being an indicator of the health of the internet, this is what I mean.
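
For anyone who wants to check the math, here's a quick sketch using only the outage counts quoted above:

```python
# Quick check of the percentage changes quoted above, using only the
# outage counts mentioned in the episode.
global_prev, global_curr = 245, 373   # global observed outages, week over week
na_prev, na_curr = 57, 102            # North American observed outages

def pct_increase(prev: int, curr: int) -> float:
    """Percentage increase from prev to curr."""
    return (curr - prev) / prev * 100

print(f"Global:        {pct_increase(global_prev, global_curr):.0f}% increase")  # ~52%
print(f"North America: {pct_increase(na_prev, na_curr):.0f}% increase")          # ~79%
```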

So as I said, I expect to see increases, but this is a large increase. And what that tells me is that there's some sort of major occurrence going on in this period, which we'll get to in a minute.

But what are your thoughts when we see figures like these? What do they tell you?

Kemal: This is quite interesting, in fact. If you think about it, we were coming off of a really quiet period. There was the holiday season, change freezes went into effect, and people hopefully spent some time off with their families. And then what you end up seeing is, let's do our work all over again. The fact that you paused your change management procedures means those changes are going to pile up. And some of these changes are being executed to improve things, but often just to scale up the network. Especially for the large providers, if you delay something, that just means there's going to be more on your plate once you come back. So it looks to me like this is piled-up change management work, plus scaling efforts and things like that.

However, it's quite a jump, if you think about it. Globally, that's a big increase. Now, I'm pretty sure some of these were connected with the outages we saw happening last week, so we can touch on that a little later on.

Mike: The other aspect of this, before we move off the numbers, is this. If I look at the percentage of U.S.-centric outages, and this is something we can go into in detail in another podcast, I guess, they accounted for 26%, versus 27% in the previous fortnight. The reason I bring that up is what we've seen evolving over a period of time. Like I say, I look at these numbers all day, every day. I see them in my sleep, dancing. It's like that scene in Dumbo with the pink elephants dancing around. I just see numbers.

But what I'm seeing is that the U.S.-centric providers and network providers have large networks, so when you make a change in one area, even an engineering change, you get almost a domino effect where it starts to come down and impact other regions. What I've started to see now, though, is that U.S.-centric influence decreasing, from around 46% down into the twenties. Like I said, we won't go into it now, we can cover it in more detail another time, but it suggests an improvement in reducing that impact.

Now, I'm assuming some of this might be coming from the evolution of the technology; we're moving to more SDN, more software-defined types of environments. But some of it is also becoming more aware of how the interconnection dependencies fit together. What are your thoughts there?

Kemal: Yeah, I agree. You know, the thing is, if you think about networks, it's a tier one, or even tier zero, service. If the network goes down, everything goes down, right? There's a negative cascading effect. And even though we, as an industry, have come really, really far, I think it's still, to a certain degree, a fragile environment. Even changes that are small in nature, that should be benign, can actually have negative effects.

And depending on how good your visibility and network monitoring is, and really the granularity of the visibility that you have, you're going to see that impact, and potentially your customers are going to see the negative effects of those changes.

So again, networks are kind of fragile, and as you start making changes, which you must do, it's an inevitable cycle of the industry, you're going to see effects even from things that should have been benign.

Mike: Exactly, exactly. So, let's shift gears and take a look at some of the outages we've seen over the past couple weeks in a little more detail.

All right, so throughout January, we've observed several outages and disruptions that reinforced this need to really understand what's going on. You just talked about visibility and understanding my dependencies, but they also pose challenges when it comes to appropriate monitoring: what do we actually need?

So, the first one I want to touch on happened on January 11. It was the FAA outage that caused U.S. airspace to be temporarily closed down when the Federal Aviation Administration's Notice to Air Missions system, NOTAM, went down. NOTAM is effectively the information-sharing system that aircraft need before they can take off; flight plans get locked in based on that information. If that system's down—

Kemal: It grounded everything, right?

Mike: Correct. Yeah, yeah. And this was significant. So the initial explanation that came out was that it was a damaged database file. It then turned out that the file had been accidentally deleted. So there was a dependency on a particular file. It wasn't a network issue; it was a dependency within the application itself on this one file.

But it does pose a couple of problems. Or questions, rather, not problems, right? It poses a couple of questions that I can think of.

So one is, this is a 24/7 system. There were 11,000 flights that were grounded, that couldn't take off. Then there's the flow-on effect in terms of scheduling and canceling flights; I had a canceled flight last week and I know how annoying that can be. But if I can't interrupt this system, if I can't do in-house engineering, if I can't test this system and its dependencies, how could I go about verifying it? Or rather, what steps might I take to mitigate the risk of this happening again?

Kemal: You know, that's a good question. I think the majority of cases similar to this one are going to tie back to operational excellence, honestly. It's really hard to predict what's going to happen in certain scenarios. And not only that: in the majority of cases, to your point, you might be dealing with legacy systems that were built maybe decades ago, where deleting a single file has a negative cascading effect on the complete application and grounds an entire fleet in the U.S., right?

Which is really surprising, if you think about it, in 2023, when we're taught to build highly redundant applications that can survive these kinds of partial outages. Having a single file deleted, in production and on the backup, causes a really big problem.

Ultimately, it comes down to operational excellence. Do you have your change management procedure ready? Is it properly reviewed? Are you limiting the blast radius of your change? Who reviewed the change? Is a sufficiently experienced person executing it? Those are all questions that need to be asked, at least until these applications are modernized.

Mike: Yeah, that's good. Operational excellence is a process, and people have to be involved in that process. And then the other aspect, of course, is that you add visibility on top.

Let me shift a little and go into another outage. Okay, so on January 19, we observed a range of web applications going down: PayPal, Cash App, Target, Reddit. When we typically see a set of applications go down together like this, we look for a common point. We were just joking about those common aggregating points, but when I see a pattern like this, spread across different applications, I'm looking for some sort of common host or shared dependency.
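
As a rough illustration of that "look for a common host" step, here's a minimal, hypothetical sketch. The domains, provider mapping, and failure data are invented for the example; it simply groups failing front ends by the network that serves them:

```python
from collections import Counter

# Hypothetical observations: which front-end domains were failing, and which
# edge/CDN provider each one was served by at the time of the incident.
failed_domains = ["paymentsapp.example", "cashapp.example",
                  "retailer.example", "forum.example"]
serving_provider = {
    "paymentsapp.example": "Fastly",
    "cashapp.example": "Fastly",
    "retailer.example": "Fastly",
    "forum.example": "Fastly",
    "unaffected.example": "OtherCDN",
}

# Count how often each provider appears among the failing domains.
provider_counts = Counter(serving_provider[d] for d in failed_domains)
top_provider, hits = provider_counts.most_common(1)[0]

if hits == len(failed_domains):
    print(f"All {hits} failing apps share one provider: {top_provider}")
else:
    print(f"No single common provider; top candidate: {top_provider} ({hits}/{len(failed_domains)})")
```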

So if I then look at it from the server network view, what I start to see is Fastly, which gives me an indication of where the potential problem lies. We're looking at the application layer, but what I also see is that the outage was really short. And I've picked this one out for a couple of reasons, which is why it's notable.

For one, it was really clean, right? It was a six-minute outage. It was identified and mitigated. It was what I call a light switch: there were no problems, then there were problems, the problems escalated, and then there was a resolution. It was turned off, it was turned on, and that was it.

The other thing I want to point out is that Fastly was really good about getting information out. They identified that the issue occurred after a change they had made, recognized there was a problem, and rolled it back. And then we saw, almost straight away, the rollback and the rectification, and things came back online. Very short duration. So it was really interesting, and it was short.

The time of day it occurred is also indicative of what was going on. One, we've identified it was tied to a change; we saw it happening at the top of the hour, and it was at 6:00 PM, which is kind of interesting and something I'd like to explore.

So, what do we make of that? Why would one make a change at 6:00 PM? I realize it was a very small impact, six minutes, but what are we starting to see, and what do you think about patterns like that?

Kemal: Yeah, the first thing I observed with this event is that they were really quick to say on the status page, "We observed the issue. We identified it and we are remediating." I think the communication here was spot on, and kudos to Fastly on that one.

Now again, operational excellence is a big topic of today's conversation. This was probably a change that slipped in and did something unexpected, and, fortunately for Fastly, they discovered it really fast and rolled back. However, the interesting thing about all CDN and DDoS mitigation providers is that they have this enormous impact when something like this happens, right?

So, for example, what happens is, essentially, customers advertise all of their stuff to the CDN, and the CDN readvertises it to the rest of the internet, or it depends on how it's configured, right? But essentially, the first request for any kind of service that uses a CDN usually gets terminated within the CDN's network, right?
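
As a small illustration of that front-door role (a sketch only, with a hypothetical hostname, using the third-party dnspython library), you can often spot a CDN dependency just by following the DNS chain a site hands back:

```python
import dns.resolver  # third-party: pip install dnspython

hostname = "www.shop.example"  # hypothetical CDN-fronted site

try:
    # CDN-fronted sites commonly alias their public hostname to a name
    # inside the CDN provider's domain, so a CNAME lookup reveals the edge.
    for record in dns.resolver.resolve(hostname, "CNAME"):
        print(f"{hostname} is an alias for {record.target}")
except dns.resolver.NoAnswer:
    # No CNAME; the site may instead publish A records that sit
    # inside the CDN's own address space.
    for record in dns.resolver.resolve(hostname, "A"):
        print(f"{hostname} resolves directly to {record.address}")
```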

And this is a perfect example of it: when you showed the server network view, Internet Insights very quickly collapsed the issue down to Fastly's network, right? That actually says a lot about Internet Insights itself and how quickly you can narrow down where the potential issue is. I really like that functionality; that was pretty nice as well.

But yeah, in general, it's operational excellence. These CDNs are always going to have a global effect, just by the nature of the business they're in, similar to DDoS mitigation companies. However, it's not the first time we're seeing something like this, and it won't be the last. If you recall, around the middle of last year there was a huge Akamai outage that affected everyone as well. Now, the difference between this one and the Akamai outage is that, while the scale was similarly global, the speed at which Fastly resolved the issue was significantly faster. So again, kudos to Fastly in this particular case.

Mike: But let's move on now to another quite subtle outage. No pictures here, but I want to talk this one through because it was fascinating. On January 17, a subset of Microsoft 365 users in the eastern U.S., up the Eastern Seaboard around the Massachusetts area, reported application issues.

Microsoft opened an investigation and, after an hour, said it wasn't a typical application issue; it was limited to a very specific group of users, and they noted that their telemetry indicated the Microsoft-managed network environment was healthy. So everything on their side was good; it was actually limited to a specific ISP. They were then able to identify the common denominator within the ISP, which ISP it was, and notify them. And then, through a combination of resets and traffic rerouting, the ISP was able to resolve it and the application accessibility problems went away, right?

So in terms of a global impact, the outage was limited. I used to live in Massachusetts, so I can feel their pain about losing connectivity. But what was really subtle was this concept that it was impacting one application. It was only affecting Microsoft 365; all the other applications running over the same ISP from those endpoints were working fine.

It's not clear exactly what went wrong on the ISP side, but I have seen this situation before, in a couple of instances. There's a complex interaction between the application and the network itself. It's not just a question of understanding that we have connectivity; the network is more than just rails. The network is going to interact with the application, and the application is going to interact with the network, in different ways and in different areas.

Kemal: That's correct. And this again speaks to the fact that even though we've come a really long way in improving our networks and improving resiliency, they are still very fragile, right?

The thing is, in this particular case, we're talking about a really large tier one provider, right? They have a global footprint, and depending on how Microsoft was advertising their prefixes for that particular service, or set of services, it depends whether the prefix was isolated to that service or shared across different services. That's the first thing we would need to look into.

The second thing is, it all depends on how this global provider was routing that prefix, right? Potentially, it was routed onto a path that had issues. Now, the internet is quite an interesting medium; a lot of different things can happen, as you already know. Sometimes you're going to deal with oversaturation on certain interfaces, right? Sometimes you're going to have CRC errors corrupting traffic across a large portion of the network. And sometimes you see an isolated example where, for no obvious reason, a single application is affected. Now, to your point, whether that's some specific tuple being affected for that particular flow or something like that, who really knows, right?

And that also speaks to the requirement of having really good RCAs. RCAs are not just for the company to say, "Look, we identified the issue; there's a problem that we observed." And it's not about publishing them so that people can assign blame. It's actually for the community to learn from, right?

Mike: Yeah, absolutely. That's really interesting. You talked about the RCA, the root cause analysis, and that's an interesting point I want to put a pin in. I totally agree that if you publish it, I can learn from it, but then how much information should you actually pass on? I think we're going to touch on that a little when we move on to the next outage, which in this case is a Microsoft outage. I'm going to leave it to you to take us through it, Kemal.

Kemal: This was a pretty large-scale outage that Microsoft experienced. What happened is that on Wednesday, January 25, around 7:05 UTC, Microsoft observed an issue in which multiple services, such as Outlook, Teams, SharePoint, and others, became really unresponsive for customers.

Looking into the details of the event here at ThousandEyes, we observed that this was clearly a BGP-related event, or something that affected BGP to that extent. And if you look at the collectors that we have, pretty much everything went red, which is quite indicative of what went on.

Mike: That's really good. There are a couple of things that come out of that, and I should probably preface this by saying that you've just written a rather brilliant blog that we'll put in the show notes. And I know you and Angelique have just done an Internet Report podcast that goes into detail. So I strongly urge everybody to go listen and read those; they'll give you a really detailed understanding and analysis of what went on.

But what I want to pick up on is a couple of interesting points. We now know that Microsoft has said this was the result of a configuration change, where they tried to change an IP address, I believe, on one of their WAN routers.

Talk us through it a little: how would that manifest itself? Is there any way they could have avoided the massive degradation that followed, or was it just the nature of the change that was made?

Kemal: That's a really good question, Mike. This event is quite interesting in its own right. Based on the preliminary RCA that Microsoft published, it looks like it was indeed an IP address change on a WAN device, right?

Now, if you think about how BGP works, this really shouldn't have happened, right? Worst-case scenario: you change the loopback IP address of the device that everyone is peering with, everyone tears down their BGP sessions, and the prefixes get withdrawn. That's the scenario.

And Microsoft is, without question, operating a highly redundant, highly reliable network; that in itself is not in doubt. However, what then can explain this kind of event? That's the real question here. I think in this event, we might have seen two things unfold.

The first one is the IP address change that caused the initial effect: a negative, quite big global outage, quite visible to customers and everyone who was watching. And then, I'm pretty sure there were a lot of internal pages, with SRE teams and various other teams being engaged to check the health of the services they usually monitor. And they realize, "Okay, there is this networking change that happened." The typical, normal human reaction to an event like this is, "Let's roll back," right?

However, in this particular case, if you roll back, you cause all of that change all over again. So that's what I think was happening at 7:42. I think they essentially rolled back, and that caused, again, a flux of BGP updates that potentially caused the CPUs to spike quite a lot, trying to compute what goes into the forwarding tables and so on.

Now, again, modern, capable WAN devices should be able to compute this in a really short amount of time. However, we're not talking about a typical company here; we're talking about Microsoft, which is really well peered, with multiple transit providers and multiple peering and transit routers. So the size of the RIBs was probably really large. And what I'm describing now is one of the theories that could explain what we were seeing. Essentially, if this were to happen on a really busy, heavily interconnected, heavily peered route reflector, we might see something like this: the reflector itself struggles to recompute, then it readvertises prefixes, everyone else is busy churning through the large routing tables, and then you roll it back on top of that, causing it all to happen over again.

The other scenario I was thinking about is that, potentially, the SDN controller that holds the complete logic and intelligence for what actually gets installed into the FIBs went offline, from the IP address change or something like that. The devices lose that intelligence and start computing things on their own. Both of these theories are things that could happen in a large-scale environment, and I'm just speaking from network engineering experience at a large-scale company.
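
To give a feel for the rollback theory Kemal describes above, here's a toy back-of-the-envelope model (not a real BGP implementation; the prefix counts and processing rates are assumptions, not Microsoft's figures) of how a change plus its rollback roughly doubles the update churn a busy peer has to absorb:

```python
# Toy model of BGP update churn; all numbers are illustrative assumptions.
PREFIXES_ON_SESSION = 900_000   # assume a near-full table on a busy peering session
UPDATES_PER_SECOND = 50_000     # assume a router's BGP update processing rate

def churn_seconds(update_events: int) -> float:
    """Seconds of pure update processing for a given number of BGP events."""
    return update_events / UPDATES_PER_SECOND

# Original change: sessions reset, so every prefix is withdrawn, then re-learned.
original_change = 2 * PREFIXES_ON_SESSION
# Rolling back repeats the same withdraw/re-advertise cycle.
rollback = 2 * PREFIXES_ON_SESSION

total = original_change + rollback
print(f"~{total:,} BGP updates to process per peer, "
      f"roughly {churn_seconds(total):.0f}s of table churn each")
```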

Mike: Finally, on that point: we're always on about status pages, and in this case the status page was reasonably up to date, although it still lagged behind what was occurring; we were seeing alerts come through almost immediately.

What do you think? In terms of delivering information via the status page, how much should they have shared? How much do you think they could have put on the status page, and how much is it incumbent, effectively, on you as a user, or you as a customer, to have that visibility yourself?

Kemal: It's an interesting balance, I think, right? The first thing is that you, as an end user or as a company that heavily relies on software-as-a-service applications such as Office 365, should, I fundamentally believe, have your own monitoring set up. And not just for this particular case; it could have been any other service, right? If experience has taught us anything, it's that this happens to everyone; there's no company that hasn't suffered a similar outage at some point. The other thing is, you really want to be in a position to rule out where the issue is. Without visibility, you have no way to see where the problem was.

For example, if I were a SaaS user at the time, I might be asking myself, "Is it truly my network, or is it Microsoft?" And I would potentially suspect my own network, given all the experience and high levels of uptime that Microsoft usually provides. Regardless of what information a provider shares, and how quickly, you should really strive for your own visibility. Not just for the services that are in the cloud, but for your own services that you operate, pretty much everything: you want to be in a position where you are in control, right?
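
To Kemal's point about owning your own visibility, here's a minimal, hypothetical sketch (the URLs, thresholds, and interval are invented) of the kind of independent availability probe a SaaS customer might run alongside whatever the provider's status page says:

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints; in practice you'd probe the actual SaaS front
# doors you depend on, from the networks your users sit on.
TARGETS = [
    "https://mail.saas.example/health",
    "https://collab.saas.example/health",
]
INTERVAL_SECONDS = 60

def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (is_up, detail) for a single HTTP availability check."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, f"HTTP {resp.status} in {elapsed_ms:.0f} ms"
    except urllib.error.HTTPError as exc:
        # Server answered, but with an error status (e.g., 503 during an outage).
        return exc.code < 500, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, timeout, connection reset: likely a network-path problem.
        return False, f"unreachable: {exc}"

while True:
    for url in TARGETS:
        up, detail = probe(url)
        print(f"{time.strftime('%H:%M:%S')} {'UP  ' if up else 'DOWN'} {url} ({detail})")
    time.sleep(INTERVAL_SECONDS)
```

Even a probe this simple helps answer the first question in the example above: is it my network, or is it the provider?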

And then the other thing is, how much information goes into these public announcements? I think a better question is, how quickly do you let people know that you are aware of the issue and dealing with it? That's more meaningful to the customer, right?

But the second thing is, kudos to Microsoft here for being transparent and providing a glimpse of what really happened. What I ultimately want to see is what I can learn from it and how I can improve my own network based on someone else's experience; that's the best outcome going forward.

Mike: Exactly, that's the whole point. And we've come full circle to that single aggregating point, but also to that community learning, providing that visibility. You've hit the nail on the head there; certainly from my perspective, the things I want to know are: where is the issue, and who's responsible for fixing it? With that, people have enough information to either mitigate it or sit and wait it out.

Thanks very much, Kemal. As always, mate, it's been an absolute pleasure.

Kemal: Thank you so very much. It's been my pleasure speaking about these outages, and I hope to see you in one of the upcoming Pulse episodes in the future.

Mike: Absolutely. You're top of my list for guests, mate. Don't worry about that.

That's our show. Don't forget to like, subscribe, and follow us on Twitter. As always, if you have questions, feedback, whether it be good, bad, or ugly, or guests you'd like to see featured on the show, send us a note at internetreport@thousandeyes.com.

That's also where new subscribers can claim a free T-shirt. So, just send us your address and T-shirt size and we'll get it right over to you.

Until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com