Understanding the Microsoft Outage: Why Were Azure, Microsoft Teams, & Outlook Down? | Outage Deep Dive

This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Join our co-hosts Angelique Medina, Head of Internet Intelligence, and Kemal Sanjta, Principal Internet Architect, both from ThousandEyes, as they discuss the January 25, 2023 Microsoft outage.

Angelique: Welcome to the Internet Report, where we uncover what's working and what's breaking on the internet and why. Today, I'm joined by Kemal Sanjta, and we're going to talk about the Microsoft outage that happened on January 25, 2023.

It was a very significant outage. The most severe portion of it lasted about 90 minutes, but we saw residual impact for more than five hours. So we're going to dive into that and first talk about what happened. Kemal, take it away.

Kemal: Thank you, Angelique. This was certainly a very interesting event, during which we observed quite a large-scale impact affecting various Microsoft services.

Now, looking at the event, and based on the information Microsoft has already provided in its limited root cause analysis, or RCA, it looks like at 7:00 AM UTC there was a spike in BGP events that affected them quite significantly.

Looking one data point before that, you can quite clearly see on the left-hand side, where our BGP collectors are, that Microsoft Corporation's autonomous system 8068 is advertising its prefixes to the Microsoft backbone ASN, 8075. And, as indicated in green next to the BGP collectors, you can see that the prefixes are getting advertised and propagated the way they normally are when everything is working fine.

However, if we navigate into the event itself and look at these various internet BGP collectors, you can quite clearly see that they've changed color. All of a sudden they are all red, orange, or yellow, indicating that something quite significant happened. That's also clearly indicated in our interface by the many red striped and solid lines, which mean that prefixes were either installed or withdrawn.

Now, looking at this particular collector, I can hover over it and click "show all" for this monitor, which declutters the view to a certain degree. On the timeline, the selected metric is "path changes," or we can select "updates," and we'll observe a very similar pattern of events. Either way, it's fairly obvious that there were advertisements at that time from Microsoft ASN 8075, their backbone ASN. If I hover over this collector and click to view details of the path changes, we can see that at approximately 7:10 or 7:12 UTC, depending on the collector, Microsoft started advertising these prefixes.

But this time they were advertised with a longer AS path (autonomous system path), going through transit providers instead of Microsoft's direct peering with the companies where our BGP collectors are located. That's quite interesting behavior in itself, because BGP uses what we call the best path selection algorithm to calculate the best path, which is the path that gets installed in the forwarding table. The selection is driven by the BGP attributes advertised along with the prefixes, and one of the first attributes evaluated is the AS path: the shorter it is, the more preferred the route.
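To make that tie-break concrete, here is a minimal sketch in Python of the AS path length comparison. The 8075/8068 ASNs are the Microsoft ASNs discussed above; the collector peer ASN (64500) and transit ASN (64510) are made-up placeholders, and real BGP evaluates several attributes before AS path length.

```python
# Hypothetical AS paths for illustration only.
direct_path  = [64500, 8075, 8068]         # learned via direct peering with the Microsoft backbone
transit_path = [64500, 64510, 8075, 8068]  # same prefix learned via a transit provider

# All else being equal, BGP best path selection prefers the shorter AS path,
# so the direct peering route normally wins over the transit route.
best = min([direct_path, transit_path], key=len)
print(best)  # [64500, 8075, 8068]
```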

With that in mind, installing the longer AS path is quite interesting, and we can see a certain level of volatility in this event. At 7:12 UTC for this particular collector, the path transitioned through two different transit providers before the direct peering was installed again, and we can see that repeating several times. That continued until approximately 7:22 UTC, and it amounted to a lot of routing table churn.

Obviously, these kinds of events are going to have an impact from the customer perspective, and in this particular case we saw that impact in the form of large amounts of packet loss, which we'll touch on a little later.

The other interesting thing about this event is that it affected summary prefixes as well. Summary prefixes are covering prefixes for smaller, more-specific prefixes; in this case we're looking at a /12, and we can see that BGP updates were happening for it as well. Exactly the same thing happened: the prefixes were initially withdrawn from the routing table, then readvertised over transit, ultimately settling back on the path using direct peering, the shortest AS path.
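As a small illustration of the covering relationship, here is a sketch using Python's standard ipaddress module, with hypothetical documentation prefixes rather than the actual Microsoft routes:

```python
import ipaddress

# Hypothetical documentation prefixes, not the actual Microsoft routes.
summary  = ipaddress.ip_network("203.0.0.0/12")    # covering "summary" prefix
specific = ipaddress.ip_network("203.0.113.0/24")  # more-specific prefix inside it

# subnet_of() confirms the /24 is covered by the /12, so churn on the summary
# can affect reachability for everything underneath it when the more-specific
# route isn't present.
print(specific.subnet_of(summary))  # True
```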

But it's quite interesting that the event affected them as well. Now, everything here is happening on /24s, and we focus on /24s because they are the most specific prefixes generally accepted on the internet. We can see that this went on from about 7:10 UTC until 7:22. Then, between 7:22 and approximately 7:40, there was a brief period of stability where things calmed down, but then it restarted.

It continued for some time more, with these prefixes being installed and withdrawn over the transit providers, ultimately settling down at approximately 8:45 UTC. That means the packet loss lasted about 90 minutes, but the residual effects of this change continued for several more hours.

Now, if I click on the path visualization for this event, you can quite clearly see a lot of red on our timeline, and red on the timeline indicates elevated packet loss during the event. If you look a little closer, you'll quickly observe that these elevated amounts of packet loss correlate quite closely with the spikes in BGP advertisements observed from autonomous system 8075. That means a lot of connections were being torn down, people had problems establishing connections to these services, and there was an overall negative experience for Microsoft's large customer base.

Packet loss is never a good sign for a service; even a small amount of packet loss has a pretty significant negative effect on throughput. If I navigate to our HTTP server view, we can quite clearly see a significant dip in a metric that we call availability. Availability in our case breaks down into phases. DNS resolution is essentially the ability to translate a domain name into an IP address. Connect is the first step in establishing any TCP connection, essentially the three-way handshake. Then you can see that some of our agents actually failed to negotiate SSL. Some of the agents had problems in the receive phase, meaning they were not able to receive traffic from Microsoft within the dedicated five-second interval. And HTTP is essentially the response we were getting from the Microsoft HTTP server in this particular case.
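To put a rough number on how loss translates into throughput, here is a minimal sketch of the well-known Mathis et al. approximation for steady-state TCP throughput. The MSS, RTT, and loss figures are hypothetical, and the constant factor (about 1.22) and many real-world details are omitted:

```python
import math

def approx_throughput_mbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Rough Mathis-style estimate: throughput ~ MSS / (RTT * sqrt(loss))."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate)) / 1e6

# Hypothetical numbers: 1460-byte MSS, 50 ms RTT.
for loss in (0.0001, 0.001, 0.01):  # 0.01%, 0.1%, 1% packet loss
    print(f"{loss:.2%} loss -> ~{approx_throughput_mbps(1460, 0.05, loss):.1f} Mbps")
# Even 1% loss drops the estimate from ~23 Mbps to ~2 Mbps on this path.
```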

If I click on the table view, you'll see that some of the services were returning 503s while this was happening, but the majority of the agents executing these tests were timing out in the receive phase, hitting DNS issues, or unable to establish a TCP connection, indicating that this was a pretty large-scale negative event. As you can see, the majority of the impact happened during the first 90 minutes, but residual negative effects continued for hours afterward before ultimately settling down.

Now, Angelique, one of the interesting things we observed over the weekend was that Microsoft published an RCA, or root cause analysis. Could you tell us a little more about that?

Angelique: They published a very interesting RCA. It's preliminary, but it actually has quite a lot of detail, and we'll discuss that. But before we get into it, I just wanted to ask a quick question about the rapid announcement changes we were seeing. We saw a lot of stability issues in terms of the routes installed in routing tables across many different service providers. So why would that cause packet loss at the scale we saw?

Kemal: So essentially what happened is that there was a lot of churn. Every time you re-advertise a prefix, BGP does so by sending an update message, which contains what we call NLRI, or network layer reachability information, tied to the BGP attributes I talked about. Every time a receiving router gets an update, it needs to look into it, figure out whether it's an update worth considering, so to say, and then calculate a new best path based on that information. Sometimes these updates carry withdrawals, sometimes new prefixes, but BGP needs to examine every single one of them, and that's a very CPU-intensive task.
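A very simplified sketch of what a receiving router conceptually does per update might look like the following. Real BGP implementations are far more involved, and the Route/Rib names and the AS-path-only tie-break are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Route:
    prefix: str
    as_path: tuple  # e.g. (8075, 8068)

@dataclass
class Rib:
    candidates: dict = field(default_factory=dict)  # prefix -> {peer: Route}
    best: dict = field(default_factory=dict)        # prefix -> best Route

    def process_update(self, peer, announced, withdrawn):
        # Withdrawn routes are removed for this peer; announcements replace them.
        for prefix in withdrawn:
            self.candidates.get(prefix, {}).pop(peer, None)
        for route in announced:
            self.candidates.setdefault(route.prefix, {})[peer] = route
        # Every prefix touched by the update triggers a best-path recalculation;
        # this is the work that heavy churn multiplies across the whole table.
        for prefix in set(withdrawn) | {r.prefix for r in announced}:
            options = list(self.candidates.get(prefix, {}).values())
            if options:
                self.best[prefix] = min(options, key=lambda r: len(r.as_path))
            else:
                self.best.pop(prefix, None)

rib = Rib()
rib.process_update("peer1", [Route("203.0.113.0/24", (64510, 8075, 8068))], [])
rib.process_update("peer2", [Route("203.0.113.0/24", (8075, 8068))], [])
print(rib.best["203.0.113.0/24"].as_path)  # (8075, 8068): the shorter AS path wins
```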

Now, if you look at the event we just talked about, there was a lot of volatility in how often these new paths through different providers were being installed. As you can imagine, Microsoft is quite popular; services such as Outlook, SharePoint, and various Azure services see a lot of traffic. So every time that happens, a large amount of traffic gets redirected onto a different network path between the source and the target, the target being Microsoft in this case. Then that path gets torn down again, which tears down a lot of TCP connections, and a new path is installed. That, combined with the fact that the services themselves had problems responding to legitimate requests during this time, is reflected in this much packet loss.

The other thing I'd like to touch on is the length of the event. We're seeing about 90 minutes of the core event, plus the residual effects afterward. You have to take into consideration that this was a pretty large-scale issue, and as you can imagine, various network availability and service availability engineering teams were probably paged for impaired services.

Given the scale of the issue and the fact that it was a global problem, we need to be empathetic toward the Microsoft engineering team. This was pretty significant. I still consider myself a network engineer, even though I'm not in the trenches on a daily basis anymore, and unfortunately I've been on the originating side of outages similar to this one. People panic. You don't want to believe that the change you made is the one that caused this, right? So you triple-check that it's not your change, you start troubleshooting, various teams get into troubleshooting mode, and calls get spun up with a call leader where everyone is trying to figure out what's happening. There's a lot of uncertainty in these kinds of events.

So overall, resolving an event like this in about 90 minutes, given everything that was at stake and the need to avoid causing any more harm, is a pretty good response. That's what I'm trying to say.

Angelique: Yeah, well, just on that, in terms of what the cause was: Microsoft stated that the trigger for this issue was essentially an attempt to change an IP address on one of their routers. Now, that seems like a fairly mundane update. So why would something like that have had this effect, where we saw a massive connectivity disruption globally? Can you describe how that would actually work? Why would something seemingly so simple cause such havoc?

Kemal: Yeah, that's actually quite a good question, and we can only speculate here: we know what they were trying to do, but unfortunately we don't know much about the design and architecture of the network. However, thinking about this particular event, there are a couple of theories about what might have happened.

Now, we know it was an IP address change on one of their WAN routers that had a cascading effect, originating a lot of messages that other routers then had to deal with, right? If you think about it, internal BGP sessions are traditionally established in what we call a full mesh: every single BGP-speaking router peers with every other BGP-speaking router in the network topology. That's a full-mesh topology.

However, given Microsoft's scale and the general inflexibility of full-mesh designs, people tend to use what we call route reflectors. You establish BGP sessions toward specific routers in your network that re-advertise prefixes, based on certain criteria, to the rest of the network, so you no longer have to full mesh with everything. Essentially, you decrease the complexity of your network to a certain degree. The risk is that if you lose that particular router and it doesn't have redundancy, which I don't believe was the case for Microsoft here, you might end up with an outage, right?
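To put the scaling argument in numbers, here is a quick sketch comparing iBGP session counts for a full mesh versus a simplified single-cluster route-reflector design. The router count is a hypothetical figure, not Microsoft's actual topology:

```python
def full_mesh_sessions(n_routers: int) -> int:
    # Every router peers with every other router: n * (n - 1) / 2 sessions.
    return n_routers * (n_routers - 1) // 2

def route_reflector_sessions(n_routers: int, n_reflectors: int = 2) -> int:
    # Each client peers only with the reflectors, and the reflectors peer
    # with each other (simplified single-cluster model).
    clients = n_routers - n_reflectors
    return clients * n_reflectors + full_mesh_sessions(n_reflectors)

n = 500  # hypothetical router count
print(full_mesh_sessions(n))        # 124750 sessions
print(route_reflector_sessions(n))  # 997 sessions
```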

Now, just from my experience as a network engineer at large-scale companies, I can tell you I've seen similar issues, where any networking change affecting a really busy route reflector, one receiving thousands or, these days, millions of prefixes from all the transit and peering routers, could cause something like this.

Essentially, what happens is that it relays the update to every other BGP-speaking router, and depending on the topology, that might be hundreds of routers, if not more. All of a sudden, the BGP best path selection algorithm starts working on each of them to identify whether there are new best paths, and that causes a lot of CPU-related work.

Based on Microsoft's preliminary RCA, it looks like essentially something like that happened: the IP address change caused other routers to recompute paths, which means that for a period of time there was potentially some amount of blackholing of traffic. Essentially, either prefixes were withdrawn or routers were trying to send traffic down paths that were no longer in operation.

Now, in general, for a regular enterprise, that shouldn't happen. These changes are quick and usually work fine. However, if you're at the scale of Microsoft or Google or Facebook or another large-scale company receiving millions of routes through who knows how many transit providers and direct peerings, things can and do get more complicated, right? So that's one theory: maybe they were changing an IP address on a really busy route reflector, which had a negative cascading effect on other routers, causing CPU utilization spikes across their infrastructure, with routers trying to send traffic down paths that were no longer functional, ultimately blackholing the traffic. That's one potential explanation.

The other one is that it's 2023 and people are using SDN. It's not just a marketing term any longer; people have developed real solutions and they work well. So one scenario that could have happened is that Microsoft had some kind of controller responsible for installing the proper prefixes, or informing the rest of the network which prefixes should be installed and where. If a change like this effectively knocks out that device, you might end up in a situation where your central intelligence is lost and the network goes into a state of flux, which might explain what we saw.

Now, we don't know which of these theories is what happened, or whether either of them is. But both are plausible. So that's what I think about it.

Angelique: Yeah, I think what was particularly interesting about this incident was that initially it wasn't clear where the problem was. Obviously there were a lot of customers, both enterprise customers and regular users, who might have been trying to connect to various applications and services hosted within Azure and may not have initially understood where the problem was. They might have thought it was an application issue, or maybe a configuration issue with a particular application, for example.

But almost immediately, we saw that this was very clearly network related, specifically related to the announcements being made to Microsoft's peers and the impact that was having. So is there something to be learned from that, in terms of how enterprises and IT operations professionals determine what's happening, especially in the absence of immediate information from a provider who is obviously going to be busy troubleshooting the issue rather than communicating details?

Kemal: Of course, there are many lessons, but before we go there, I want to mention something based on the data we were looking at at ThousandEyes, where we were analyzing, detecting, and reporting on similar issues. It's not just Microsoft; we do this for enterprise customers of all sizes. In this particular case, as I mentioned during the demo portion of this talk, we saw the behavior on the large summary prefixes as well, right? And there's one pattern that takes shape: it looks like at around 7:40 or so, Microsoft tried to roll back the change.

Now, under either of the theories we just talked about, if you roll back the change that caused all of this churn, you're going to cause churn all over again, right? Essentially, you take that router's IP address and change it back. You've been observing the issue for something like 40 minutes, trying to identify what happened. You identify the change, and the natural instinct of engineers, network engineers in this case, and of any business person, is "let's roll it back." But if you roll it back, you re-cause exactly the same effect: all of the routers that were affected get affected again by the IP address rollback, because they need to relearn the prefixes and advertisements go out again. That might explain why we saw this event last as long as it did.

But to your point, one of the lessons everyone should take from this event is that visibility is paramount, something everyone should strive for, regardless of whether you're a service provider like Microsoft in this case or you're consuming an application hosted as SaaS. For example, if you're using SharePoint or Outlook and wondering whether it's your network or your provider's network, with the right visibility it would have been pretty easy to identify where the issue was.

And in 2023, I believe it's kind of tragic if you're learning about these issues from your customers. That's completely unacceptable in this day and age. So regardless of what you're using, you should really strive for proper visibility so that you can make your own decisions about remediation steps.

The other thing is that this underlines the fact that there's no such thing as a mundane change request. This was an IP address change; by any stretch of the imagination, you wouldn't expect an IP address change to cause something like this, right? It's just not something you would envision. As part of change management procedures, engineers are usually asked what steps they will take to limit the blast radius of the change. But how could you envision that an IP address change would have this negative effect? It's super hard, right?

This also speaks to operational excellence: you really need to take all possible scenarios into consideration, limit the blast radius, and maybe test the change out in some kind of lab before rolling it out. In this particular case, I'm pretty sure that would be really hard, because an issue like this likely only shows up at Microsoft's scale.

But for many companies, these kinds of changes can be caught. Many changes these days are executed by automation, so you should really have some sort of automation or audit that actually measures the impact of your change. It should be a blocking audit, as in the sketch below: you make the change, you observe the impact, and if the impact exceeds a certain threshold, maybe a percentage of packet loss or a spike in latency, the automation uses that blocking audit to either roll the change back or stop you from executing further until things stabilize.
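A minimal sketch of such a blocking audit might look like this. The thresholds, the probe_metrics() helper, and the callbacks are all hypothetical placeholders, not any particular vendor's API:

```python
import time

# Hypothetical thresholds for illustration.
LOSS_THRESHOLD_PCT = 1.0
LATENCY_SPIKE_MS = 50.0

def probe_metrics() -> dict:
    # Placeholder: in practice this would query your monitoring platform.
    return {"packet_loss_pct": 0.2, "latency_delta_ms": 3.0}

def blocking_audit(apply_change, roll_back, checks: int = 5, interval_s: int = 60) -> bool:
    """Apply a change, then watch key metrics; roll back automatically on regression."""
    apply_change()
    for _ in range(checks):
        time.sleep(interval_s)
        m = probe_metrics()
        if m["packet_loss_pct"] > LOSS_THRESHOLD_PCT or m["latency_delta_ms"] > LATENCY_SPIKE_MS:
            roll_back()
            return False  # block further rollout until things stabilize
    return True  # change passed the audit; safe to continue the rollout
```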

So, you know, I'm speaking here strictly as a network engineer. I'm pretty sure that from the business perspective there are many other lessons that people can learn, but from the network engineering trenches, these are some of the things that could be done.

Angelique: Yeah, absolutely. Those are great ones. Is there anything else we should cover about this event before we close out for today?

Kemal: That would be everything from my side. Thank you so much, Angelique. It's been an absolute pleasure talking through issues like this one, and I hope one of the lessons our audience takes away is the importance of visibility.

Angelique: Yeah, absolutely. Thank you. This was really informative, breaking down the outage. Thank you again for joining us. Please subscribe to and like our podcast; if you do, we'll send you a free T-shirt. Just send us an email at internetreport@thousandeyes.com with your T-shirt size and address, and we'll get that right over to you. With that, thank you, and we'll see you next time.

We want to hear from you! Email us at internetreport@thousandeyes.com