The Microsoft Outlook Outage, Explained | Outage Deep Dive

Live from #CiscoLiveEMEA, we discuss the Feb. 7 Microsoft Outlook outage to understand how the event unfolded, why it may have played out the way it did, and what you can learn from this outage event.

INTRO
Kemal Sanjta: Hi, my name is Kemal Sanjta and I'm Principal Internet Analyst at ThousandEyes. I'm here at Cisco Live in Amsterdam, joined by my good colleague Mike Hicks, Principal Solution Analyst.

We're here today to record a special edition of The Internet Report, covering an unfortunate event that affected Microsoft, in which one of their services, Outlook on the web (OWA), was struck by an outage. This comes after the significant outage that various Microsoft services experienced on the 25th of January, starting at 7:10 AM, which we reported on extensively.

Mike, could you please let us know what happened this morning?

Mike Hicks: Yeah, absolutely, Kemal. It's good to be here with you, actually. We're normally just doing these things separated by 14,000 kilometers, so it's actually great to be here with you.

Kemal: Exactly, exactly. Good to see you in person, man.

Mike: Yeah, absolutely. Even if it is under these circumstances where we have an outage.

MICROSOFT OUTLOOK OUTAGE
Mike: As you say, this was an outage we saw this morning on Microsoft's side. It seemed to predominantly impact Outlook 365, and looking through some of our tests, it looked like it specifically impacted OWA, as you said, the online version that you access through the browser. We first saw the outage around 3:55 UTC, when it started to impact the service, and we could see it had a global impact. If we then cross-correlate with what was happening on the network side of things, we can see that at that point there was no significant network outage.

Kemal: Unlike the previous outage that was predominantly network related, this one is an application outage, right?

Mike: Yeah, exactly right. Flip the two around, but it's the same outcome: the users cannot access the application. In the January 25 case it was a network scenario, which, as you said, we reported on comprehensively, and in this case it was sitting at the application level.

We were able to see that quickly because we look at the network and can say, "Okay, there are no outstanding network issues," or no significant network issues, which is probably a better way of putting it.
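
To make that cross-check concrete, here is a minimal Python sketch of the idea: compare application-level failure rates against network path loss for the same window and label which layer looks degraded. The field names, thresholds, and sample values are hypothetical assumptions for illustration, not ThousandEyes data or API calls.

```python
# Minimal sketch: decide whether an outage window looks application-layer or
# network-layer, given per-minute samples exported from monitoring tests.
# Field names, thresholds, and values are illustrative assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    minute: str             # e.g. "03:55"
    http_error_rate: float  # fraction of HTTP tests failing (5xx or timeout)
    packet_loss: float      # fraction of packet loss on the network path

def classify_window(samples: List[Sample],
                    app_threshold: float = 0.2,
                    net_threshold: float = 0.05) -> str:
    """Label the window based on which layer shows degradation."""
    app_bad = any(s.http_error_rate >= app_threshold for s in samples)
    net_bad = any(s.packet_loss >= net_threshold for s in samples)
    if app_bad and not net_bad:
        return "application issue: servers answer with errors while the path is clean"
    if app_bad and net_bad:
        return "correlated network and application impact"
    if net_bad:
        return "network issue"
    return "no significant degradation"

window = [
    Sample("03:50", 0.02, 0.00),
    Sample("03:55", 0.45, 0.01),  # HTTP errors spike, loss stays flat
    Sample("04:00", 0.60, 0.00),
]
print(classify_window(window))  # -> application issue: servers answer with errors...
```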

Kemal: Got it. One of the questions that end users might have is: did Microsoft report on the issue itself?

Mike: Yeah. Microsoft actually put something out on Twitter. They also report things through the admin page, but that's only available to the admin type of people. The first tweet we saw was about an hour later, around 4:55 UTC, where they said they'd identified that they'd made a change. They didn't go into specifics about what the change was, but they were looking at whether that change could have affected Office 365 applications.

Kemal: Got it. One of the things that was quite interesting when I was checking the status page, or the Twitter feed for that matter, was the fact that it looks like they resolved the issue by the unconventional means of restarting the service.

Mike: "Unconventional" is an odd way to describe restarting the services. They obviously made a change; we don't know what that change was. Yeah, to quote The IT Crowd, "Have you tried turning it off and on again?" This isn't the first time, and we see this quite often: they get to a point where something like a queue gets overwhelmed, and the quickest way out is to restart the infrastructure. Rather than working through it piece by piece, they just restart everything, and that's what they said they started to do.

Kemal: Within ThousandEyes, what was the first thing that actually showed us we were dealing with a significant outage?

Mike: Yeah, it's a really good point. I've said this on our podcast before: I'm a simple man, I love my patterns, and we started to see this staircase coming down, this downward trend. During that time, if we're looking at it through Internet Insights, what we actually see is a number of servers impacted. These are the servers where we're seeing tests fail, in this case reporting back a service unavailable like a 500, or timing out. What we then start to see is the number of impacted servers reduce, and that gives us the downward staircase where the service actually starts to recover.
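
A rough way to picture that staircase is to count, per time bucket, how many distinct monitored servers are failing their tests. The sketch below does exactly that; the server names, buckets, and results are made up for illustration.

```python
# Minimal sketch: count unique failing servers per time bucket to reproduce the
# "downward staircase" described above. All data here is illustrative.

from collections import defaultdict

# (time bucket, server, test result) observations
observations = [
    ("03:55", "owa-eu-1", "500"), ("03:55", "owa-us-3", "timeout"),
    ("04:10", "owa-eu-1", "500"), ("04:10", "owa-us-3", "200"),
    ("04:25", "owa-eu-1", "200"), ("04:25", "owa-us-3", "200"),
]

FAILURES = {"500", "503", "timeout"}

failing_servers = defaultdict(set)
for bucket, server, result in observations:
    if result in FAILURES:
        failing_servers[bucket].add(server)

# A shrinking count per bucket is the stepped recovery.
for bucket in sorted({b for b, _, _ in observations}):
    print(bucket, len(failing_servers[bucket]))  # 03:55 -> 2, 04:10 -> 1, 04:25 -> 0
```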

Kemal: So potentially, as they were doing targeted restarts, the services started recovering in a stair-like fashion, right?

Mike: Yeah, 100%. We've seen them do this before; we reported on it in some of the Pulse Update blogs. They typically do this quite responsibly, where they go through and you can see, "Right, at these times we're impacting Africa and their business day, or these other regions," and then the recovery almost follows the sun.

Kemal: Got it. Was this a global outage, or was it limited to a specific geography?

Mike: Yeah, it appeared to be a global outage. We certainly saw it spread across multiple regions. In terms of the noise and the sentiment coming back from the internet, it really depended on who was online at that moment in time, because it specifically seemed to impact the online version of Office 365, so we're looking at that subset of users.

Kemal: The OWA.

Mike: Yeah, the OWA.

Kemal: Got it. It was quite interesting what we saw in the waterfall. Some of our tests do support waterfalls, and if you look at them, one of the first things you notice is that before the issue itself, the redirect took maybe 375 milliseconds. Once the issue started, you can see very long times for the redirect to complete, right?

Mike: Yeah, absolutely, exactly right. The first thing that happens when we hit the page is we go to OWA, and it effectively takes you to your instance, your company's side. That's a 302 redirect. As you saw, it averaged out to around 332 to 345 milliseconds, and that redirect was the part that was actually delayed. When we saw the timeouts occurring and all the errors happening, that's the point where we actually saw the service unavailable, because the redirect was the thing taking the time, so we couldn't get to the service.
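
As a rough illustration of what that waterfall step measures, the Python sketch below times the initial request to the OWA entry point without following the redirect and classifies the outcome. The URL, timeout, and use of the requests library are assumptions for the example, not how the ThousandEyes tests are implemented.

```python
# Minimal sketch: time the initial OWA request and classify the result.
# URL and timeout are illustrative; this is not the ThousandEyes test logic.

import requests

URL = "https://outlook.office.com/owa/"  # entry point that normally 302-redirects

def check_redirect(url: str, timeout_s: float = 6.0) -> str:
    try:
        resp = requests.get(url, allow_redirects=False, timeout=timeout_s)
    except requests.Timeout:
        return "timed out waiting for the redirect (what the waterfall showed during the outage)"
    except requests.RequestException as exc:
        return f"request failed at the network level: {exc}"
    elapsed_ms = resp.elapsed.total_seconds() * 1000
    if resp.status_code in (301, 302, 307):
        return f"redirect returned in {elapsed_ms:.0f} ms (healthy)"
    if 500 <= resp.status_code < 600:
        return f"server answered {resp.status_code} after {elapsed_ms:.0f} ms (application error)"
    return f"unexpected status {resp.status_code} after {elapsed_ms:.0f} ms"

print(check_redirect(URL))
```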

Kemal: The previous outage, as we already discussed, was predominantly network based. We saw BGP events happening, followed by significant packet loss, which directly resulted in availability drops for various Microsoft services. This time around, it looks like only the application was affected; there was no packet loss, and everything was fine from a networking perspective, which is pretty good. But it's quite interesting to see that these application-level outages ultimately have exactly the same negative effect as the network outage from the 25th.

Mike: Exactly, 100%.

Kemal: Unfortunately, it happened to a service that's very widely used, Outlook. Some of the customers we spoke to about the previous outage were saying they had 50,000 emails sitting in their email queues at an organizational level. This time around, probably the same thing happened, right?

Mike: Yeah, it's difficult to say from our side, but potentially, yes, we could see that as things started to reach the backend. It's interesting, you make a good point there. The impact to the user is exactly the same: I can't use this service. The difference this time was that we were looking at effectively one service, or at least it appeared to be one service.

The other thing that was interesting to us, and you're right, is that there was no significant network outage. We're talking about the internet, so we're going to see some blips in latency and those types of things, but there was nothing that could be directly linked to, or coincided specifically with, that outage window. And this is why it's important to look at the application in correlation with the network: we saw responses coming back. We saw the 302 telling us we're doing a redirect, and we saw 500 service unavailable in some instances. In other instances it was timing out, but we did see those responses, which indicates we were getting answers back from the servers.

Kemal: Got it.

Mike: Now, we saw those in the Microsoft outage on the 25th as well, but there they were also correlated with the network scenario.

Kemal: Yes.

Mike: You've got to put the whole picture together to be able to say this is what we believe is going on there.

Kemal: For how long did the outage last?

Mike: The main disruption that we saw, where the outage actually occurred, that light-switch on/off moment followed by the step down, lasted around 1 hour and 39 minutes. That's a lengthy period; it's more than a blip on the radar.

Kemal: Yes. It's actually quite interesting: the previous outage, at least the core part of it, lasted exactly the same amount of time. I was wondering, are we seeing a repeat of this event again? Turns out we are not, but the length of the outage was very similar.

Mike: Yeah, I think I said to you before when we were looking at this, it was like, "Hey, that looks exactly the same pattern."

Kemal: Exactly. Super interesting.

Mike: I had to check the date to make sure we were looking at a different one. This is where the correlation between the network and the application becomes important; looking at the details really tells you specifically where the fault lies.

Kemal: Agreed. Thinking about this event as well as the previous one, it just underlines the importance of having proper visibility into your applications. Could you elaborate on that point?

Mike: Yeah, absolutely. We go big on visibility across the network, but I've also got to see the entire service delivery chain. One of the biggest things we want to be able to see is who is responsible for the issue, so I can take steps to work around it. In this case, I couldn't get to that server. We quickly saw the issue lay within Microsoft, not within my local ISP or my local connectivity, so I can take mitigating action: use Gmail for this message, or go to an instant messaging system like Slack or Teams to communicate with the organization. Or I could just sit and wait it out, as we said before, and see what happens. It's really critical to have that correlated view of the service delivery chain, understanding who's talking to whom and what the dependencies are.
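
A very reduced version of that "is it me or the provider?" triage can be sketched as a layered check: resolve the name, open a TCP connection, then make an HTTP request, and see which step fails. The hostname and timeouts below are illustrative assumptions; a real check would also probe a known-good site to rule out local connectivity.

```python
# Minimal sketch: walk the delivery chain (DNS -> TCP -> HTTP) to localize a fault.
# Hostname and timeouts are illustrative assumptions.

import socket
import urllib.error
import urllib.request

HOST = "outlook.office.com"

def localize_fault(host: str) -> str:
    # 1. DNS: can we resolve the name at all?
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return "DNS resolution failed: likely a local resolver or connectivity issue"
    # 2. Transport: can we open a TCP connection to port 443?
    try:
        with socket.create_connection((ip, 443), timeout=5):
            pass
    except OSError:
        return "TCP connect failed: a network path problem (local or provider side)"
    # 3. Application: does the service answer sensibly over HTTPS?
    try:
        urllib.request.urlopen(f"https://{host}/", timeout=10)
    except urllib.error.HTTPError as err:
        return f"reachable but returning HTTP {err.code}: provider-side application issue"
    except urllib.error.URLError:
        return "request timed out or failed at the application layer"
    return "service looks healthy from here"

print(localize_fault(HOST))
```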

Kemal: The other thing is that we're speaking about Microsoft here, and Microsoft is one of these hyperscaler companies, a super large company with heavy redundancy built in. If you think about it, if you're on the consumer side and you're responsible for application health and network visibility and things like that, without proper visibility you might ask yourself whether it was your network that was affected rather than theirs, right? It would be a completely logical thing to do. They seem to have everything figured out, while your network is much smaller and depends on different service providers. So it's natural to assume that the issue might be on your side, while in this case we could quite clearly and very quickly observe that the issue was with the provider itself.

Mike: Exactly. It's not you, it's me.

Kemal: Exactly.

UNDER THE HOOD
Mike: Thanks, everyone, for listening to this episode of The Internet Report. This is Mike just popping back in after our initial conversations. After we chatted, Kemal and I thought it’d also be great to walk you through the outage on the ThousandEyes platform in a little bit more depth. So without further ado, let’s go under the hood.

So, as we were saying, we actually start to see the outage occur at 3:55 UTC, and then what we see is this rapid increase. If we're looking at the application itself, it's impacting Microsoft Office 365, and no other service, but we can see global locations that are impacted. Then it escalates quite quickly: we go from around a hundred servers being impacted up to about 459.

Going in there again, we're looking at Microsoft Office 365 only. Then, as a quick correlation to see whether we're looking at the application itself or the network, we can look at the network side, and this is important because we want to see the correlation between the two: there are no specific network outages. There are some outages occurring out there, but they don't correlate with this event.

So let's go back to the application perspective, and what we can start to see there are the types of errors we're seeing. We can see some timed-out errors, and if we drill in, we see a series of 500 errors occurring as well.

We see 5xxs and timeouts and some 400s. All of these are indicative of some sort of response coming back. The fact that we're getting responses back, combined with the fact that we don't have anything correlated on the network side, is what differentiates a network issue from an application issue.

The other thing that's really interesting is that we start to see this stepped recovery, steadily decreasing. The main bulk of the outage lasted around an hour and 39 minutes, but we saw a residual effect continuing as they went through the process of restoring the services themselves.

If we want to look at what that looked like from a user perspective, let's take a look here. We're looking specifically at Outlook Office 365, and again, we're talking about OWA.
If we look at what was happening beforehand, using the waterfall charts to see what happens from an application perspective, we can see the OWA redirect: a 302 comes back telling us we need a redirect. This is where we make a connection and then get moved over to our own instance of the service. If we look at the timing, and this is why it's significant, it's sitting at around 78 milliseconds in this particular instance. When we get into the outage itself, we can see that in this particular case the request is eventually canceled completely, but first there's this long wait while the redirect is being attempted. It goes over 6,000 milliseconds and simply times out. That's when we can really start to see what's happening.

So again, taking us back to before the outage, we can see the redirect complete. This is indicative: we actually make a connection to the system, but during the outage we then hit that wait time, which just times out, so we can't complete the redirect and get through.

So again, it's difficult to say specifically when they started doing the restarts. Going by when they were tweeting things out, that was about an hour after we first observed issues occurring. But once it started to recover, we could really see it in the number of impacted servers dropping. By around 4:25 UTC we're down to 145, from 457 at its peak. After that peak, the number of servers impacted keeps decreasing, which means fewer people are impacted from a global perspective and the service starts to recover.
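
If you had that impacted-server series as data, pulling out the start, peak, and duration is straightforward. The sketch below uses invented numbers loosely shaped like the figures above (a peak in the mid-400s, 145 by 4:25 UTC); the threshold and five-minute buckets are assumptions, not real measurements.

```python
# Minimal sketch: derive outage start, peak, and duration from per-interval
# counts of impacted servers. All numbers are illustrative.

from datetime import datetime, timedelta

series = [  # (UTC time, impacted server count) per 5-minute bucket
    ("03:50", 3), ("03:55", 112), ("04:00", 459), ("04:05", 440),
    ("04:10", 300), ("04:15", 220), ("04:20", 180), ("04:25", 145),
    ("05:25", 60), ("05:30", 20), ("05:35", 4),
]

THRESHOLD = 50  # above this count we consider the service in outage

impacted = [(datetime.strptime(t, "%H:%M"), n) for t, n in series if n >= THRESHOLD]
start, end = impacted[0][0], impacted[-1][0]
peak_time, peak_count = max(impacted, key=lambda pair: pair[1])

duration = end - start + timedelta(minutes=5)  # each sample covers one 5-minute bucket
print(f"start {start:%H:%M} UTC, peak {peak_count} servers at {peak_time:%H:%M}, duration {duration}")
```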

Kemal: Mike, thank you so much for this special episode of The Internet Report. It's awesome to be here in Amsterdam with you in person.

For everyone watching, thank you so much for being with us. Don't forget to subscribe, and we're going to send you a really nice T-shirt. Thank you so much.

We want to hear from you! Email us at internetreport@thousandeyes.com