A Trio of Similar Incidents: Microsoft, Cloudflare, & Slack Outages | Pulse Update

Mike Hicks: Hello, this is The Internet Report’s biweekly Pulse Update, where we keep our finger on the pulse of how the internet is holding up week over week, exploring the latest outage numbers, highlighting a few interesting outage trends and traits, and looking at the general health of the internet.

Today we'll be discussing insights from a recent trio of change and rollback incidents at Microsoft, Cloudflare, and Slack, along with other outage news, including the Comcast outage that impacted some Philadelphia neighborhoods on Super Bowl Sunday.

But before we start, in terms of housekeeping, we'd love for you to hit like and subscribe, so you too can keep your finger on the pulse of the internet each week. Please keep your feedback coming. It really is appreciated, and it helps us shape the show. Reach out to us at any time at internetreport@thousandeyes.com, and we'll do our best to address your questions in future episodes.

All right, so let's take a look at the numbers. From the start of the year we actually saw outages increase, and we put that down to people coming back on board and maintenance work being done, which we could see when we looked back at the time of day these outages occurred. But over this past couple of weeks, that trend has started to reverse. Looking at global outages first, they initially dropped from 373 to 331, an 11% decrease compared to January 23-29. And then the downward trend continued, with global outages dropping again from 331 to 301, a 9% decrease compared to the previous week.

When we look at those numbers from a domestic perspective, the U.S. numbers initially rose, going up from 102 to 117, a 15% increase. Then the following week they fell in line with the global trend, dropping from 117 to 73, a 38% decrease compared to the previous week.

Now, what makes this interesting is what that initial rise meant. The U.S. numbers rose in that first week while the global numbers were decreasing compared to the previous week. Combining this fortnight together, U.S.-centric outages accounted for 30% of all observed outages, which is larger than what we observed between January 16-22 and January 23-29, where they accounted for only 26% of observed outages.
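To make those percentages concrete, here's a minimal sketch, using only the figures quoted above, of how the week-over-week changes and the U.S. share of global outages work out. The week labels after January 23-29 are placeholders I've assumed for the two subsequent weeks discussed in this episode.

```python
# Outage counts quoted in this episode. "Jan 23-29" is the comparison week; the next
# two labels are placeholders for the two subsequent weeks discussed above.
global_outages = {"Jan 23-29": 373, "week +1": 331, "week +2": 301}
us_outages     = {"Jan 23-29": 102, "week +1": 117, "week +2": 73}

def pct_change(prev: int, curr: int) -> int:
    """Week-over-week percentage change, rounded to the nearest whole percent."""
    return round((curr - prev) / prev * 100)

weeks = list(global_outages)
for prev, curr in zip(weeks, weeks[1:]):
    print(f"{curr}: global {pct_change(global_outages[prev], global_outages[curr])}%, "
          f"U.S. {pct_change(us_outages[prev], us_outages[curr])}%")
# -> week +1: global -11%, U.S. +15%; week +2: global -9%, U.S. -38%

# U.S.-centric share of all observed outages over the most recent fortnight.
fortnight = weeks[1:]
us_share = sum(us_outages[w] for w in fortnight) / sum(global_outages[w] for w in fortnight)
print(f"U.S. share of global outages: {us_share:.0%}")  # ~30%
```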

So what can we actually make of that? Typically, when we look at this percentage and go back over it, and we've written about this quite a lot, U.S.-centric infrastructure from a provider point of view has this global reach because of the way it's set up and architected, so you sometimes get a sort of domino effect.

But interestingly enough, when we saw that increase from the U.S., it didn't necessarily have the same impact on the global numbers. What I mean by that is that U.S.-centric outages now accounted for more of the total, rising from 26% to 30% of observed outages. Normally, when an outage occurs in the U.S., that domino effect kicks in because of how the infrastructure is structured and interconnected, whether we're talking about maintenance or some other kind of outage. The interesting part is that in that first week of February, we saw this increase in the domestic numbers, but it didn't have a dramatic impact on the global numbers.

What we can take from that is that these outages were more controlled and more localized, so the blast radius, as it were, was smaller and more contained. It'll be interesting to see how this develops over the year.

Typically, when I look back and average it out over the year, U.S.-centric outages historically sit at around 40% of global outages. So the ratio tends to track low at the start of the year and then come up, and we'd like to see how that goes as we move forward.

Okay, so that's probably enough time on the numbers for this episode. Left to my own devices, I'd probably have talked about them the whole episode, but as we've got quite a lot of outages to discuss this week, I think it's about time we take a look under the hood.

All right, so we ended our last installment of the Pulse Update with a look at the Microsoft outage, the network configuration change on January 25 that caused application reachability issues globally. If you haven't seen our analysis yet, I strongly encourage you to go and check out the blog and the podcast, where we go into some detail.

But the reason I want to bring it up now is that it wasn't the only change-and-rollback incident to occur in that timeframe. On January 24, less than 24 hours before the Microsoft incident, starting around 4:55 PM UTC, several Cloudflare services experienced about two hours of downtime.

This was down to a code deployment that accidentally overwrote some service token metadata before Cloudflare recognized the problem and rolled it back. Before we look into this outage and how it manifested itself, I want to take a minute to describe what service tokens are and what they're used for.

At a very high level, token-based authentication is a process that allows users to verify their identity once and receive a unique access token in return. For the life of that token, the user can access the website or app the token was issued for without having to keep re-entering or copy-pasting their credentials every time they return to the same page. From a zero trust perspective, the whole idea is to reduce complexity so that security doesn't impact performance: any resource protected with the same token remains accessible. So think about authentication tokens like a stamped ticket. The user retains access as long as that token remains valid, but once the user logs out or quits the app, the token is invalidated, and at that point they need to re-authenticate. In terms of what was happening underneath, it isn't overly exciting to view.
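To ground the stamped-ticket analogy, here's a minimal, hypothetical sketch of token issuance and validation. This is not Cloudflare's implementation, just an illustration of the general pattern; the function names and in-memory store are made up.

```python
import secrets
import time

# Hypothetical in-memory token store: token -> (user, expiry timestamp).
_tokens: dict[str, tuple[str, float]] = {}

def issue_token(user: str, ttl_seconds: int = 3600) -> str:
    """Verify the user once (credentials check omitted) and hand back a 'stamped ticket'."""
    token = secrets.token_urlsafe(32)
    _tokens[token] = (user, time.time() + ttl_seconds)
    return token

def validate_token(token: str) -> str | None:
    """Return the user if the ticket is still valid, otherwise None (re-authentication needed)."""
    entry = _tokens.get(token)
    if entry is None:
        return None
    user, expiry = entry
    if time.time() > expiry:
        del _tokens[token]          # expired: the stamp no longer counts
        return None
    return user

def logout(token: str) -> None:
    """Invalidate the ticket explicitly, e.g. when the user logs out or quits the app."""
    _tokens.pop(token, None)

# Usage: every request to a protected resource presents the token instead of credentials.
t = issue_token("alice")
assert validate_token(t) == "alice"
logout(t)
assert validate_token(t) is None    # the user must now re-authenticate
```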

But the point I want to show is that Internet Insights picked this up, so we can see it. Around 16:55 UTC we saw the issue and how it manifested itself: agents started trying to make connections, and with the view sorted on Cloudflare as the server network on the right-hand side, those connections were coming back as service unavailable and service timeout errors, meaning we couldn't actually get through and make connections to the system.

There's an important point I want to make here: we talk about the service delivery chain a lot, this concept of all the dependencies, everything coming into play. If I fail at that very first step, trying to authenticate, trying to validate my service token as I come in, that's where the system breaks down for me. I'm not going to be able to get access to it. So irrespective of any network or performance issue further along, I simply can't get onto the system itself. Okay, so what actually happened here?

Now, Cloudflare is to be congratulated here, because they issued a really nice, detailed post-incident blog, and we'll put the link in the show notes. What they talk about there is a change that overwrote the service token metadata. We've talked about service tokens being like a stamped ticket that lets me in and out; well, once that metadata was overwritten, those tokens couldn't be used and connections couldn't be re-established. Essentially, the tokens were invalidated. To recover, they had to manually restore the tokens and the data that had been overwritten.

Now, I said up front that two services, or two areas, were impacted. For the tokens being used for Cloudflare's own services, they were able to manually restore them, but for the ones connected to external systems, customers using Cloudflare as their service token authority, they had to restore from backups and roll the data back. But like I said, they recognized very quickly what was going on, identified the change, and backed it out. It was the restoration process that took them around two hours, but they identified what had happened very quickly, and that visibility made the difference.

And again, the point I want to reiterate is the service delivery chain: every part of that chain is required to perform seamlessly for you to have an optimal digital experience. In this case, I might be accessing the Cloudflare service itself, or I might be using that token for a third-party system, for application data moving between an external data center and a cloud provider. Either way, I'm failing at that first stage, I can't get in, and therefore I'm impacted.

So moving on, let's take a look at another configuration misstep, this time at Slack. We're talking about this trio in no particular order; it's just the way they came out.

On January 25, between 3:20 PM and 3:46 PM UTC, which would've been around 7:20 AM Pacific, Slack users were unable to open messages and threads. This was reported globally; some people couldn't preview files, and some couldn't send messages. It was a relatively short period of time, 26 minutes, but it did have a global impact.

Users could have been alerted to the issue because they were sent email notifications about new messages that hadn't shown up in the application itself. Slack identified the issue and traced it to a configuration change, which they said was an internal change that impacted usability. As soon as they identified it, they started reverting it immediately, and once the change was rolled back, the issue was resolved for users.

Now, because of how we saw this manifest in the application itself, we assume it was something to do with the application configuration. That could have been internal systems, such as internal authentication, a call to a particular database, a refresh of some kind, something in those backend systems. Network connectivity at the time was all good, so if I were looking at this purely from a network perspective, there'd be nothing to see here and we could move on.

But because the application itself was misbehaving, you'd need to test something very specific within the application to actually see it, because only parts of the application weren't functioning. I could load the screen, and I could even work around it, as I said, by resetting or refreshing my Slack instance on the desktop to get my content back. So for all intents and purposes, it's working, and you wouldn't necessarily have seen an alert fire.

But if you had a specific test running that went through that process, then it would've come back with some sort of status code or excessive wait time that you'd have been able to track down. So really, it all comes back to getting ahead of changes. There are processes in place, but how would I apply this to my own environment? If I'm making a change to a system in my own environment and I have the ability to build a transaction test, I can verify the various steps of that process. And if I can tie that test to the exact moment I press go on my configuration change, because remember we're dealing in an agile world where everybody's making changes all the time, I can have instant recognition that something has failed, because I'm testing the very functionality I've just changed.
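As a rough illustration of that idea, here's a minimal synthetic transaction test sketch. The URLs, step names, and time budgets are hypothetical; the point is simply to exercise the user-facing workflow right after a change and fail loudly on bad status codes or excessive wait times.

```python
import time
import requests

# Hypothetical workflow steps for the service we just changed: (name, URL, acceptable seconds).
STEPS = [
    ("load app shell",        "https://app.example.com/",              2.0),
    ("open a channel",        "https://app.example.com/api/channel/1", 1.5),
    ("fetch recent messages", "https://app.example.com/api/messages",  1.5),
]

def run_transaction_test() -> bool:
    """Walk the workflow end to end; return False on any error status or slow step."""
    ok = True
    for name, url, budget in STEPS:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=10)
            elapsed = time.monotonic() - start
            if resp.status_code >= 400:
                print(f"FAIL {name}: HTTP {resp.status_code}")
                ok = False
            elif elapsed > budget:
                print(f"WARN {name}: took {elapsed:.2f}s (budget {budget}s)")
                ok = False
            else:
                print(f"OK   {name}: {elapsed:.2f}s")
        except requests.RequestException as exc:
            print(f"FAIL {name}: {exc}")
            ok = False
    return ok

if __name__ == "__main__":
    # Run this immediately after pressing go on the change; a failure is the cue to roll back.
    raise SystemExit(0 if run_transaction_test() else 1)
```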

So we're starting to get into this new paradigm, or hoping to move into it, where as developers build out code and functional tests, this kind of test can sit alongside them. When you press go, as part of your own DevOps QA process before you push, you can validate whether your change is going to cause issues and get instant feedback. Because in some of these incidents we're talking two hours, in others 26 minutes; there's always that lag. And that's not to say that when you make a change you can't roll it back, but the effects of rolling back can take time to work through.

We saw that with the Microsoft outage, for example: they identified the change and reverted it, but because of the nature of the issue, the effect of the rollback had to propagate through the whole process again, which meant an elongated duration of impact.

Continuing on that theme, on February 7 Microsoft experienced a second significant outage within that two-week period, the first being the January 25 outage, which we've documented and also covered in an episode of The Internet Report podcast. I just want to take you through the process here, because if you'd looked at the January 25 outage and then looked at this one, you'd be forgiven for thinking you were looking at the same outage. The pattern looked very similar, but these were two completely different outages.

This wasn't a repeat of the same process. Starting around 3:55 on the seventh is where we start to see it. Interestingly, I want to highlight that it was a global impact: we could see the issue spreading, with around a hundred servers observed as impacted at that moment in time. The other thing to be mindful of here is that we're looking specifically at a series of Outlook-related services, so Microsoft Office 365, Microsoft Online, and so on. I do start to see impact in terms of agent locations, but predominantly in North America, so that's where it appears to have started as we get into the height of it.

As we move into the second period, we start to see the number of services impacted grow, the footprint grow as it were. The number of servers impacted increases before it starts to come down. Importantly, and the reason I mention it, we're still predominantly North America-focused, but as the numbers come down, the servers and agent locations impacted spread across other regions. So my global footprint was there from the start, but it almost seems to be impacting more areas globally later on than when it first appeared. Again, it manifests itself specifically in Microsoft Outlook and associated services, a very specific application, which is different from the previous outage that took out a whole range of services. This one was confined to the application.

And again, if I go into the network view, there are some network outages out there, but nothing associated specifically with Microsoft themselves. So we can say we're looking at an application issue, and indicative of that is the fact that we started seeing these service timeouts. We're actually seeing responses coming back, which means connectivity into the system itself is there.

Now if we go into the application view, and here I'm just looking at one instance because I want to see what happened, the outage manifested itself in the user's ability to send, receive, and search for emails, with connections timing out and a number of HTTP 500 and service unavailable messages, but nothing pointing to the network.

If I look at this from a transaction perspective as well, at a page load, before the outage I can see everything: I go to my Outlook page, I get a 302 redirect, I follow it, and everything looks good. But if I go into the next interval, once the incident starts, I still do my redirect, but now I can't find a server. I can't get to it; I get an excessive timeout, or in some cases a service unavailable or 500 message coming back at that point.
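Here's a rough sketch of that kind of page-load check. Treat the URL as a placeholder; the snippet follows any redirect and distinguishes a healthy load from a timeout or a 5xx response, the two symptoms described above.

```python
import requests

def check_page_load(url: str, timeout_s: float = 15.0) -> str:
    """Classify a page load as healthy, a server-side error, or a timeout/unreachable server."""
    try:
        resp = requests.get(url, allow_redirects=True, timeout=timeout_s)
    except requests.Timeout:
        return "excessive wait: request timed out"
    except requests.ConnectionError:
        return "can't find/reach a server"
    # resp.history holds any redirects (e.g. a 302) that were followed on the way.
    hops = " -> ".join(str(r.status_code) for r in resp.history + [resp])
    if resp.status_code >= 500:
        return f"server-side error ({hops})"
    return f"healthy ({hops})"

# Usage with a placeholder target; swap in the page you actually want to watch.
print(check_page_load("https://outlook.office.com/"))
```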

Keeping with our theme of change and rollback, Microsoft confirmed that a change to some of their Outlook systems was a major contributor to the outage, and to resolve it they undertook targeted restarts of parts of their infrastructure to restore connectivity for users. That's where we started to see the service recover, but the recovery wasn't quite as dramatic or as quick as what we saw when they recovered from the BGP issue on January 25. And we actually saw some of those impact numbers increase again at points, because of this (indistinct) knock-on impact.

So when they were restarting the service, there may have been a service in a different country that wasn't itself impacted, but because of the way things were connected, the restart caused that system, that infrastructure, to be impacted as things came back online.

I know we really just scratched the surface there. As I said, if you'd like a more detailed discussion, please check out the Internet Report podcast. We'll put the details of the link in the show notes below.

Okay, before we go, I just want to touch on a couple of outages I think are worth a mention. The first of these impacted the online payment system Square. Full disclosure, the reason I'm bringing this one up is that it was one of those time-of-day things: due to timing, it seemed to have a larger impact on the Oceania region. It was a global outage, but the impact was felt more in Oceania because it started around 7:54 AM Australian Eastern time on the 7th of February. And that's significant because it meant small businesses and cafes couldn't accept contactless payments during the morning rush.

What makes this outage interesting, and again this is one of those change-related ones, is that the source of the issue appears to be with a third party, potentially a payment rails provider, although the actual root cause is still to be better understood. The fact is, this highlights the reliance on third-party dependencies. All of these components need to work together. Everything within the Square system was working: we could get to it, we could reach the back-end service, we could authenticate. But this third-party plugin or application had an issue, which then caused the whole system to become nonfunctional.

So we couldn't complete the transaction even though, technically, the system was up and available. The moral of the story is that the chain is only as strong as its weakest link, and what you want is visualization and visibility across the entire service delivery chain.

Now, that doesn't necessarily mean you have to specifically identify or instrument every part of that chain. But if I understand that a call is being made to a third party, I might want a separate test that exercises that functionality, or, if I don't want to build out a full functional test, I can at least take into account that between this step and that step there will be a wait of X milliseconds or X seconds. If that wait starts to increase, I want to be aware of it, because it may be the early sign of a system failure. At least then I can start to see it.
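As a loose sketch of that "wait of X milliseconds" idea, the snippet below times a hypothetical third-party call and flags it when the wait drifts past a baseline. The endpoint, baseline, and multiplier are made up for illustration.

```python
import time
import requests

# Hypothetical third-party dependency and a baseline wait we consider normal.
THIRD_PARTY_URL = "https://payments.example.com/health"
BASELINE_SECONDS = 0.3      # typical wait between "this step and that step"
ALERT_MULTIPLIER = 3        # alert if the wait grows to 3x the baseline

def check_dependency_wait() -> None:
    start = time.monotonic()
    try:
        resp = requests.get(THIRD_PARTY_URL, timeout=10)
        elapsed = time.monotonic() - start
    except requests.RequestException as exc:
        print(f"ALERT: third-party call failed outright: {exc}")
        return
    if resp.status_code >= 400:
        print(f"ALERT: third-party returned HTTP {resp.status_code}")
    elif elapsed > BASELINE_SECONDS * ALERT_MULTIPLIER:
        print(f"ALERT: wait grew to {elapsed:.2f}s (baseline {BASELINE_SECONDS}s)")
    else:
        print(f"OK: third-party responded in {elapsed:.2f}s")

if __name__ == "__main__":
    check_dependency_wait()
```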

So things change in these third parties, and as long as I have visualization across the chain, and we've talked about this on a number of occasions, one of the main things I want to do is identify who the responsible party is. As long as I know there's an issue and where it sits, I can start to take steps to work around the situation as it occurs.

And last, on February 12, people in Philadelphia's Fishtown and Kensington districts experienced Xfinity service interruptions just ahead of the Big Game, where, unfortunately, the Philadelphia Eagles were set to face off against the Kansas City Chiefs. So really quite pertinent.

Comcast reported that they were able to restore cable access for the majority of customers before kickoff, but there was still some impact. Comcast has since attributed it to vandalism, noting that a fiber optic cable was severed in the Kensington section. So obviously that cable came back to some aggregating point, and we've talked about sabotage before and the fact that it impacts these local areas, so the effect is quite localized. We had the cable cut in Marseille, for example, that impacted access to some data centers, which effectively lost their connectivity to the outside world. So this does happen.

There are a couple of things to take from this. It typically impacts the last mile, but there are also different ways to restore service. If you think about it, and I'm going to show my Australian bias here, although I'm sure this happens everywhere else as well, you can have alternate systems. If I lose my fixed-line connection to the internet, if my piece of wet string dries out, as it were, or the signal gets too weak, it automatically rolls over to a 5G service on the mobile network, so connectivity is maintained.

So the point is this: yes, obviously this was significant because of the timing and the location, but in terms of raw impact, and I'm going to be really horrible and say it, it was actually a small outage. To the particular people affected, though, it was really impactful and a huge deal to get back online. What it highlights, and why I've brought it up, is the value of having multiple ways around a failure. If I understand what's happening, I can mitigate it; I've talked about the ability to automatically cut over to a 4G or 5G cellular system to maintain connectivity. Or it may be that I have diverse routes out, using completely different carriers.

Now, again, if I'm talking about the last mile, it may be that even though I've paid for diverse routing, it's actually sitting in the same pit and pipe structure. So you have to think about and understand where your traffic is going and what your peering relationships are. The point is, what I actually want is visibility of all that, so that if I know where the fault lies, I can kick over to the alternative. In my case that happens automatically: connectivity is maintained, and I'm no longer going over my fixed wireless but over a 5G service once the main signal is lost.
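To make the failover idea concrete, here's a simplified, hypothetical sketch of the kind of link health check a router or small script might run: probe the primary path, and after a few consecutive failures hand traffic to the backup. The probe target, thresholds, and the failover function are placeholders, and the ping flags are Linux-style.

```python
import subprocess
import time

PRIMARY_PROBE = "1.1.1.1"      # hypothetical reachability target over the primary (fixed) link
FAILURE_THRESHOLD = 3          # consecutive failed probes before failing over
PROBE_INTERVAL_S = 10

def primary_link_up() -> bool:
    """Single reachability probe over the primary link (one ICMP ping, Linux-style flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", PRIMARY_PROBE],
        capture_output=True,
    )
    return result.returncode == 0

def fail_over_to_5g() -> None:
    """Placeholder for whatever actually moves traffic (route change, modem switch, etc.)."""
    print("Primary link down: failing over to 5G backup")

def monitor() -> None:
    failures = 0
    while True:
        if primary_link_up():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                fail_over_to_5g()
                break
        time.sleep(PROBE_INTERVAL_S)

if __name__ == "__main__":
    monitor()
```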

So that's our show. Don't forget to like, subscribe, and follow us on Twitter. As always, if you have questions; feedback—whether it be good, bad, or ugly; or guests you'd like to see featured on the show, send us a note at internetreport@thousandeyes.com. So until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com