Is Spring Cleaning Causing an Outage Spike? | Pulse Update

MIKE HICKS: Welcome to the Internet Report’s Pulse Update. This is a biweekly podcast where we discuss what's up, what's down, and what's trending on the Internet. And today, this is a very special episode. We're coming to you not with the nice therapeutic sounds of Australia in the background, but with the background noise of Cisco Live, because we're live here in Las Vegas. And I'm joined today by my good friend, Kemal Sanjta.

KEMAL SANJTA: Thank you so much. It's awesome to be here. I hope that everyone out there is having as good a time as we are here in Vegas.

MIKE: It's been pretty good. It's actually been very good this year. We've had a lot of coverage, lots of content coming out of the event, and a lot of people visiting the booth, which is where we're broadcasting from today.

So today we're going to be talking about possible spring cleaning: the kinds of things we see out there that might explain the spike in outages we've observed at the beginning of the month. We've seen a similar pattern over the past few years. We also want to unpack some recent outages at Twitter, Microsoft 365, Slack, Instagram, Apple's iMessage, and HBO's subscription-based streaming service, Max. But before we get there, let's start with "The Download," which is a quick summary, essentially my TLDR, of what you need to know about the Internet in two minutes or less.

When looking at the overall outage trends over the past month, in early May we again saw a brief spike in outages. We've now observed this for three consecutive years around the same time. This seasonal spike could be explained by potential spring cleaning: many IT teams seem to do a final spring clean of their infrastructure and set-ups in May, seizing the opportunity to slot in and schedule additional engineering or maintenance work.

On top of this, we also saw several outages at major companies in recent weeks. We're going to dive into the Twitter disruption and the Microsoft 365 outage in more depth later in the episode. But first, I want to give a quick overview of these and some other outages that caught our attention recently. So let's start off with Twitter Spaces.

So Twitter used its live audio conversations feature, Twitter Spaces, on May 24th to host the presidential candidate announcement by Florida Governor Ron DeSantis. The announcement was impacted by several technical problems that caused issues accessing the Twitter Space at the beginning of the event (and you might hear a forklift backing up behind me there). The issues might have been caused by the high number of users trying to access the stream concurrently. But it's more likely that the bottleneck was further down, at the entrance point of the stream itself, rather than the stream infrastructure being unable to support that number of concurrent users. In Twitter's case, it would have been really difficult to identify such a bottleneck beforehand. You typically test traffic patterns, but testing traffic volume at that scale wouldn't have been possible, and simply testing the functionality wouldn't have surfaced the problem. Instead, the bottleneck would likely only have become apparent once the event actually started.

So moving on to HBO's subscription-based streaming service, Max. It experienced some issues on its May 23rd launch day, with users encountering inaccessible streams. A spokesman initially attributed the problems to a scale issue, and this does appear to be the cause: a large number of customers all hitting the new service at the same time to try it out.

On May 23rd, Microsoft reported issues with content rendering on office.com and other services that come under Microsoft 365. The services could still be reached and would partially load, but elements of the page content would not render because some backend calls to other services failed to complete. For users, this manifested as incomplete page loads, pages becoming unresponsive when trying to search or pull information in, and some general timeouts.

And then just over a week later, on June 5th, Microsoft experienced another outage impacting Microsoft 365 services. This outage was actually made up of several sustained periods of disruption over 27 hours. I'm going to go into that in a little more depth in a minute.

Slack users encountered error messages when trying to connect to the service. This was relatively short, around a 30-minute period during the mid-morning Pacific Time on May 17th. The company acknowledged the problems in a status advisory and attributed them to an operational change: they made a change that contained an error, and this caused a database to become inaccessible. So what was happening was that you couldn't get information when you tried to load your screens.

So then moving on to Instagram, a technical issue left users globally unable to access Meta's Instagram on May 21st. End users attempting to open the app were greeted with an error message saying the feed couldn't load, and refreshing the homepage and profiles did not work. The impact was global and the duration was two to three hours. What we saw going on there, again, was a lot of 500 errors, again indicative of the problem coming from the server side.

And finally in this TLDR, Apple's iMessage. On May 23rd, Apple confirmed some users were unable to send or download attachments in iMessage. The problems did not appear to impact all users globally, only a subset, suggesting only part of Apple's infrastructure was affected. Additionally, the issue didn't appear to generate a significant number of outage reports. One reason for this could be that iMessage has an automatic fallback mechanism that kicks in when messages can't be delivered over the Internet: it falls back to SMS. So it may just be that people had this workaround available.

KEMAL: So what's really quite interesting here is the fact that a lot of these different mission-critical enterprise applications actually went down at the same time, or during the same time frame. Now, we've seen this paradigm shift where users move away from on-prem applications towards SaaS-based applications, and they expect them to just work. But if you look at this, a lot of these mission-critical applications, such as Office 365 or Slack, that people rely on to do their jobs effectively and productively, are having problems. So again, this underpins the importance of visibility, where people need an easy way to say whether it's an issue on their side or an issue on the provider's side.

MIKE: Yeah, absolutely, and they want to be able to understand where the problem is so they can take steps around it. If something's not there: is it me, is it you, is it the site, where can I drill into? So that's great, and we're going to dig into the Microsoft one as we go from there.

So as always, we've included chapter links in the description box below so you can skip ahead to the sections most interesting to you. We'd also like you to hit like and subscribe, and you can email us at any time at internetreport@thousandeyes.com.

I'd also like to call out and thank the people who have come up to us here, and during the sessions that Kemal and I have been running, to say they're listening to the podcast. Thank you very much. We always welcome your feedback and questions, so please keep that coming.

KEMAL: Well appreciated.

MIKE: Absolutely. And with that, let's take a quick look at the overall outage trends we've been seeing.

All right, so now to my favorite part of the podcast, where we look at the numbers and the outage trends we've been seeing over the past couple of weeks, and in this case what we saw over the May period. Following that sharp spike we observed at the start of May, global outage numbers were reasonably stable during the end of May and beginning of June, the May 22nd to June 4th period. They initially dipped slightly, from 174 to 170, when compared to the May 15th to 21st period. This was followed by a slight rise, really marginal, with global outages increasing from 170 to 176, a 4% increase compared to the previous week. This pattern was not reflected in the US, where outages increased consistently across the two-week period: initially rising, and again we're talking about slight rises here, from 71 to 78, a 10% increase compared to May 15th to 21st, and then increasing again by 10% the following week, rising from 78 to 86. If you look at what the U.S.-centric outages accounted for as a share of total global outages, they made up 47% of all observed outages from May 22nd to June 4th, which is much larger than the percentage observed between May 8th and 21st, where they accounted for 40% of observed outages. This continues the trend we've seen over recent weeks, where U.S.-centric outages account for at least 40% of all observed outages.
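For anyone following along with the arithmetic, here's a quick sketch of how those week-over-week percentages and the U.S. share of global outages work out from the counts quoted above. The helper function is just for illustration; the figures are the ones from the episode, rounded to whole percentages.

```python
# Week-over-week arithmetic behind the quoted outage figures.

def pct_change(previous, current):
    """Percentage change from one period to the next."""
    return (current - previous) / previous * 100

# Global outages: May 15-21 -> May 22-28 -> May 29-June 4
print(round(pct_change(174, 170)))   # slight dip
print(round(pct_change(170, 176)))   # ~4% rise

# U.S.-centric outages over the same weeks
print(round(pct_change(71, 78)))     # ~10% rise
print(round(pct_change(78, 86)))     # ~10% rise

# U.S. share of all observed outages, May 22 - June 4
us_total, global_total = 78 + 86, 170 + 176
print(round(us_total / global_total * 100))  # ~47%
```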

I want to focus on May overall. In May, we saw total global outages rise from 1,026 to 1,305, a 27% increase compared to April. U.S.-centric outages also rose, climbing from 451 to 597, a 32% increase. This ends the trend we observed in both March and April, where global outages decreased each month while U.S.-centric outages increased. I want to go back to that start of May, because these numbers have gone up, we've seen that 47% figure there, and we had that huge spike in the middle that dwarfs everything else. But we didn't see too much disruption, and we talked a little about this in the last podcast: we didn't see too much chatter on the Internet, people complaining about access. So one of the things we mentioned is this idea of spring cleaning. Do you think that would be reflected here?

KEMAL: Yeah, for sure. One of the things I was thinking about while you were speaking is that a lot of these companies, during this time of the year, engage in what they call cleaning up technical debt. And technical debt is something that you can push under the carpet for quite some time, right? But there's always going to be a time when it gets you. Sometimes it gets you in a really hard way, sometimes it's actually quite minor. And in this particular case, the numbers are actually quite large if you think about it, both globally and for the U.S., compared to the rest of the period we just spoke about. It looks like people were affected, but fortunately it wasn't as noticeable as it can get.

MIKE: That's true. I did actually delve down into it; I spend my day looking at the numbers, it's a very sad life I lead. Looking at the time of day these outages were occurring, they were reasonably contained. So if you look at the U.S. numbers, we see them increasing, and the fact that they're now pretty close to 50% of observed outages, combined with the times of day they were occurring, is I think indicative again that this sort of spring cleaning is taking place.

KEMAL: Yes, I agree. I agree. It's actually quite good that they weren't reflected as really large outages, as they often are.

MIKE: Yeah. And with that, let's actually take a look at some of those outages that were user impacting, and take a look under the hood. All right, so the first one we want to look at was Twitter's. It was more of a disruption, really. Twitter used its live audio conversations feature, Twitter Spaces, on May 24th to host the presidential candidate announcement for Florida Governor Ron DeSantis. The announcement was impacted by technical problems that caused issues accessing the Twitter Space at the beginning of the event. After about 15 minutes, the Space was relaunched, this time on a different profile. It was initially launched under Elon Musk's profile; then they came back and launched it from another profile, and at that point people could access it and the event actually got started.

So to understand this incident, it's worth dwelling a little on how Spaces actually works. A Space is launched from an individual user's profile, as I understand it. And in this case it was launched from Elon Musk's profile, and he's got the most Twitter followers, at around 140 million.

KEMAL: Pretty significant, right?

MIKE: Absolutely, quite a lot. And then what happens as we go through this? You don't have to be a follower of Elon Musk to access his stream. It's available: you can get to it through a browser, and you can get to it without being logged onto Twitter or even being a member of Twitter. But if you are a follower of Elon Musk, it goes through a process where it puts a notification on your profile that says “Elon Musk is having a Space at the moment,” so you can go straight onto it.

Now, from what we saw looking at the stream, as it goes through this process it does a little bit of authentication, and it's just authenticating that you're a follower, so it can put up that notification. What this means is you could have potentially 140 million users trying to access the Space, and each of them has to go through this verification process. If we look at the architecture of Twitter more broadly, it's actually fairly robust and used to handling that many concurrent users. So this is why we're saying the problem was downstream: it's that little verification process we could see the stream going through that's indicative of the bottleneck.
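To make the shape of that bottleneck concrete, here's a rough, purely illustrative sketch (this is not Twitter's implementation, and the names and URLs are invented) of a join path where every arriving listener triggers a synchronous follower check before the stream is handed back. The streaming delivery behind it may scale fine; a per-join check like this at the entrance is the kind of step that becomes the choke point when tens of millions of potential listeners show up at once.

```python
# Illustrative only: a join path with a synchronous follower check at the
# entrance. is_follower() and the stream URL are hypothetical stand-ins,
# not Twitter's API; the point is where the bottleneck sits.
import asyncio

async def is_follower(user_id: str, host_id: str) -> bool:
    # Stand-in for a lookup against a follower/social-graph service.
    await asyncio.sleep(0.05)  # even 50 ms per join adds up at huge scale
    return True

async def join_space(user_id: str, host_id: str) -> str:
    # Non-followers can still listen, but every arrival goes through this
    # verification step (used here to decide whether to show the
    # "host is live" notification) before getting the stream URL.
    show_banner = await is_follower(user_id, host_id)
    suffix = "?banner=1" if show_banner else ""
    return f"https://stream.example/{host_id}{suffix}"

async def main():
    # Even a modest burst shows the entrance, not the stream, doing the work.
    urls = await asyncio.gather(*(join_space(f"user{i}", "host") for i in range(1000)))
    print(len(urls), "listeners admitted")

asyncio.run(main())
```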

KEMAL: Speaking about the technical issue itself: you have potentially 140 million users, even if that wasn't the actual number of users trying to access the service.

MIKE: But it was the potential.

KEMAL: Exactly. Imagine how hard it is to scale that service.

MIKE: It's near impossible.

KEMAL: Exactly. But all of those are good problems to have when you are a service provider.

MIKE: Absolutely. And just on that scaling issue as well: why would you scale to that?

KEMAL: Yes.

MIKE: Elon Musk has by far the largest number of followers on the platform. So why would I scale for that exception, unless he's going to run one of these Spaces every day? So there's no harm. And I'll tell you one thing: they recognized it fairly quickly, restarted it on another profile with fewer followers, and then everything went smoothly and they were able to do it.

KEMAL: That company is very good at what they do. Their SRE practices are fantastic. So I'm not even surprised that what we saw is actually what happened there.

MIKE: All right, so let's change tack a little and delve into a couple of Microsoft outages. The first one: on May 23rd, Microsoft reported issues with content rendering on office.com and some other services under Microsoft 365. The services could still be reached, and this is quite an important point when we get to the second one. The services could be reached and would partially load, so you could get a page up, but elements of the page content were failing to render. For users, this manifested as incomplete page loads or unresponsive pages, the kind where you keep hitting the button harder until you get a response. But then just over a week later, on June 5th, Microsoft experienced another outage that impacted Microsoft 365, and this one had a far greater impact than just being unable to render a page. When we looked at it, the outage was made up of several sustained periods of disruption spread over 27 hours, with the disruption lasting around eight hours in total. We captured this quite well because it showed up quite clearly as a global outage. So what I'd like to do is have you, Kemal, take us through what we saw.

KEMAL: Yeah, it was a pretty significant event and I'm actually quite glad to take you through it. Microsoft unfortunately experienced sustained outages over a period of effectively two days, during which one of their mission-critical applications, in this particular case Outlook, sustained a significant outage.

MIKE: It was the service in the back as well, so it wasn't just connectivity; it was a service in the backend. And it spanned two calendar days across roughly a 27-hour period, so we had the back of one day and the start of another.

KEMAL: Essentially, if you were not able to read your email two days ago, one of the things that contributed to that is exactly this outage. So on the 5th of June at 9:15 CDT, we can see that there was a significant spike in server errors.

So if I click on the start of that outage, you can quite clearly see what's happening: from various different locations, there were connectivity issues and outages towards Microsoft Office 365. And as we progress through this particular event, we can see it's actually getting worse and worse when it comes to the metric we call affected servers.

MIKE: Yeah, so there are a couple of things here as well. For the people listening on audio, we're now looking at the ThousandEyes platform, going into Internet Insights, and we can see this. My favorite thing: we're looking at patterns. And one thing I thought was fairly interesting is, like you said, it was a sustained period, but it is spiky. When we see things come up and down, sometimes it's a nice clear ladder up and a clear ladder down. In this case it was, as I say, spiky.

KEMAL: Yes. And that's potentially because the issue gets detected: first it progresses to its maximum point, then they observe it and try to do something about it, be it a restart of services, adding more capacity, or whatever it may be. That could potentially explain the staircase kind of effect.

MIKE: Yeah, so in this case they identified, “oh, we're seeing issues, we know there are issues there, we've done a change,” and then they immediately started rolling it back. I think that's where we see it clearly.

KEMAL: Exactly. So it's actually quite interesting to see the outages because once you switch over to the locations, so once you click on the locations, you can actually see that this was a properly global outage where we can see that it's 66 global locations that were affected. Going back to servers—

MIKE: Yeah, but on this global outage thing as well. I mean, obviously you've got your session here on being a network detective, but the fact is we're looking at an application. Is it safe to say that we can almost assume it's going to be global because it's at that application layer?

KEMAL: Not really.

MIKE: Because it's a SaaS, no?

KEMAL: Well, not really, right? It could potentially be isolated to certain regions. Because if you think about cloud deployments, what's the first thing they teach you? You should have redundancy, be it multi-AZ, multi-region, or something like that. So having something like this can actually tell you whether it's global within the span of a few minutes.

MIKE: Excellent. So taking that train of thought a bit further: because the issue was at the application layer, we didn't see any network reachability problems. We could reach the services at the front end, and we were getting server errors back. But the fact that we couldn't go any further, and that it was global, means we can say the problem sat somewhere within that distribution level. So we can see stuff within that distribution.

KEMAL: Yeah, yeah, correct. So here, if I click on Microsoft Office 365, it tells me there were 504 affected servers at Microsoft 365. If I click on that, I can break it down into more detail, and if I keep doing that, I can see which prefixes were affected as well, which is really powerful from the perspective of identifying what's happening. We can see here, for example, that it's 50% of all tests from our platform. Now, before switching to one of the tests that captures this in detail, one thing I want to say about Internet Insights is that it's data-driven. This is backed by tens of thousands of tests, if not more, that are measuring packet loss, latency, jitter, and various other criteria, and it gives you a data-driven determination of whether something is a problem or not.

MIKE: I think it's close to three billion measurements a day we're actually looking at.

KEMAL: Which is amazing if you think about it. So as you can see, this event happened in multiple stages: it happened in the morning, then it reappeared in the afternoon, and then we saw it happening again the day after.

MIKE: Yeah, and it looks like my patterns are coming up again: we have the rollback, it clears, there's a gap, and then it happens again, and they acknowledge it happened again. But if you look at the number of affected servers, it actually decreased each time. So we have this decremental effect, with the impact coming down across each one of these occurrences.

KEMAL: Exactly, and it's actually quite interesting that we even saw a re-occurrence of this event. What I think might have been happening is something like: let's move forward with the change we originally wanted, see what it actually breaks, and this time around we'll potentially know how to resolve the issue faster and contain it. So that might explain what you're seeing over these two days.

MIKE: So let's go and take a look at what it looked like as we go down into one of those individual tests.

KEMAL: So what we're looking at now is an HTTP server test, where the target URL is outlook.office365.com. One of the interesting things here is that we observed a significant impact on the metric we call availability. Availability is nothing more than the ability of the agents assigned to the test to do DNS resolution, the TCP three-way handshake, and SSL negotiation; we also measure the time it takes to send and receive the first byte of information, and the HTTP phase, which is essentially the HTTP status code we get back. So again, going back to the timestamp, we see that at 9:10 or 9:15 CDT a few days back, we saw significant degradation. Looking at the world map, we can quite clearly see, and it's very graphically represented, that all of these agents went red. And if I click on the table here, I can see where they were failing: receive errors, 503s, and so on.
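For listeners who want to see what those phases look like in practice, here's a minimal sketch in plain Python, emphatically not the ThousandEyes agent implementation, that times the same stages Kemal lists: DNS resolution, the TCP three-way handshake, TLS negotiation, time to first byte, and the HTTP status code that comes back. The target hostname is the one from the test discussed in the episode.

```python
# Minimal sketch of the availability phases: DNS, TCP connect, TLS, time to
# first byte, and the HTTP status code. Not the ThousandEyes implementation.
import socket, ssl, time

def check_availability(host="outlook.office365.com", port=443, path="/"):
    timings = {}

    t0 = time.monotonic()
    addr = socket.getaddrinfo(host, port)[0][4][0]          # DNS resolution
    timings["dns_ms"] = (time.monotonic() - t0) * 1000

    t0 = time.monotonic()
    sock = socket.create_connection((addr, port), timeout=10)  # TCP handshake
    timings["connect_ms"] = (time.monotonic() - t0) * 1000

    t0 = time.monotonic()
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)          # TLS negotiation
    timings["ssl_ms"] = (time.monotonic() - t0) * 1000

    t0 = time.monotonic()
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    tls.sendall(request.encode())
    first_bytes = tls.recv(4096)                                # time to first byte
    timings["ttfb_ms"] = (time.monotonic() - t0) * 1000

    status = int(first_bytes.split(b" ")[1])                    # HTTP status code
    tls.close()
    return status, timings

print(check_availability())
```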

So, one of the things that network engineers tend to ask themselves very often is: is this a server-related issue, or is it the network? And ThousandEyes is one of those tools that can bridge that gap between different silos, where server people tend to blame the network. We say, go ahead, blame the network, right?

MIKE: Yeah, it's funny you say that, because it depends on your audience. So I spoke to a bunch of DBAs, they say, “oh, they always blame the database.”

KEMAL: Exactly. Yeah, exactly. That blame game is always happening. Fortunately, this tool is something that can bridge the gap between different teams. So for network engineers, to check whether this was a network-related issue, what you can do is click on the path visualization. And as indicated by the purple lines on the timeline, we can see this is the period of time when the issue was happening. However, we only see very small spikes of packet loss.

MIKE: Yeah, so we're seeing loss. That's what we're looking at there: loss in relation to the outage.

KEMAL: Yes, yes, the metric here is loss. In general, on the path visualization what we're looking at is loss, latency, and jitter, depending on the requirements. So here, as indicated by the purple lines, this is the period where Internet Insights captured the outage itself. And we can see that while there are very small spikes in packet loss, it's quite easy to say this wasn't related to packet loss. Furthermore, if I navigate down to the path visualization, we can see that for all of the agents that were assigned, for example Atlanta, Georgia; Chicago, Illinois; Los Angeles, California, the reachability is there. And if I click on Office 365, we can see the different instances these agents were trying to reach, and all of this, from the network perspective, is pretty much reachable. Just for our audience, it's important to mention that all of these circles we're looking at are essentially networking hops from our agents towards the target itself.

MIKE: I think the other thing we can pull from this, going back to that table, brings in my favorite thing, the OSI seven-layer model. In the HTTP server view we can see the connectivity coming through, but in the table you said we saw 503s. A 5xx is something that's server generated, so we know we're getting something back from the server. It doesn't necessarily mean the backend service is available, but it does mean I can get to that point and the problem is within that server. So it's not my client, it's not my network.

KEMAL: Server related.

MIKE: Correct.

KEMAL: Server-related issue, yes. So unfortunately this spanned a pretty extensive period of time, 27 hours for that matter. But I think networking people were quite relieved to know that it wasn't their fault.
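A small sketch of that triage logic, using a hypothetical helper of our own rather than anything in the product: if no response comes back at all, suspect the client or the network path; if a 5xx comes back, the server was reached and the fault sits behind it.

```python
# Illustrative triage helper: not a ThousandEyes feature, just the reasoning
# from the conversation expressed as code.
import requests

def classify(url: str) -> str:
    try:
        resp = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        # DNS, TCP, or TLS never completed: look at the client/network path first.
        return "no response received: investigate client/network path"
    if 500 <= resp.status_code < 600:
        # The server answered, so reachability is fine; the failure is behind it.
        return f"{resp.status_code}: server reached, fault is server side"
    if 400 <= resp.status_code < 500:
        return f"{resp.status_code}: server reached, likely a client/request issue"
    return f"{resp.status_code}: healthy response"

print(classify("https://outlook.office365.com/"))
```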

MIKE: And the other thing I want to say before we move on is that you can see the spiky nature again, which was really frustrating. We obviously experienced this outage ourselves. If I was accessing the app from my phone, through the Apple Mail client, I could actually send and receive. If I was trying to do it through OWA, or on the thick client on the desktop, it wouldn't work. So it was intermittent, which again is representative of the fact that availability wasn't dropping completely to zero each time. It wasn't a complete failure, which makes it even more frustrating.

KEMAL: Yes, agreed, agreed. Intermittent connectivity to the service is never good.

MIKE: Which is then why you need that consistent or that constant visibility to see what's going on.

KEMAL: We cannot stress it enough, and we do stress it quite a bit: visibility is paramount in what we do.

MIKE: Well, that's great, Kemal. Excellent, thanks very much. As always, mate, it's been an absolute pleasure. Whenever I can actually stand face to face with you and talk, it's very good.

KEMAL: Likewise.

MIKE: We had so much information we could have carried on for days, but the show is about to open, so we need to bring this to a close now.

KEMAL: Thank you so much, Mike, and I hope we repeat this on future Cisco Lives as well.

MIKE: Thanks very much, all right. That's our show. Please like and subscribe, and follow us on Twitter @thousandeyes. Any questions, feedback, or guests you'd like to see, please send us a note at internetreport@thousandeyes.com. Until next time, thanks and goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com