The Anatomy of an Outage | Pulse Update

MIKE HICKS: Welcome back to Internet Report’s biweekly Pulse Update. This is a podcast where we keep our finger on the pulse of how the Internet's holding up week over week.

The last couple of weeks, I've actually been thinking a lot about the anatomy of an outage. So in this episode I'm going to discuss what that means, why it matters, and explore some of the recent case studies. Let's start with The Download. This is where we quickly summarize (my TL;DR, if you like) what's been happening and what you need to know about the Internet this week, in two minutes or less.

So when I say anatomy of an outage, what I really mean is the characteristics of a particular type of outage. Going back (and I'm showing my age here), I used to categorize applications into two camps: latency-dependent and bandwidth-dependent. What I mean by that is the application either depends on low latency to work, so think of something like voice, where I need low latency to make sure my voice connection holds up; or, like FTP, it's really constrained by bandwidth, where how much bandwidth I have determines how quickly the transfer comes through.

So if we flip that and look at it from an outage perspective, we can still take those characteristics of the application itself, but we overlay them on top of the outage type. If we think about a SaaS outage, it typically impacts all subscribers to that service, so we're looking at a global impact. Whereas if I look at an ISP outage, it's going to impact specific geographies. Then, putting the application characteristics on top of that, I can really start to see where to look for that particular problem.

Now, understanding the typical anatomy of different kinds of Internet outages is really important because it can help you recognize the type of incident you're actually looking at, where you need to go, and what steps you need to take to mitigate it, work around it, or make sure it doesn't impact your service.

In recent weeks, we've seen a number of outages that serve as helpful case studies of these various outage anatomies. Obviously, we can't cover all of them, so we're going to pick a few out. First, we've seen some security-related incidents: the Western Digital and SD Worx outages. Outages like these are normally characterized by users being denied access to the service. They can actually reach the service, but due to some sort of security concern or certificate issue, they can't go any further.

Next, we have a single-point-of-aggregation issue, in this case an outage at SpaceX's Starlink. Outages like these are normally characterized by wide-scale disruption: the service is reachable across the network but typically unresponsive, and it comes down to a single point where things aggregate. That could be a load balancer, or some other part of the service delivery chain, where the issue starts.

And finally, I want to talk about last-mile challenges. This is one where we had a Vodafone UK outage. Outages like these are normally characterized by a localized impact; in other words, the disruption is typically restricted to a local geography, just the area around the affected users.

And now let's dive in and explore what happens in each of these outages, but first let's take a look at the overall outage numbers and trends this week.

As always, we've included chapter links in the description box below, so you can skip ahead to the sections that you're most interested in. And as always, we'd love you to hit like and subscribe and email us at any time at internetreport@thousandeyes.com. We always welcome your feedback, questions, criticism, and general advice. And to discuss all of this, I'd like to welcome Brian Tobia. Brian, it's great to have you back, mate.

BRIAN TOBIA: Thanks, Mike, glad to be back on.

MIKE: All right, my favorite part of the podcast. Let's take a look at these numbers.

So over the last two weeks, we've seen these numbers stabilize a bit. Initially we saw a slight drop, from 242 to 235, a 3% decrease compared to April 3-9. This was followed by a slight rise, with global outages nudging up from 235 to 239, a 2% increase over the previous week. So if I look at the overall two-week period, it was really only about a 1% decrease, so very, very stable.

This pattern was reflected in the U.S., where outages dipped slightly, dropping from 105 to 100, a 5% decrease compared to April 3-9. That was again followed by a very small rise, from 100 to 104, which is a 4% increase.

A couple of really interesting things here. U.S.-centric outages accounted for 44% of all observed outages. That's larger than in the previous period, where it was only 39%, and it's also significant because it's the first time in 2023 that U.S.-centric outages have accounted for more than 40% of all observed outages. Last year the trend was going the other way; by this time last year, we had started to drop below 40%. So it'll be interesting to see how that develops.
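(For anyone following along with the math, here's a quick sketch of how those week-over-week percentages work out. The counts are the ones quoted above; the snippet is just illustrative arithmetic, not anything from the ThousandEyes platform.)

```python
# Quick sanity check of the week-over-week percentage changes quoted above.
def pct_change(old: int, new: int) -> float:
    return (new - old) / old * 100

print(f"Global: {pct_change(242, 235):+.1f}%, then {pct_change(235, 239):+.1f}%, "
      f"overall {pct_change(242, 239):+.1f}%")                                      # ~ -3%, +2%, -1%
print(f"U.S.:   {pct_change(105, 100):+.1f}%, then {pct_change(100, 104):+.1f}%")   # ~ -5%, +4%
print(f"U.S. share of the latest week: {104 / 239:.0%}")                            # ~ 44%
```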

But in relative terms, it's been a quiet period over these last two weeks. As I said, we've stabilized around those numbers; you could call it flat. And of course, outages still happen, and we're still looking at reasonable numbers. But Brian, why do you think these outages appear not to have had such a big impact on users?

BRIAN: Yeah, great question. I think we need to look at the numbers a little more closely. If we look at the actual timing of some of the events, we can see a lot of them happening late at night or right on the hour, which is where maintenance windows come in. So the impact can be a bit different based on when the outage is occurring, and you can see who it's potentially affecting. Late at night, maybe not as many users; or again, during maintenance periods, you have those kinds of repeated, planned actions. So I think that's some of the trend we're seeing.

MIKE: That's very good, yeah. And that actually ties into the anatomy of an outage we're talking about. The anatomy of maintenance, or engineering, or scheduled work is that it occurs outside of business hours. And I think you're right; I think that's what we're seeing here.

Now let's discuss the outages in the past couple of weeks as we go under the hood.

First, let's take a look at a couple of security-related incidents. Now, as I stated, in broad terms a security-related outage is one where the user is denied access to a system due to a security concern. In the first case, Western Digital, which is a computer drive, data storage, and cloud storage service provider, took some of its My Cloud storage services offline. When users tried to authenticate to the services, they were unable to, often receiving HTTP 503 Service Unavailable responses. Western Digital noted that it had identified a security incident and took the step of taking My Cloud services offline as a precaution while it investigated what was going on.
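(As a quick aside for anyone who wants to spot this symptom themselves: below is a minimal sketch of a probe that distinguishes "unreachable" from "reachable but returning 503 Service Unavailable." The URL is a placeholder, not the actual My Cloud endpoint, and this is just an illustration rather than how the outage was actually measured.)

```python
# Hypothetical availability probe: flag HTTP 503 "Service Unavailable" responses
# separately from plain network failures. The URL below is a placeholder.
import requests

def check_service(url: str) -> str:
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        return f"unreachable at the network level ({exc.__class__.__name__})"
    if resp.status_code == 503:
        return "reachable, but returning 503 Service Unavailable"
    return f"reachable, HTTP {resp.status_code}"

print(check_service("https://service.example.com/login"))  # placeholder URL
```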

So while in that case the service itself was taken offline, in security-related outages like these, some companies may instead choose to leave their services online and just shut off public access to them. And this is what we saw happen with SD Worx.

SD Worx is a full-service HR and payroll company. What SD Worx did was shut off public access to its systems, following what it described as detection of anomalous traffic. So they saw some anomalous traffic and said, this looks iffy. In this case, they appear to have preventively isolated their systems and servers, making them inaccessible to mitigate any further impact. Brian, do you want to take us through what we observed?

BRIAN: Yeah, I'd be happy to. All right, so for those who are on audio only, we're currently looking at a share link within the ThousandEyes platform, but we'll narrate through it so you won't miss out. What we're looking at here is a share link for the outage Mike was just describing. This one's pretty interesting. So again, instead of shutting the actual services off, they chose to shut the network off, so for any users trying to get into the application, the connection would simply fail at the network level. You can see here we have a bunch of our cloud agents that were trying to connect to that application, and they were seeing 100% packet loss, so essentially being cut off from that service.

If I move to the Path Visualization, we can actually see this hop by hop. All of our agents are over on the left, and as you follow the path, the red dots are where the connection basically stops, which is right before they hit the actual application. You can see everything up to that point was perfectly fine, and if I hover over, you can see that's where the 100% packet loss started. It's a really interesting view, because a lot of times when an application goes down, you'll see the path being cut off right before the application. So this was a pretty good indication of what they were doing: as mentioned, cutting the network access to prevent any use of the application while they troubleshot it.
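(For listeners who want a feel for what that hop-by-hop view is built on, here's a rough sketch of the underlying idea: TTL-limited probes that report the last hop that answers before the loss begins. This is just an illustration using Scapy, which needs raw-socket privileges, and it targets a placeholder address; it's not the ThousandEyes Path Visualization itself.)

```python
# Rough illustration of hop-by-hop probing: send ICMP probes with increasing TTL
# and report where replies stop coming back (i.e., where the 100% loss begins).
# Requires Scapy and root/administrator privileges for raw sockets.
from scapy.all import IP, ICMP, sr1

def trace(target: str, max_hops: int = 20) -> None:
    for ttl in range(1, max_hops + 1):
        reply = sr1(IP(dst=target, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        if reply is None:
            print(f"{ttl:2d}  *  no reply (loss at or beyond this hop)")
        elif reply.type == 11:   # ICMP Time Exceeded: an intermediate router answered
            print(f"{ttl:2d}  {reply.src}")
        elif reply.type == 0:    # ICMP Echo Reply: we reached the destination
            print(f"{ttl:2d}  {reply.src}  (destination reached)")
            return

trace("203.0.113.10")  # placeholder documentation address, not the real service
```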

MIKE: So that's a forwarding loss, Brian. So what they're saying is we actually don't have any path going into that network.

BRIAN: Exactly, yeah. And the interesting part, to stay on the path piece: if I go over to our BGP view, we can actually see how they did this. If I scroll down a little bit, we're now looking at a view of the different advertised paths through BGP for the application and who they were being advertised to. We have our route monitors on the left, and if we hover over any of them here and go to “view details of path change,” we can see where they withdrew that route.

So as you were just mentioning, Mike, we can see the history: before this happened, all users had a valid path. Then, when we get to the time of the outage, we see that route being withdrawn, which essentially means the entry has been taken out of the phone book. Routers don't know how to get to that application anymore, they don't know how to reach that end of the network, and they can't make the connection. So this is a really interesting way of seeing what the service provider actually did: when they saw the incident occurring, they withdrew the route, and that withdrawal is what caused the outage we observed.
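(To make the "entry taken out of the phone book" analogy concrete, here's a toy sketch of a longest-prefix-match lookup: once the prefix is withdrawn, the lookup simply has no answer. The prefix and AS number are placeholders, and this is a conceptual model, not how SD Worx's routers or BGP itself are implemented.)

```python
# Toy routing table: prefix -> next hop. Withdrawing the prefix removes the only
# path, so a longest-prefix-match lookup for the service's address finds nothing.
import ipaddress

routes = {ipaddress.ip_network("198.51.100.0/24"): "next hop via AS64500"}  # placeholders

def lookup(addr: str):
    """Return the next hop for addr via longest-prefix match, or None if no route exists."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routes if ip in net]
    return routes[max(matches, key=lambda n: n.prefixlen)] if matches else None

print(lookup("198.51.100.10"))  # 'next hop via AS64500' -> traffic can be forwarded
routes.clear()                  # the prefix is withdrawn, like the route withdrawal above
print(lookup("198.51.100.10"))  # None -> no path, so the connection can't be made
```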

MIKE: So that's good, and it's very clear; we can see them take it down there. And it's a pragmatic way to do it: let's take ourselves offline, and then we can see what's going on. Very good.

Now, SpaceX’s Starlink also experienced a security-related outage, but the anatomy of this incident was quite different from that of Western Digital and SD Worx, characteristic of what I like to call a single-point-of-aggregation issue.

Now, I might be stretching the security-related categorization a bit here, and I'll get to that in a minute, but it amounted to the same thing, with access being denied. So we have that same anatomy: access denied, caused by some sort of security-related issue or concern.

But the root cause in this case was an expired security certificate at a particular ground station. That represents a single point of failure, and it led to hours of downtime for Starlink's global customer base. So this is an example of another common type of outage that we often see, where there's a single-point-of-aggregation issue. In this occurrence, it came down to something security-related, i.e. a certificate. In other cases, it may be something within a load balancer. But either way, it's a single point of aggregation that causes the whole application to fail.

So just for clarification here, Brian, what is the importance and, I suppose, the relevance of the security certificate?

BRIAN: Yeah, so a certificate basically allows you to verify your communication, and that who you're communicating with is who they actually say they are. So it's just a way of checking identity and making sure both sides agree: yes, that is who I'm talking to, and that's when I should be exchanging my private data.

MIKE: Got it. So basically it's authenticating who I am before we actually share that data. It's a very small part of the service, if you want to put it that way, just one element. But again, what this highlights, I think, is the criticality of that service delivery chain. I called it an aggregation point, or an aggregation-related issue, because it's where everything comes into one place. It just shows you how the smallest cog failing to function can bring a whole service delivery chain down.

BRIAN: Yeah, absolutely. And I think with automation platforms, sometimes you forget about, like you mentioned, the little things. You have the infrastructure automated, or you have the application side covered, but if you forget about something like a security certificate, if you don't have it enrolled in something like automatic renewal, then yeah, the little pieces can bring it all crumbling down, unfortunately.

MIKE: That's a good point you bring up about validation. If you can continuously check what's going on, so check expiry dates or check that certificates are still valid, you could almost bring that into an automation process: okay, this one's up for renewal, we run some sort of policy verification, and then we issue a new certificate at that time.

BRIAN: Yeah, absolutely. You can run a certificate authority, which would allow you to do that, so you can keep track of all those certs and make sure none of them fall out of date. I think that's a really important thing to have, especially when the infrastructure is that critical, along with the monitoring. So, like you just mentioned, being able to understand when a certificate is expiring and, if it expires, how do I know that's the problem? Can I check on that layer within the test, or within my network visibility? Can I check whether it's the network, or a security certificate like it was in this case? So I think that's important as well.
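(For anyone who wants to build the kind of expiry check Mike and Brian are describing, here's a minimal sketch using Python's standard ssl module. The hostname and the 30-day threshold are assumptions for illustration; swap in the endpoints and renewal policy you actually care about.)

```python
# Minimal certificate-expiry check: connect over TLS and report how many days
# remain before the server's certificate expires.
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()                 # certificate from the verified chain
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

if __name__ == "__main__":
    for host in ["example.com"]:                     # placeholder list of monitored endpoints
        remaining = days_until_expiry(host)
        status = "OK" if remaining > 30 else "RENEW SOON"   # 30-day threshold is an assumption
        print(f"{host}: {remaining:.0f} days until expiry ({status})")
```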

MIKE: Yeah, it's an “it's not you, it's me” type of situation going on there.

BRIAN: Exactly.

MIKE: But again, another important point, and I don't want to harp on this, comes back to the monitoring you identified. I glibly threw out the term “aggregation point,” but you need visibility across that complete chain to understand: all right, what is my aggregation point? And that's something you can work out in advance. Do you know what? This is the point where I do my verification, or this is where my traffic goes through a load balancer, this is where all my traffic comes in. Identify where it is up front.

BRIAN: Yeah.

MIKE: All right, so finally, we also observed an outage that impacted part of the customer base of a single telco: Vodafone in the UK. The anatomy of this outage suggested the problem was in the last mile, between the customer's premises and a traffic handoff point. As is typical for these last-mile outages, the impact was restricted to the local area; in this case, it was quite a big area with quite a big subscriber base, but you still don't see disruption to the transit path. We've talked in the past about outages that occur because a transit provider drops a major link; this wasn't that. And rather than hitting a single specific service, it's more that I'm sitting here and can't get to any services whatsoever. That could be down to my local Wi-Fi or, as it was in this case, my ISP, where I simply don't have connectivity.

A couple of points on that, Brian. As a consumer of a broadband service, I have a Service Level Agreement (SLA). This outage lasted several hours, so would this type of thing breach that SLA, do you think?

BRIAN: Yeah, and I think it's important to understand what that SLA is and to be able to prove where the problem actually was. So, yes, it absolutely could have. But to your point, I need to understand: is it with that ISP, is it something local like a cable cut along the physical path, or is it a local provider outage in that area? So it's definitely important to understand that and be able to provide some evidence, because that's really key when you're having that discussion with them. But yeah, it definitely can affect the SLA.

MIKE: Yeah, it's interesting. And you make a couple of points there: obviously, it comes back to having the visibility to understand where the problem is, but also to understanding what your SLA actually covers. If you consider an SLA, it will probably be based not on the service you're consuming, so not “can I get to Facebook,” but more on “do I have network connectivity?”

It's also important to understand the hours it covers and what it actually includes. Because ultimately what I want to do is work around the problem. I want that service, because it's critical to my ability to work, especially as we're living in this hybrid work environment where we're working remotely, working from anywhere. So if your service is disrupted, it's always good to get service credits back, but what I really want is my service working. I don't want the disruption of having it out for a period of time, because that's essentially my business; that's me unable to work. I know when mine goes down, and it's normally my local power, I'm rushing out to the shed to get my batteries and get my generator going. And as much as I love the service credit, what I really want is my service to be operating. So if I identify that it's my local Wi-Fi, I can deal with that myself; or if it's my local ISP, what else can I do? Do I have a hotspot? So what other workarounds can be activated, do you think?

BRIAN: Yeah, good point. I think hotspots are a great one, or a dual ISP path if you have that ability. It's really about having a redundancy plan and a backup; as you mentioned, as we rely more and more on these services, that's really important.

MIKE: Yeah, I like that dual ISP one, but obviously that comes down to the scale you're consuming at. Me sitting at home here probably doesn't justify two paths, but I do need that connectivity.

BRIAN: Exactly.

MIKE: So hopefully these case studies have been helpful in illustrating the value of understanding the unique characteristics of different kinds of outages, so you can quickly recognize the type of incident you're dealing with, take the right steps to mitigate its impact, or decide what you can do to work around it. We've also created a handy cheat sheet that summarizes the common warning signs of several types of outages; we'll provide a link in the description box below. So definitely check that out.

So once again, Brian, absolute pleasure. I love your radio voice, mate, and always great to have you on the podcast.

BRIAN: Thanks Mike, definitely happy to be on.

MIKE: So that's our show. Please hit like and subscribe, follow us on Twitter @thousandeyes, and if you have any questions, feedback, or guests you'd like to see, just send us an email at internetreport@thousandeyes.com. So until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com