Twitter Performance in the Elon Era, a Ransomware Attack & More Outage News | Pulse Update
Mike Hicks: Hi, everyone. Welcome back to the Internet Report’s biweekly Pulse Update where we keep our finger on the pulse of how the Internet is holding up week over week.
And what an interesting two weeks it's been. Today, we'll be discussing some Twitter disruption, the aftermath of a ransomware attack, a software release that resulted in some DNS resolution issues, and how a quickly executed manual process lessened the pain for some concertgoers in Australia. And of course, we'll also cover the global outage trends.
With so many interesting things to cover, let's start with “The Download,” my TL;DR summary of what you absolutely need to know about the Internet this week in two minutes or less.
THE DOWNLOAD
Twitter experienced two outages in five days. In the first, users were greeted with an empty timeline and a "Welcome to Twitter" message. The site itself appeared to be functional: it could still be reached, trending topics worked, and users could still tweet, but the timeline did not render properly and was blank. This was followed five days later by a second incident, when a modification to an API prevented some users globally from accessing the service.
Dish Network was also forced to pare back to only its most essential online presence when it became a ransomware victim. The company more or less fell off the Internet for a period of time, appearing to cut off connectivity for all of its services, both internal and customer facing, before slowly restoring partial and essential services. Looking at global outage trends, we also saw total outage numbers rise briefly before resuming their downward trend, while U.S. outages increased consistently, accounting for 33% of all outages.
And now let's dive in further. As always, I've included the chapter links in the description box below so you can skip ahead to the sections that are most interesting to you. We'd also love you to hit like and subscribe, and always feel free to email us at internetreport@thousandeyes.com. We welcome your feedback and questions.
And to discuss all of this, it's great to have you back, Kemal. I think the last time we did this, we were actually filming a spontaneous episode from Cisco Live in Amsterdam, working out of a makeshift studio complete with an upturned trashcan and Coke can for counterbalance.
Kemal Sanjta: Thanks for having me, Mike. It's awesome to be on the Internet Report Pulse Update again. Yeah, what a time at Cisco Live in Amsterdam. That was really fun and crafty, I must say.
Mike: It was good, it was good fun. All right then. So let's take a look at the numbers this week.
BY THE NUMBERS
Global outages initially reversed the downward trend seen over the previous two weeks, rising from 316 to 337, a 7% increase compared to February 20-26.
However, that downward trend returned the next week, with global outages dropping from 337 to 271, a 20% decrease compared to the previous week.
Now, this is typical of what we see; we actually consider it a normal trend. If we go back to the previous year, it follows the same kind of calendar pattern.
But this pattern wasn't reflected domestically. The upward trend observed over the previous two-week period continued: U.S. outages initially increased from 62 to 95, a 53% increase compared to February 20-26, and this was followed by another rise from 95 to 105, an 11% increase on the previous week.
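For readers who want to check the week-over-week math above, here is a minimal sketch using only the figures cited in the discussion; it is purely illustrative arithmetic, not part of any outage tooling.

```python
# Week-over-week percent change for the outage counts cited above.
def pct_change(previous: int, current: int) -> float:
    """Return the percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100

# Global outages: 316 -> 337, then 337 -> 271
print(f"Global, week 1: {pct_change(316, 337):+.0f}%")  # roughly +7%
print(f"Global, week 2: {pct_change(337, 271):+.0f}%")  # roughly -20%

# U.S. outages: 62 -> 95, then 95 -> 105
print(f"U.S., week 1:   {pct_change(62, 95):+.0f}%")    # roughly +53%
print(f"U.S., week 2:   {pct_change(95, 105):+.0f}%")   # roughly +11%
```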
And this is what I want to look at. Now, we've talked about this before: when the number of U.S. outages rises, it normally shows up in the global outage numbers as well. But again, this is the second week in a row we're actually seeing this happen.
What this means is that U.S.-centric outages account for a large proportion of the total. So I have a theory about this, and it's that what we're starting to see is almost a controlling of the blast radius, as it were. When someone is doing some sort of change, and a lot of these, as we've talked about in the past, are maintenance engineering type work given the time of day they're done, we're seeing that they're able to contain it. We're not seeing so much of a domino effect, where a change made in North America has an impact that flows down to other parts of the world and then shows up in the global outage numbers. Does that make sense, or am I way off base there?
Kemal: No, you are not. And it's actually quite good to see that even though outages in the U.S. are rising, the global number is going down. To your point, it's a good and positive trend.
It looks like people have started taking resiliency and change management procedures, or operational excellence for that matter, much more seriously than was historically the case, right? So hopefully we see this trend, both U.S. and global, going down. Ultimately, we want to see these numbers as small as possible, but being completely realistic, outages are going to happen. To what extent, that remains to be seen.
Mike: Absolutely. To your point, outages will continue to occur, and hopefully what we'll see is that blast radius, that impact zone for necessary maintenance and engineering work continue to sort of shrink down as we are able to control it and keep it contained.
Now let's discuss some of the outages from the past couple of weeks as we go under the hood.
UNDER THE HOOD
They do say beware the Ides of March, and although technically that's the 15th of March, we also saw some disruptions in early March this year, including at Twitter. The first was on March 1, when Twitter users were greeted with an empty timeline and a "Welcome to Twitter" message, as if they'd just set up their account and logged in for the first time. The site itself appeared to be functional: it could still be reached, trending topics worked, and users could still tweet, but the timeline, which normally displays recent tweets from accounts users are following, did not render properly. It was blank, so users were simply stuck on that screen.
But what I really want to focus on is a second outage that occurred five days later, on March 6, when around 3:45 PM UTC, Twitter users globally began receiving HTTP 403 Forbidden errors, which prevented some of them from accessing the service or clicking on links or tweets.
So, Kemal, do you want to just take us through what we saw?
Kemal: Absolutely, looking forward to it. So what we are seeing here are the application outages from our Internet Insights. And as you can see here the Twitter outage was clearly captured on this view. So on the timeline we can see that starting at 16:45 UTC on March 6, the outage got recorded and it lasted as visible here until 17:50 UTC.
Now, looking at the different metrics, we can quite clearly see that approximately 100 servers or somewhere around that number were affected during this outage. And when it comes to locations we can see quite an interesting spread of these servers being located in the United States, South Africa, Japan, United Kingdom, and other regions. In fact, if I switch over to the metric for locations, we are going to see that approximately 26 locations were affected globally.
So here we are grouping by application, but we can do something similar by server network, where we can see that, quite obviously, Twitter's autonomous system was affected. If we hover over this 147 number here and click into the details, we can see that users, to your point, Mike, were getting HTTP 403s. In fact, we show it as a 4xx response code from the majority of the agents that were affected. So this is what we saw in Internet Insights, more specifically the application outages part of it.
Now we're going to share the page load test to see more details about this particular issue and how it looks from a performance perspective. Page load tests are quite good at showing user experience. Essentially, one of our software components, called BrowserBot, has the capability to execute a page load test, which simulates what a user would do when opening the page. We also have transaction tests, in which you can quite easily script a complete workflow. In this case, we are looking at twitter.com as the target URL in the page load view.
And we can see that the metric selected is page load time. Looking at the event, you can see that on average it took about 3.91 seconds for the page load to complete for Twitter, and all of a sudden, when the outage struck, that dropped to 450 milliseconds. Now, I'm pretty sure a lot of people at Twitter would like to see that as normal behavior, but unfortunately we know that wasn't the case. So that decrease in page load time is telling. And again, this is page load time across multiple different agents; in fact, if I click here on the table, you can see that this test is executed by tens of different agents, and on average it takes approximately four seconds.
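For anyone who wants to reproduce a rough version of this kind of measurement, here is a minimal sketch that times a full page load with a headless browser via Playwright. This is an illustration under assumptions; it is not how ThousandEyes' BrowserBot is implemented, and twitter.com is simply the target discussed above.

```python
# Rough page load timing with a headless browser (Playwright).
# Illustrative only; this is not ThousandEyes' BrowserBot.
import time
from playwright.sync_api import sync_playwright

def page_load_seconds(url: str) -> float:
    """Time from navigation start until the window 'load' event fires."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        start = time.perf_counter()
        page.goto(url, wait_until="load")
        elapsed = time.perf_counter() - start
        browser.close()
    return elapsed

if __name__ == "__main__":
    # A healthy baseline in the discussion above was roughly 4 seconds;
    # an abnormally fast "load" can mean far less content was returned.
    print(f"{page_load_seconds('https://twitter.com'):.2f} s")
```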
Mike: And I think this is really important, and I do love the page load test, as you said, because it shows what the end users are experiencing. You mentioned the page load time dropping. We now know this was a change to an API, but what we're seeing there is that the page isn't loading fully. Similar to the earlier incident where people were getting the welcome page, we still get connectivity: we see the connection coming in, and looking at that page load, all the green lights are on. I'm getting to the page, I can start to see content, but it just doesn't render properly; it only partially loads. In this case, the 403s mean we're actually making a call to something else, some sort of authentication, that's being knocked back: you don't have the authority to execute that.
Kemal: Exactly, and to your point, and we've spoken about this before, very often, regardless of the company affected by a similar issue, you start by asking yourself, "Is this a problem with my infrastructure, or is it actually somewhere else?" Having this kind of view can give you a really quick answer to that question. All of these agents, at the same time as they synthetically probe the application itself, also execute path visualization probing, as part of which we check whether there's packet loss, latency, or jitter, and then visualize the path between the agents executing the test and the target itself. And as visible here, there was none of that, which indicates that this was clearly an application-related issue. There's no packet loss that would explain a significant dip in availability, which further reaffirms our theory that this was 100% related to the application itself having issues rather than the network having problems reaching the resource.
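To make that network-versus-application distinction concrete, here is a minimal single-vantage-point sketch, purely illustrative and not the Internet Insights probing itself: if TCP connects cleanly but the server answers with a 4xx or 5xx, the issue is almost certainly in the application layer rather than the network path.

```python
# Crude network-vs-application triage from a single vantage point.
# Illustrative only; real platforms probe from many agents and also
# measure loss, latency, and jitter along the path.
import socket
import urllib.error
import urllib.request

HOST, PORT, URL = "twitter.com", 443, "https://twitter.com/"

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Can we complete a TCP handshake with the target at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code, even for 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # 4xx/5xx still reached the server
        return err.code

if tcp_reachable(HOST, PORT):
    status = http_status(URL)
    if status >= 400:
        print(f"Network OK, HTTP {status}: likely an application-side issue")
    else:
        print(f"Network OK, HTTP {status}: service looks healthy")
else:
    print("TCP connect failed: investigate the network path")
```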
Mike: Yeah, that's really good. Just to emphasize your point there, it's important to be able to see that whole exchange because it lets us identify whether there's a problem on the network. If there were a network issue, maybe a high loss rate, you might have seen a break in that path. But the fact that we're seeing a 403 means we're actually getting a response, that the request is actually being executed, and combining those two layers gives us the complete service delivery chain. Because we know this was an API call; Twitter came out and said they made a change to an API and it caused these issues. So we are calling a service, which could have been a third-party service for all we know, but what we're doing is testing that functionality coming through the system. And there actually was a third party in some cases: some people reported that when they went to post or push things into another application, they couldn't do that from a feed perspective. So it's important to see that whole service delivery chain.
Kemal: And it's important to mention that the HTTP Server test is essentially an agent-to-server test, in which we probe the target from a specified set of agents, in this particular case. It's also very important to mention the two phases of the test: send and receive. Send is essentially the time it takes to send the first byte to the server, and receive is the time it takes to receive the first byte of the response. Now, if we saw the receive phase having problems, we would straightaway be able to say that something on the return path was potentially problematic. For example, we know that Internet routing is often asymmetric in nature, so the path that traffic takes on the way to the target is likely going to be different from the return path, the one from the server back to the agent that executed the test in the first place. In this case we see none of that. Receive is completely green for every single agent, which indicates that everything on the network side was working fine. It's purely an HTTP-related issue.
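To illustrate the send and receive phases described here, below is a minimal sketch using Python's standard http.client. The split points are an approximation of the concept, not the exact instrumentation of the HTTP Server test.

```python
# Approximate the "send" phase (connect plus writing the request) and the
# "receive" phase (waiting for the first byte of the response).
# Illustrative only; not the exact instrumentation of the HTTP Server test.
import http.client
import time

def send_receive_timings(host: str, path: str = "/"):
    conn = http.client.HTTPSConnection(host, timeout=10)
    t0 = time.perf_counter()
    conn.request("GET", path)    # establishes the connection and sends the request
    t1 = time.perf_counter()
    resp = conn.getresponse()    # blocks until the response headers arrive
    t2 = time.perf_counter()
    resp.read()
    conn.close()
    return t1 - t0, t2 - t1      # (send, receive) in seconds

send_s, recv_s = send_receive_timings("twitter.com")
print(f"send: {send_s * 1000:.0f} ms, receive (to first byte): {recv_s * 1000:.0f} ms")
```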
Mike: So that's great, Kemal, that was really interesting. Now, carrying on this minimalist theme, and what I mean by minimalist theme is that with the Twitter incident we couldn't actually load the full page. We could get to it, but we couldn't functionally use it. In other words, I can get somewhere, I just can't see everything that's going on.
So, unfortunately, Dish Network was forced to pare back to only its most essential online presence when it became a ransomware victim. Detection, isolation, and containment are critical parts of security incident response, and these steps were clearly followed as Dish's issues unfolded.
All right, so what we're looking at here is just the public-facing website for Dish, but this incident impacted all their internal systems, as well as some of their customer-facing ones. Basically, what they did, and there's nothing wrong with what they did, was turn everything off, or remove their connectivity. But it's kind of interesting to look at what happens.
So everything's going along quite nicely, and then, as you were pointing out in the last example (I'm looking here at a page load test), everything falls off a cliff, and we get to the point where there's no connectivity coming across whatsoever.
Now, the difference in this one is that it's network related, in the sense that connectivity was removed: I have my paths there, but when I actually go across, I don't have any connectivity whatsoever. Interestingly, and I'll go back into this, we then start to see some of the systems begin to recover. So we fall off a cliff, a light-switch moment where things go off and connectivity is removed. My understanding is that they then went through, examined their systems, and started cleaning them up, and after a period of time some of them began coming back online. From here we can see that a large number of them are obviously still having connectivity issues, but connectivity starts to return. Again, I'm only talking about this front-end system. Interestingly, when I went into the page load view, as you did, and looked at the waterfall diagram, what I saw was that I have connectivity and I'm getting in, but all that's loading is effectively a logo screen. Almost like a test card from the old TV systems, there just to test the transmission, which I thought was kind of interesting.
Kemal: And the packet loss is actually quite interesting as well. If you look at it, it lines up with the moment they started advertising again, when they came back online, which is kind of normal. Even if you go back to the HTTP Server and page load views, you can see this really step-like recovery going up. You can clearly see it there.
So essentially, imagine the amount of traffic that hit the available resources at that time; it was probably a lot. For quite some time this infrastructure was probably being hit by more requests than it could handle, until more resources got spun up, whether they were on-prem or in the cloud. But it's quite an interesting exercise: this is an example of an application that went off completely, and you can gradually figure out what was happening. There were obviously remediation steps taken, and then you can quite clearly see this step effect of "let's turn this up, then scale it up as we go," followed by full recovery. It's actually quite a good example of what happens when these kinds of events occur.
Mike: Yeah, I like that. So it wasn't just minimal online presences that we observed over the past period. We also saw incidents where harm minimization was either used or would have benefited impacted end users by limiting the duration of a disruption. In the first of these incidents, Akamai reported Edge Delivery DNS resolution errors on February 27, starting around 7:00 PM UTC with a total duration of around 20 minutes. So, Kemal, do you want to take us through what we saw?
Kemal: Gladly, my friend. This is quite an interesting event, which we observed both in Internet Insights and in a test that shows what really went on. In general, it was a short event, much shorter than the last big Akamai outage that we reported on extensively. Here, on the left-hand side, we again have regions: the United States, United Kingdom, Australia, Germany, and you can quite clearly see that we have a lot of locations. In fact, if I click on the metric called "Locations," we see that some 53 different locations across the globe were affected at various times during this event, which means this was a global outage.
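As a side note, this kind of DNS resolution failure can be spot-checked from any vantage point with a few lines of code. The sketch below uses the dnspython library and a handful of public resolvers purely as assumptions for illustration, with www.example.com standing in for whichever CDN-fronted hostname you care about.

```python
# Spot-check DNS resolution of a hostname against several public resolvers.
# Illustrative sketch; the resolver IPs and hostname are just examples.
import dns.exception
import dns.resolver  # pip install dnspython

HOSTNAME = "www.example.com"  # stand-in for a CDN-fronted hostname
RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}

for name, ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ip]
    resolver.lifetime = 5  # overall timeout per query, in seconds
    try:
        answers = resolver.resolve(HOSTNAME, "A")
        print(f"{name:10s} OK   -> {[a.address for a in answers]}")
    except dns.exception.DNSException as err:
        print(f"{name:10s} FAIL -> {type(err).__name__}")
```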
Now, that's not surprising. Akamai is one of the biggest, if not the biggest, CDN providers out there, right? They are very well known for their caching, and they are quite good at what they do, but when these things happen, they tend to affect a lot of customers, or users for that matter. If you look at the right-hand side, you can see that Salesforce, Microsoft, SAP Concur, Oracle, Walmart, U.S. Bank, SAP, and other companies were affected. In fact, there's quite a long list of them.
This was a short outage, and likely the majority of the customers affected by this particular issue didn't need to do anything to resolve it; it was auto-resolved very quickly. However, the last time Akamai had a similar outage, it lasted for hours, and very large financial customers were affected as well; we saw that in the data. And if you're a financial customer, you are going to have regulators asking you questions: What's with your redundancy? What's the fault tolerance plan? And so on. Those are all really hard questions to answer, but you're getting pressed by the regulators, right? So having a plan for these kinds of things is really important.
During the last Akamai event, I actually studied it in a little more detail, and there was quite a clear distinction between the companies that handled the event correctly and those that were affected for a few hours. The companies that responded well had a quite clear mitigation plan for this type of event, and it was fairly simple, in fact. If you think about it, Akamai not only does CDN-related work, they also provide DDoS prevention systems, right? And don't get me wrong, all other DDoS prevention companies and all other services similar to Akamai's are prone to these things. I'm not picking on Akamai here.
Mike: Just an example.
Kemal: Exactly. So essentially, what the companies that responded to this type of event correctly did was simply stop advertising directly to Akamai and instead use their regular transit, advertising to the public Internet. Yes, you're going to expose yourself to DDoS, but at least what you end up with is the ability for your customers to reach you versus—
Mike: You mean connectivity.
Kemal: Yeah, exactly.
Mike: All right, that was really interesting. We could probably go on about that for way too long, but let's get to the final one. And with that, I want to selfishly return to Australia, where Ticketek, an event ticket retailer, had its app reportedly fail to display the necessary ticket barcode, the barcode at the bottom that you scan for entry into an entertainment venue. In this case it was the MCG, and it was to watch Ed Sheeran.
So what this meant was that the tickets couldn't be validated. Now, the nature of the issue suggests it was a back-end problem, potentially connectivity related, because even e-tickets that had been downloaded previously couldn't be validated. So it was that authentication part: was this a real barcode, or was it one someone created? Tickets were showing up as void or as belonging to an expired event, basically because of that connectivity issue. So again, it was a back-end issue.
What I thought was brilliant was the response to this. When they identified what was happening, and I can't remember the exact numbers, but it was around a hundred thousand people going in to see Ed Sheeran, some of them would have printed off their tickets previously, but a lot would have had them on their phones. So what they did was effectively mobilize printers and people and have them go through and print off the tickets. Even if we're talking 20 or 30,000 people, they were able to quickly identify that they had this issue, which was the validation on the back end; assess that they couldn't resolve it in time to get people into the venue to see Ed Sheeran; and then put this process in place. And although there were complaints about people having to queue to get in, everybody got into the venue and everybody was able to see the show. I just thought that was a really brilliant example of identifying a technical issue and then implementing a remediation plan that was a physical process.
Kemal: That's awesome. I mean I'm glad that people got to see Ed Sheeran. That's the most important thing here.
Mike: The other point there, on the serious side, is that this highlights the digitization of everything that's going on. We have e-wallets, we have payment systems on the phone, and they all require this validation to go back and forth. Again, we're talking about multiple dependencies. So, back to your previous point, this visibility, being able to understand every dependency, connected or not, within that service delivery chain, is now critical. And then having a plan to mitigate issues when they arise.
Okay, so that's our show. Don't forget to like, subscribe, and follow us on Twitter @thousandeyes. As always, if you have questions; feedback (good, bad, or ugly); pop culture references, I'll take them all; or a guest you'd like to see featured on the show, send us a note at internetreport@thousandeyes.com. So until next time, goodbye.