Exploring Application Errors at Okta, Twitch, Reddit & GitHub | Pulse Update

Mike Hicks: Hi everyone, and welcome back to the Internet Report’s biweekly Pulse Update where we keep our finger on the pulse of how the Internet is holding up week over week.

And what an eventful two weeks it's been. We saw lots of 403, 503, and 504 error codes as multiple companies, including Okta, Twitch, Reddit, and GitHub, experienced application degradations and outages.

With so many interesting things to cover, let's start with “The Download,” my TLDR summary of what you absolutely need to know about the Internet this week in two minutes or less.

First, let's talk about Okta, the popular single sign-on service that many companies use to let employees easily log into the many apps they need to do their work. On March 12th, Okta experienced some issues that rendered their service partly unusable. Users could still sign in and access their usual Okta dashboards, but the dashboard didn't look like it normally did. A subset of the apps that usually display on the page didn't render properly, and as a result, users didn't have full access to some of the apps they normally use during their workday. We'll chat more about what caused these issues later, but given the critical, front-door nature of the Okta service, the incident highlights the importance of building redundancy into your tech stack and service design so that a malfunction in a critical service doesn't turn into downtime. Okta has been doing some interesting thinking on this front, which is great to see.

Several other notable companies also experienced outages this month, including Twitch, Reddit, and GitHub. The GitHub outage is especially interesting because, due to the way the problems manifested, some customers initially suspected a cloud infrastructure outage, or so it seemed from the chatter we saw on social media. However, further inspection and problem tracing identified GitHub as the responsible party. This once again highlights the importance of having good independent visibility into complex cloud-native environments, so your team can quickly and accurately discern the source of an issue and respond accordingly.

Looking at global trends, outage numbers continued the downward trend seen over the previous two weeks, with global outages dropping 33% over the two-week period. In the same period, U.S. outages dropped 38%, with U.S.-centric outages accounting for 34% of all observed outages.

Now let's dive in further. As always, I've included the chapter links in the description below so you can skip ahead to the sections that are most interesting to you. We'd also love for you to hit like and subscribe, and as always, feel free to email us at internetreport@thousandeyes.com. We welcome your feedback and questions. And to discuss all of this, I'd like to welcome back Kemal Sanjta. It's great to have you back, mate.

Kemal Sanjta: Awesome to be here. Thanks for the invite, Mike.

Mike: No problem at all. Right, so with that, let's take a look at the numbers this week. Global outages continued the downward trend seen over the previous two weeks. Initially they dropped from 271 to 247, a 9% decrease when compared to March 6-12, and this downward trend continued the next week, with global outages dropping from 247 to 181, a 27% decrease compared to the previous week.

This pattern was reflected in the U.S., with outages decreasing over the past two weeks. In the first week of this period, outages dropped from 105 to 82, a 22% decrease when compared to March 6-12. This was followed by another drop, mirroring the global trend, from 82 to 65 the next week, a 21% decrease.
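
For reference, the percentage drops quoted here follow directly from the raw outage counts above; a quick sketch of the arithmetic, using only the numbers mentioned in this episode:

```python
# Reproducing the week-over-week percentage drops quoted above from the raw
# outage counts given in this episode.
def pct_drop(previous, current):
    return round(100 * (previous - current) / previous)

print(pct_drop(271, 247))  # global, week 1:  9% decrease
print(pct_drop(247, 181))  # global, week 2: 27% decrease
print(pct_drop(271, 181))  # global, across both weeks: 33% decrease
print(pct_drop(105, 82))   # U.S., week 1:  22% decrease
print(pct_drop(82, 65))    # U.S., week 2: 21% decrease
print(pct_drop(105, 65))   # U.S., across both weeks: 38% decrease
```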

The other thing to note is that U.S.-centric outages accounted for 34% of all observed outages, which is slightly larger than the percentage observed during February 27-March 5 and March 6-12, when they accounted for only 33% of outages.
But if we look back at the previous year and compare the same months, the percentage of U.S. outages has been sitting pretty consistently around 33% for the first quarter this year, whereas last year we were consistently above 40% and then dropped coming into the end of the year. So do you think there's anything significant about that?

Kemal: Yeah, I mean, it's pretty awesome to see this downward trend, both for the global outages and the U.S.-centric ones. I think one of the things that's probably happening is that companies are achieving operational excellence far better than was the case last year, or even in earlier periods. It looks like change management procedures and operational excellence are being taken more seriously, and that's reflected in the numbers. Numbers don't lie. So this is quite nice to see, actually.

Mike: Yeah, and it's a good point you make, because the other thing is that while we've said the outage numbers have dropped over these last two weeks, if we look at the numbers year over year, the actual number of outages is growing. But we're not necessarily seeing this reflected in user disruptions. You mentioned the operational improvements, but what I'm also considering, or what we've seen, is the blast radius, the impact level. We're seeing more outages occur, but they're contained. It's almost like a lesson learned from chaos engineering, where we're starting to decrease that blast radius. So with these overall outages, we're seeing less impact on the users. Would you say that that's a fair assumption?

Kemal: Yeah, I think so. And not all outages are the same. Even if the numbers go up or down, the scope of each outage could be completely different, so that's a really important point to keep in mind. But I think this downward trend looks really good, and I hope to see it continue going forward.

Mike: Yeah, absolutely. Now, I want you to hold that thought about the anatomy of an outage because I want to come back to that. I think it's a very important point. But for now, let's discuss some of the outages from the past couple of weeks as we go under the hood.

So first I want to talk about the Okta disruption on March 12th. The outage underscores a valuable lesson about what's needed to provide that seamless user experience every company wants to give their customers: each part of the service delivery chain is critical, and it's not enough for those parts to be available; they have to be functioning properly too. So as I said, on March 12th, in some geographies, including North America, users experienced problems accessing their corporate applications when Okta's single sign-on service encountered issues. Kemal, would you like to run us through what we saw?

Kemal: Sure, gladly. So to your point, we saw the outage pretty clearly. If we look at the shared screens from ThousandEyes, we're going to focus first on Internet Insights, which showed the event pretty clearly and in detail.

So first of all, on the timeline, we can see how long the event went on. And from there, we can see details such as which servers and how many servers were affected, which geographies were affected, and so on.

So looking at this, we can see that the issues started on the 12th of March, 2023, at approximately 3:55 UTC and lasted until approximately 4:45 UTC on the same day, so almost an hour. In general, whenever you see the purple line on the timeline, we know that we observed an outage. In this particular case, we can see that 265 servers were affected, and we can see how long it went on. Down below we can see the details: first of all, the United States, Canada, Brazil, and India are all pretty much pointing towards Okta. And looking even more closely, if you look at the metric called “Locations,” we can see that 36 locations or geographies worldwide were affected, which speaks to the global nature of this particular issue.

So here, if I go back to “Servers” as a metric, you're going to see the same screen that we started from, and essentially it looks like this: regardless of how we group it, everything points towards Okta.

So to see a little bit more, I'm going to share the test that actually observed this issue in detail. What we are looking at here is essentially a page load test, which is quite a good test for figuring out what happened on the front end. Looking here again, it started at 3:55, and we can see the aforementioned purple line indicating that Internet Insights saw the outage, which lasted until 4:45 UTC. We are looking here at the page load time, and first of all we are focusing on the Seattle, Washington agent, so this is the outage from that particular region's perspective. We can see that the page load time before the issue started was 858 milliseconds for Seattle; on average it was around 961 milliseconds. However, if we click into the outage itself, we can see that the page load time actually dropped significantly, to 210 milliseconds, which looks like a four-fold improvement, right? And similar to some other outages that we've explored before, they would probably want this to be the page load time for the service in general. However, we know that this wasn't the case. This is a reflection of the issue itself.

So now if I click on the table, we can see the various agents and their page load times. We can see what was happening, but far more interesting from the perspective of this outage is what was happening in the waterfall itself. If I click here, just before the event started, I can see that all of the different objects on the Okta site were loading fine. We can see 200s, we can see whether there are any errors, we can see where each object was loaded from, and we can see that different components take different amounts of time to load. Now, this is really important in general: if you were to have an issue with some JavaScript file the site uses, this would be the perfect place to check the load time for that particular object. Unrelated to this particular issue, you could also potentially use this to figure out whether your CDN provider is having issues or not. And here we can see the various load times for the various objects on this particular Okta page. So if I click on the event, we can see that during this event the page itself was returning a 403, which is “Forbidden.” You can see here the GET request that we sent, and then we were essentially getting a 403 back.
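
As a rough, stand-alone illustration of the kind of per-object check described here (this is not ThousandEyes' waterfall, and the URL below is a placeholder), the sketch fetches a page, then fetches each script, stylesheet, and image it references, recording the status code and load time for every object. It also shows why a "faster" load time during an incident can be misleading: an error response such as a 403 typically comes back quickly because nothing useful is rendered.

```python
# A simplified imitation of a waterfall view (placeholder URL, not
# ThousandEyes' implementation): fetch a page, then fetch each object it
# references, recording status code and load time per object.
import time
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "https://www.example.com/"  # placeholder target


class AssetCollector(HTMLParser):
    """Collect script/stylesheet/image URLs referenced by the page."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.assets.append(attrs["href"])


def timed_get(url):
    """Return (status_code, elapsed_ms, body) for a single GET."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
            return resp.status, (time.perf_counter() - start) * 1000, body
    except urllib.error.HTTPError as err:  # 4xx/5xx responses still tell a story
        return err.code, (time.perf_counter() - start) * 1000, b""


status, ms, body = timed_get(BASE)
print(f"{status}  {ms:7.1f} ms  {BASE}")

collector = AssetCollector()
collector.feed(body.decode("utf-8", errors="replace"))

for asset in collector.assets:
    status, ms, _ = timed_get(urljoin(BASE, asset))
    print(f"{status}  {ms:7.1f} ms  {asset}")
```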

Mike: I just want to focus on that, because I've been calling them 403 and 503 errors, right? That's the common vernacular, but in reality they're status codes. And there are a couple of things here. What a 403 indicates to us is basically, as you said, “forbidden” coming back, so effectively we've failed some authentication, or some API call hasn't been authorized, or whatever it is. But this is really good, and it's what I keep coming back to with the anatomy of an outage. One, we see the page load time come down, and you correctly identified that, looked at in isolation, they would really be happy with this; we'd just see that number drop. But what it means to us, in conjunction with that 403, is that parts of that page aren't loading; we're not pulling in various functions within the page itself.

Kemal: That's absolutely correct, right? We can see the different objects, and that's the beauty of the waterfall as well: it exposes the granular detail of the problem once you have one. It's not just the overall issue of “I cannot load this page,” and that's it. It's about what exactly cannot be loaded on that particular page, which is significantly better information.

So essentially, for pretty much the complete duration of this event, this is exactly what we were seeing: 403s being returned. And to your point, we know that everything in the 4xx range is essentially an error. Now if I click on HTTP Server, we're still looking at this from the Seattle agent's perspective, so for the time being I'm just going to drop that filter. We can see that availability was 100% until the start of the event. The purple line again shows up, indicating the Okta outage, and then all of a sudden we can see availability at 57.1%. Essentially, what that means is that certain agents were not able to complete all the phases, the phases being DNS, the TCP three-way handshake, SSL, send and receive, and HTTP. And we can see quite clearly that in this particular case the HTTP bar is not fully green. So if I look at the table itself, I can see which agents were getting 403s throughout the event and which agents were getting 200s, and we can quite clearly see that the status code here was 403.
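
To make those phases concrete, here is a minimal sketch (with a placeholder hostname, and much simplified compared to what a monitoring agent actually does) that times each phase of a single HTTPS request: DNS resolution, the TCP three-way handshake, the TLS handshake, send, and receive, and then prints the HTTP status line that came back.

```python
# A minimal sketch of timing the individual phases of one HTTPS request
# (placeholder hostname; a real monitoring agent does considerably more).
import socket
import ssl
import time

HOST = "login.example.com"  # placeholder target
PATH = "/"

t0 = time.perf_counter()
addr = socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP)[0][4][0]  # DNS
t1 = time.perf_counter()

sock = socket.create_connection((addr, 443), timeout=10)  # TCP three-way handshake
t2 = time.perf_counter()

ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=HOST)  # TLS handshake
t3 = time.perf_counter()

request = f"GET {PATH} HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
tls.sendall(request.encode())  # send
t4 = time.perf_counter()

status_line = tls.recv(4096).split(b"\r\n", 1)[0]  # receive (first bytes of the response)
t5 = time.perf_counter()
tls.close()

for phase, start, end in [
    ("DNS", t0, t1),
    ("TCP connect", t1, t2),
    ("TLS", t2, t3),
    ("Send", t3, t4),
    ("Receive", t4, t5),
]:
    print(f"{phase:12s} {1000 * (end - start):7.1f} ms")
print("HTTP status line:", status_line.decode(errors="replace"))
```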

Now, if we look at the path visualization for this particular issue, we can again see the purple line indicating the outage. During the outage, we saw intermittent spikes in loss of up to 1.4% or so. Yes, loss is bad, we know that, but it clearly wasn't the issue in this particular case. First of all, this is the average loss across all of the agents added to this test, meaning it's a very small amount of loss. And second, we don't see it consistently, which is what you'd expect if it were the explanation for the event itself. So it's safe to say that we can rule out the network, and typical network-related issues such as packet loss and latency, as the root cause of this particular event. So yes, this was 100% an application-related issue. We saw it inside out: we can see that the traffic is making it there, based on the fact that send is fully successful, and we can see that receive is fully working, which means we're getting traffic back. So everything from the networking perspective worked completely fine. However, the application front end, or the application itself, actually had a problem dealing with the requests.

Mike: And there are a couple of things I want to drill in on here. This is consistent, obviously, with what Okta came out with. We're seeing that it wasn't actually impacting all of the sites; not all of our agents' tests were affected. And what Okta reported was that it affected a number of their cells, which is essentially the part of their infrastructure that customers actually connect to. So that's one aspect of it.

The other thing that struck me when you were talking about the loss rate we saw there: we're talking about roughly 1%, and it's all about context. As you said, this is averaged across all of the agents, across all the tests we're looking at, and in the scheme of things 1% is very low. But if I'm looking at what appears to be a large peak without taking into account the context of what I'm looking at, it's quite easy to head down the wrong path.
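
To illustrate that context point with made-up numbers (the agent names and per-agent values below are hypothetical; only the roughly 1.4% average echoes the figure mentioned above): a small average across many vantage points can hide the fact that the loss is confined to one location, or simply be noise, which is why it rarely explains an application-wide failure on its own.

```python
# Hypothetical per-agent loss samples (illustrative only): the average works
# out to about 1.4%, but only one vantage point actually sees any loss.
agent_loss = {
    "Seattle": 0.0,
    "London": 0.0,
    "Singapore": 0.0,
    "Sao Paulo": 5.6,
}

average = sum(agent_loss.values()) / len(agent_loss)
print(f"average loss across agents: {average:.1f}%")  # 1.4%

for agent, loss in agent_loss.items():
    print(f"  {agent:10s} {loss:.1f}%")
```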

So I guess the point I'm trying to drive at is that it's not just about gathering the information and saying, “oh, here it is, we know there's an outage.” It's being able to layer on top of that, and this is where the skill of someone like yourself, using the ThousandEyes data, comes in: putting those two things together to add the human context, the intelligence, to what we're seeing, so we get to true correlation and causation.

The next set of outages all had a similar theme: they impacted the user experience despite the main service remaining available throughout, and in the case of Twitch, Reddit, and GitHub, users appeared to experience content loading issues.

So let's start with the Twitch outage. On March 3rd, some Twitch users experienced issues accessing video-on-demand streams, which is essentially the core of the service. The issues, in Twitch's words, prevented some services from loading. Impacted users would have been presented with a timeout and a black screen when trying to access a stream. So the service was reachable, but it was unusable, or at least some parts of it were, and in this case it was a pretty central part that was unusable.

We often talk about the fact that the Internet has no SLAs, which is absolutely true. But even if it did, this particular case probably wouldn't have constituted a breach, because the component parts were all available. People could actually go on and see that the system was up. It was the interaction of the composite components, the functions, that failed. And this goes back to the point we keep making: all the components within that service delivery chain (you talked about the network, you showed the paths, we talked about the context) have to be operating and communicating. Every cog needs to work; we need everything there to get that smooth function happening.

Kemal: Exactly. Just because it's available does not mean that it's working, right? So again, we can speak about the granularity of the testing. For example, while you were speaking about the nature of that outage, I was thinking about testing the workflow, potentially with our transaction tests or something like that, where we can actually go a little bit deeper than knocking on the door and test whether the different components of the service are working. So yeah, that's really important.

Mike: It is. It's that functional performance testing that we want to talk about there.
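
As a rough sketch of that "go deeper than knocking on the door" idea, here is what a scripted transaction check could look like using Playwright for Python (an assumed stand-in for illustration, not ThousandEyes' transaction scripting environment; the URL and selector are placeholders): instead of only confirming the page responds, it waits for the element that proves the feature users actually care about has rendered.

```python
# A rough sketch of a scripted "transaction" check using Playwright for Python
# (an assumed stand-in, not ThousandEyes' transaction tests; URL and selector
# are placeholders).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Step 1: the "knock on the door" - does the page respond at all?
    response = page.goto("https://www.example.com/some-stream", timeout=30_000)
    assert response is not None, "no response received"
    assert response.ok, f"page returned HTTP {response.status}"

    # Step 2: go deeper - did the component users care about actually render?
    # (hypothetical selector for a video player container)
    page.wait_for_selector("#video-player", timeout=15_000)

    browser.close()
    print("transaction passed: page responded and the player rendered")
```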

Let's move on to the Reddit outage. Users had a similar experience with Reddit on March 14th, which is the eve of the Ides of March for you Julius Caesar fans out there. Such a niche reference, but that's good.

Starting at approximately 19:05 UTC, ThousandEyes observed an outage impacting global users of Reddit. And as we'd observed for Twitch, and as you showed us going through the Okta incident, the network paths to Reddit's web servers, hosted on the CDN provider Fastly, were clear of any issues, and essentially the site was reachable.

The fact that this had a global reach pointed to an application issue straight away, and a quick view of other services using Fastly confirmed that it was indeed an application issue rather than anything else.
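
A minimal sketch of that cross-check, with placeholder comparison URLs (it assumes you already know a few unrelated sites fronted by the same CDN, and it's a heuristic rather than proof): if the other sites respond normally while the affected one returns errors, the shared CDN is unlikely to be the common cause.

```python
# A minimal sketch of the "check other services on the same CDN" heuristic.
# The comparison URLs are placeholders; pick sites you know are fronted by
# the same provider.
import urllib.error
import urllib.request

SITES = [
    "https://www.reddit.com/",     # the affected site
    "https://www.example-a.com/",  # placeholder: another site on the same CDN
    "https://www.example-b.com/",  # placeholder
]

for url in SITES:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{resp.status}  {url}")
    except urllib.error.HTTPError as err:  # got a response, but an error status
        print(f"{err.code}  {url}")
    except OSError as exc:                 # DNS, connect, or TLS failure
        print(f"FAIL  {url}  ({exc})")
```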

And I found this interesting: again, it wasn't that the site wasn't reachable, and it wasn't that the application was down; some content simply appeared not to be loading. What this highlights to me is that when you're logging on to a service, it appears seamless. As a user, I consider it a monolithic service: I'm hitting the front page, I'm accessing it, and I'm not aware of everything that's going on in the backend. But in fact, we rely on multiple dependencies, both connected and what I call non-connected dependencies, things like the BGP you mentioned and those sorts of underlying services. All of these need to work together to provide the full functionality that delivers the user experience. As you said, it doesn't matter if it's available if it ain't working.

Kemal: Yeah, exactly. And the paradigm for how we build these applications has shifted significantly. Historically, as you pointed out, they were pretty monolithic, large applications, and then they switched over to this model where there are so many different microservices involved. Making sure that all the cogs, as you said earlier, are working is actually quite instrumental to how the complete service operates.

And that's increasingly hard. The more services you have, the more the complexity increases, and the harder it gets to make sure that all the cogs are working the way they should. So to your point, it's really important to monitor all of these, actually expose the various critical components, and ensure that we go a little bit deeper than the first knock on the door.

Mike: All right, so let's move on to the last outage of the day, which is GitHub. This one was actually on the Ides of March, March 15th, when GitHub users encountered difficulties trying to use Actions, Packages, and Pages: GitHub's platforms for continuous integration and continuous deployment, for hosting and managing packages, and for hosting websites, respectively.

Now, the way this problem manifested caused some customers to initially hypothesize on social media that it was a cloud infrastructure outage. But again, coming back to the context, the anatomy of an outage that we've talked about throughout: a quick check of the paths and the common services involved showed that was actually unlikely. We were looking for something common to all of it, and the common point here was that we could reach the system; it was all coming from the application side itself.

Unlike the Okta incident, where users could still access the service to some extent (we could actually get onto it, some pages were there, and as I said, if you knew the application you were going to, you could get to it even without the icon), GitHub users were unable to reach the service altogether, and GitHub confirmed that user requests were simply timing out.

GitHub did provide regular status reports during the degradation and the actual event, which is great, because one of the things we talk about is the need for constant information: I want to be able to understand what's going on, even if it doesn't immediately solve my performance problem. And I'll apologize, my dog's chiming in because she thinks the degradation is really important. What she's reminded me of is that all these composite parts need to work together, so this is really important. This is yet another example where we had a single point of failure, which happened to be in the application, but it rendered the whole thing essentially unusable. And this was something in the backend, where we couldn't make that connection come through. So again, as I said, everything in this service delivery chain has to be operating smoothly together to deliver the service.

Kemal: Awesome. Just to close this section: my friends on social media, mostly developer friends, were posting “Day Off Provided by GitHub” kinds of jokes because they could not do their work, which speaks to the severity of these kinds of events, right?

Jokes aside, it's a pretty significant event. Imagine the scale of the engineering and the number of engineering hours that are, unfortunately I have to say, wasted when it comes to events such as this one.

Mike: Well, thanks Kemal. As always mate, it's been an absolute pleasure.

Kemal: Thanks for the invite and it was my pleasure being on here.

Mike: So that's our show. Don't forget to like, subscribe, and follow us on Twitter @thousandeyes. And as always, if you have any questions; feedback—good, bad, or ugly; or guests you'd like to see featured on the show, send us a note at internetreport@thousandeyes.com. So until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com