How Outages Can Impact Distributed Dev Teams | Pulse Update

MIKE HICKS: Hello everyone, welcome back to the Internet Report’s biweekly Pulse Update. This is a podcast where we keep our finger on the pulse of how the Internet is holding up week after week.

Today we're going to be talking about software development, what's known as a follow-the-sun model, and the ripple effects regional outages can have on other geographies. We're going to look at some incidents that happened at GitHub, Apple, and Google Cloud. They're not necessarily all related, but I want to use that as a theme running through the episode.

So let's start with “The Download,” which is a quick TLDR to understand what's happening on the Internet in two minutes or less.

So in recent weeks, we saw incidents at three major tech companies: GitHub, Google Cloud (we covered that one in a previous episode, but I want to revisit it because there have been a few updates), and Apple.

We're going to discuss all three further in this episode, but I really want to highlight the GitHub service degradation, as it raised some interesting questions about the refractive impact of regional outages and the effect they can have on other geographies. This is partially due to the follow-the-sun model of software development, which has been embraced by many large companies.

Now, I don't know for certain whether this came into play with GitHub specifically, but what we observed suggested it might have been a factor. Regardless, I think it's a valuable topic to consider since it's now standard practice at many companies.

So as we know, there's no good time to have a global degradation or outage, but it can be something of a relief when it occurs outside your local business hours. What we're starting to see, though, is that regional outages can have a refractive effect on other geographies, an effect that may be heightened by the follow-the-sun model.

When we talk about follow the sun, what we're really talking about is doing work outside of business hours for a given region, taking advantage of the multiple time zones around the world, so that the work doesn't impact a region that's in the middle of its business day.

For example, many large U.S. companies, and indeed large companies throughout the world, have application development teams in India. If those teams are doing some work and an outage occurs, then because of the dependencies associated with that particular piece of work, the problem can either roll on to dependent services or stall other work and become a blocker for other parts of the development effort. So while a change may appear to pass its tests in isolation, it can have a detrimental effect on the holistic performance of the overall service.

The refractive impact of regional outages is actually supported by our own data. Year over year, we consistently observe that between 70% and 90% of these outages occur outside U.S. business hours. One might think those outages would therefore have limited impact on U.S. businesses and customers, but what we're seeing is that their blast radius often extends beyond U.S. boundaries, impacting other geographies that are in their business hours. Down here in Australia, I'm roughly 12 hours offset, so anything that occurs in that window flows right into our business hours and has this ripple-on effect.

So for organizations that use the follow-the-sun model, it's really important to have holistic visibility. What we mean by that is being able to see the complete service end to end, along with all of its dependencies, so that you can understand what effect a change in one area will have on users elsewhere. That way we're eliminating or reducing single points of failure and addressing them before they impact users in other regions. This also comes into play when we talk about disaster recovery and business continuity plans.

So I'm going to pause there. I'm excited to discuss this subject and the three incidents further. As always, we've included chapter links in the description box below so you can skip ahead to the sections that interest you most. We'd love for you to hit like and subscribe, and if you have any questions, please email us at internetreport@thousandeyes.com. We always welcome your feedback and questions.

And to discuss all of this, I'd like to welcome back Brian Tobia, the Lead Technical Marketing Engineer here at ThousandEyes. Brian, as always, it's great to have you on.

BRIAN TOBIA: Thanks for having me back, Mike. Glad you're not tired of me yet.

MIKE: Never tired of you, mate. All right. So we're excited to discuss these outages. And before we do that, let's take a quick look at the overall outage trends we've been seeing.

Okay, so now to my favorite part of the podcast, where we look at the numbers over the past couple of weeks. These are some really interesting ones, not that they aren't always interesting. Global outage numbers initially rose from 310 to 574, a significant 85% increase compared to May 1-7. This was followed by an equally significant drop, with global outages falling from 574 to 174, a 70% decrease from the previous week.

This pattern was reflected in the U.S., where outages initially rose from 175 to 231, a 32% increase compared to that week of May 1-7. U.S. outage numbers then dropped significantly from 231 to 71 the next week, a nearly 70% decrease.
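
For anyone who wants to check the math, here's a quick worked sketch of the percentage swings quoted above; the before-and-after counts are taken straight from the episode.

```python
# Quick check of the percentage changes quoted above.
def pct_change(before, after):
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100

print(f"global rise: {pct_change(310, 574):+.1f}%")  # +85.2%, the ~85% increase
print(f"global drop: {pct_change(574, 174):+.1f}%")  # -69.7%, roughly a 70% decrease
print(f"U.S. rise:   {pct_change(175, 231):+.1f}%")  # +32.0%
print(f"U.S. drop:   {pct_change(231, 71):+.1f}%")   # -69.3%, again roughly 70%
```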

If we look at these numbers, what I want to drill into is that sharp 85% increase in global outages during the week of May 8-14, which represented the highest number of global outages observed in a single week this year. Why is that increase significant? When we investigate the source, distribution, and impact of these outages, it turns out the increase was very localized in terms of reach and seemed to have a low to negligible impact on the majority of the global community. So while it is a big number, if we take the outlier region driving it, in this case APJC, and refactor those figures back to typical levels, the numbers, while still high, start to look quite seasonal. What I mean by that is, if I go back year over year around this same time span, we see a similar increase in outages. And as we've been discussing with follow the sun, they're occurring outside of business hours for their local region and they're regionalized, so we don't see a ripple-down effect, which is why we don't see a global impact. It seems to coincide with this time of year; call it something like a spring clean, where teams use this period to carry out maintenance, do upgrades, whatever it might be. We don't know for certain, but what we do know is that these outages are occurring and having the negligible effect we observed.

Now, another interesting trend we saw in the past few weeks is that U.S.-centric outages accounted for 40% of all outages observed. That's down from the 52% observed between April 24 and May 7, but it continues the trend in which U.S.-centric outages have accounted for 40% or more of all observed outages. Why this becomes interesting: if we reflect back on the previous year, at the start of that year we saw U.S. outages accounting for 40% or more of all observed outages. But as the year went on, they dropped below that 40% mark, where we consistently saw them at 30% or lower, before climbing back to sit around 40% by the end of the year.

This year has flipped around so far: we started the year with lower percentages, around 30% to 35%, and in the most recent period we've seen them at 40% or greater. So it'll be interesting to see how that continues over the year, whether they keep growing or drop back down again.

BRIAN: It's interesting to see them at a macro scale too, like you were saying before with follow the sun. Because the outages are so spread out, you may not notice them and might not think there are that many, but when we look at them from a global scale, we see the numbers going up. It's also interesting how that trend looks a little different depending on whether you're only looking at whether your application is unavailable or, as we do, at the entire Internet.

MIKE: Yeah, that's actually a really good point. Exactly what you said there about follow the sun: if I'm only looking at impact in my own area, what about the third-party dependencies outside it, and how are they affecting the overall service? You're 100% right, this is where you want that holistic perspective.

Cool. I could keep going through the numbers, I'm happy to talk about them all day, and probably do to most people, so I apologize for that. But now let's discuss some of the outages we've seen over the past couple of weeks as we go under the hood.

So the first incident we want to discuss probably wasn't an outage so much as a service degradation, but it has some really interesting points. On May 9, GitHub experienced a service degradation that was global in scope but had some really interesting regional aspects I'd like to discuss. The issue manifested itself as service unavailability. As we've seen in the past, you could still reach a GitHub repository or the GitHub site itself, but there were certain back-end services you couldn't get to.

We were seeing 500 errors and similar responses. The reason we call this a service degradation rather than an outage is that the service was still reachable, we could connect to it, but it wasn't working consistently. Because these errors were coming back, if I was trying to access something, I couldn't reach that back-end service on the first attempt, but if I retried, I could often get through.
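
To make that behavior concrete, here's a minimal sketch, not GitHub's actual setup, of a client that retries when it gets intermittent 5xx responses back; the endpoint, attempt count, and backoff values are illustrative assumptions.

```python
# Illustrative retry probe; the URL, attempt count, and backoff are assumptions, not GitHub's config.
import time
import urllib.error
import urllib.request

def probe(url, attempts=3, backoff=2.0, timeout=10):
    """Fetch a URL, retrying on 5xx responses the way a client might during a partial degradation."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status  # a 2xx here means the back end answered this time
        except urllib.error.HTTPError as err:
            if 500 <= err.code < 600 and attempt < attempts:
                time.sleep(backoff * attempt)  # back off and retry; intermittent 500s often clear
                continue
            return err.code  # give up and surface the error code
        except urllib.error.URLError:
            if attempt < attempts:
                time.sleep(backoff * attempt)
                continue
            raise

print(probe("https://api.github.com/"))  # any HTTPS endpoint works for the demo
```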

So drilling into that, Brian, would you be able to show us how that manifested itself and what we actually saw?

BRIAN: Yeah, absolutely. We can take a look. All right, so I'm sharing my screen for all you audio listeners. We're looking at a share link, which is a snapshot of data taken from the ThousandEyes platform, that we'll use to talk a little bit about this outage. If you're not familiar, ThousandEyes is a central place where you can get visibility across all of your networks and applications. It's easy to set up tests, and as you'll see here, we have many different monitoring points across the globe we can run tests from. We'll look at some of the data coming back from those tests, and the platform lets you visualize it to really understand any performance or network bottlenecks.

This is a quick view of a share link we captured during the disruption, this one from Internet Insights. It shows the application, GitHub, on the right, as Mike was mentioning, and then all the different locations we saw the disruption from. If I scroll up on the timeline, you can see we caught this data on the 9th. It's pretty interesting to see it at global scale: this wasn't a single network unable to reach GitHub, or a single ISP having an outage. This was disruption across the entire spectrum of measurements we're taking. Then we have another share link, which I'll bring up here, where you can see the details of the network and the service status like Mike was mentioning.

MIKE: One point before we move off that one, Brian: it was interesting from my perspective that if you drill into the hosts there, and remember GitHub was acquired by Microsoft, we saw the service hosted on both GitHub and Microsoft infrastructure, and the disruption affecting both. So like you say, it wasn't just affecting local or GitHub infrastructure; where some of the workloads have been moved onto Microsoft infrastructure, we saw it impacting those too.

BRIAN: Yeah, that's a great point. I think it speaks to your earlier point about understanding service dependencies: what infrastructure a service is hosted on and what other pieces it depends on where we're seeing that outage too. Maybe a rolling change happened that affected other portions of the infrastructure. I think that's an interesting thing we can definitely see here.

All right, so now we're looking at a share link for the detailed test we ran against GitHub. What's interesting here is I've started us on the Network view. You can see from the timeline that we're looking at the purple period, which is when we actually detected the outage. And from the Map view on the network side, everything looks green, right? Everything looks good from a global standpoint, and you can see all the different testing locations we have.

If we go over to the Table view, we can see whether there's any packet loss or latency associated with any of those connections, and there really isn't. So that answers the question of whether I can get there on the network side, and as you mentioned before, that part is perfectly fine. I could even load the website, but it's when I started trying to do something that the service components began returning errors. And we can see that.

With the Network view looking okay, we can transition over to something like the HTTP Server view, and that shows us a different story. As Mike was mentioning, it's the individual components that are actually having problems. Even though you could get there on the network side, from this view we start to see disruption: agents unable to load the specific page, or not getting HTTP responses back. If I go to the Table view, we start seeing the timeouts that were occurring. So that's how this manifested itself: not so much a network outage, but components not responding to requests properly, and that's what we're seeing within this share link.
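
As a rough illustration of the layered view Brian describes, here's a small sketch that checks the network path (a TCP connect) separately from the HTTP layer; during this kind of degradation the first check can pass while the second fails. The host and URL are just examples.

```python
# Layered health check: network reachability vs. HTTP-layer health. Targets are illustrative.
import socket
import urllib.error
import urllib.request

HOST, URL = "github.com", "https://github.com/"

def tcp_reachable(host, port=443, timeout=5):
    """Can we complete a TCP handshake? Roughly what a network-layer view answers."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_healthy(url, timeout=10):
    """Does the service return a successful HTTP response? Roughly the HTTP-layer view."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False  # 5xx errors and timeouts both land here

print("network reachable:", tcp_reachable(HOST))  # often True during a service degradation
print("http healthy:     ", http_healthy(URL))    # this is the layer that was failing intermittently
```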

MIKE: That's really cool. There are a couple of things here. First of all, the actual disruption lasted around 25 minutes. Looking at that, we can say, “okay, yeah, that's obviously disruptive,” but it occurred in the early hours of the morning, before North American business hours. So the U.S. region was largely outside its business day, yet because the service is a global one, users elsewhere were still heavily impacted.

Now, we said this wasn't a network issue, and I'm going to make a jump here. Not an assumption exactly, and we're not implying this was any sort of software development change or software push. But consider the processes involved: if we're in an agile development mode running a CI/CD workflow, that is, a continuous integration, continuous deployment workflow, what we do is build our module and push it out.

This obviously differs from when I started, and I keep saying on this podcast that I'm very old, when we dealt with a monolithic architecture and a single branch of code. We'd take a waterfall approach, make the changes, and do one big push all at once. But because we're agile now and we break things up into modules, I can write unit tests for my particular bit of code and then push it out, effectively at any time of day. Obviously I'm trying to time this, and we've talked about this concept of follow the sun and distributed development teams. So consider that I do my push and everything looks green on my side. I'm doing everything right, but that one module has dependencies outside of it. This is what we're talking about here, or potentially here: I make a change to a very small element that isn't supposed to have an impact, but the effect flows outward and we start to see a global impact.
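
As a hypothetical illustration of how a change can "look green" in isolation, here's a sketch of a module whose unit tests pass with the downstream dependency mocked out; the service names and endpoint are made up, and nothing here is taken from the GitHub incident itself. Only an end-to-end check against the live dependency would catch the kind of problem described above.

```python
# Hypothetical example: unit tests pass even while a real downstream dependency is failing.
import unittest
from unittest import mock

def format_invoice(amount_cents):
    # The module under test: pure logic, no external calls.
    return {"currency": "USD", "amount": amount_cents / 100}

def submit_invoice(session, amount_cents):
    # The hidden dependency: a downstream service that may live in another region.
    resp = session.post("https://payments.internal.example/invoices",
                        json=format_invoice(amount_cents), timeout=5)
    resp.raise_for_status()
    return resp.json()

class UnitTests(unittest.TestCase):
    def test_format_invoice(self):
        # Passes regardless of what the downstream service is doing.
        self.assertEqual(format_invoice(1250), {"currency": "USD", "amount": 12.5})

    def test_submit_invoice_with_mocked_dependency(self):
        # Also passes: the dependency is mocked, so a real outage never shows up here.
        session = mock.Mock()
        session.post.return_value.json.return_value = {"id": "inv_1"}
        self.assertEqual(submit_invoice(session, 1250)["id"], "inv_1")

if __name__ == "__main__":
    unittest.main()
```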

BRIAN: Yeah, and it's interesting too that, as we see here, some agents were responding fine and some weren't. You can sometimes even infer what the rollout schedule was, for instance a component change being rolled out by geography or across a few areas. Looking at the timeline of which agents were responding and which weren't, you can get an idea of what the plan was or what was actually happening. So it's pretty cool to see that kind of view.

MIKE: It is, and that's a good point. If we think back, there was a Microsoft outage, and I think we praised their handling of it at the time, and I still do. You can apply these follow-the-sun strategies right through recovery: I understand where my users are and where I want to focus my priority to bring something back up. Outages occur, incidents occur, we're doing these pushes; that's the reality.

But when something happens, how do I bring it back online with the least impact to my users? Obviously, if everyone in one region is asleep, then okay, let's not prioritize them first; let's bring service back where users are awake. And as you say, it's really interesting to watch recovery happen that way when a follow-the-sun strategy is applied. It's almost like those world time-zone maps where you can watch daylight move across the globe: you see the same patterns in the recovery. It's very cool.

BRIAN: Yeah.

MIKE: The other aspect I want to touch on briefly is this: if we are going to adopt a follow-the-sun policy, then, thinking back to what we discussed with the numbers, the time of day these outages occur means they're localized and don't have a filter-down effect. I don't want to call it a domino effect because that has a different connotation; what I'm describing is an outage that radiates out. So if I make a change, I want to know how big the blast radius, the impact zone, is. It's very important to be able to understand holistically what's going on: if I make a change here, how does that impact everything else? One of the things we talk about quite often, and I will pause for breath, is something happening on the other side of the world that you're not aware of but that might have a direct impact on you. That becomes critically important when we're talking about a follow-the-sun approach.
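
One way to picture the blast-radius idea is as a walk over a dependency graph: given a map of who depends on whom, find everything downstream of the component you're about to change. The sketch below uses made-up service names purely for illustration.

```python
# Minimal blast-radius sketch over a made-up dependency map (service -> things it depends on).
from collections import deque

DEPENDS_ON = {
    "web-frontend": ["auth", "api-gateway"],
    "api-gateway": ["repo-service", "search"],
    "repo-service": ["storage"],
    "search": ["storage"],
    "auth": [],
    "storage": [],
}

def blast_radius(changed, graph):
    """Every service that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from the changed component up to its consumers.
    consumers = {svc: set() for svc in graph}
    for svc, deps in graph.items():
        for dep in deps:
            consumers[dep].add(svc)
    affected, queue = set(), deque([changed])
    while queue:
        for upstream in consumers[queue.popleft()]:
            if upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    return affected

# A change to "storage" reaches everything that sits above it.
print(blast_radius("storage", DEPENDS_ON))  # {'repo-service', 'search', 'api-gateway', 'web-frontend'}
```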

BRIAN: Yeah, I think it's a good point. It's a good strategy and definitely has advantages, but it highlights the importance of global monitoring: understanding the different rollout plans and where those changes might have an effect, and also making sure you're monitoring third parties and individual components. On their own they might not seem like a big deal, but if you're not monitoring them and they affect other portions of the service or bring the whole app down, that's critical. So I think it just highlights the importance of a good monitoring strategy.

MIKE: Yeah, thanks Brian. That's a great point.

Aside from this GitHub incident, in recent weeks two other major tech companies experienced disruptions that I think are worth briefly touching on.

First, the Google Cloud outage. In our last episode, we discussed the water intrusion incident at the Paris data center that caused problems for Google Cloud's europe-west9 region, which is home to three availability zones: a, b, and c. I wanted to briefly revisit that today to give a few updates.

The company has published a preliminary incident report explaining the sequence of events. They say a water leak in a cooling system led to a battery room fire. The leak was initially confined to europe-west9-a, but because the subsequent fire required all of europe-west9-a and a portion of europe-west9-c to be temporarily powered down, we effectively lost all the availability zones, which is what we saw and talked about at the time.

While the region was partially unavailable, it was mainly the regional services that were affected. Again, this comes back to follow the sun: if my workloads were reliant specifically on that one region, I would have lost everything. But if I was able to move them around, then as Google brought the other zones back, customers in zones b and c could come back online. Customers also started moving workloads across, because at the time we're talking about, zone a was still suffering outage problems and wasn't available.

A few days after this report was released, customers in zone c also experienced a nearly five-hour outage of multiple cloud products, though it's unclear whether it was related to the recovery processes. As I said, workloads are being moved across because europe-west9-a is still down, and making sure everything is migrated might even be part of it, but there's no definitive evidence to show that.

Like I said, it was good that they came out with their report, and it confirmed what we saw. But the thing we have to consider, with zone a still down, is that ultimately our software runs on physical infrastructure, and that infrastructure is going to be impacted by outside conditions.

The point I want to make is that there are outside environmental conditions that can have a direct impact, and you need to be aware of them and take them into consideration. Down here in Australia, for example, we have tornadoes that can take out some regional areas. In those cases you can look at a weather map, see something occurring, and know that's the cause of the outage. So when we're talking about disaster recovery plans, which we discussed a lot last week, don't just consider where your data centers are located; consider all the external conditions that could come into play. Obviously you couldn't foresee that water leak, but things like this happen, and you need to be able to understand what's going on and have visibility into everything that's happening.

The last incident we want to talk about is some Apple authentication issues. On April 11, Apple experienced a 43-minute outage that impacted iCloud account access and sign-in. The company briefly acknowledged the issue in a status advisory, saying that for some users the service may have been slow or unavailable, and users in the U.K. were among those who reported being impacted. So again, we're coming back to time of day and looking at what's happening from a regional perspective.

One thing I want to highlight here is that this again points to the concept of a single point of aggregation: all the components need to work together. We've talked a lot about DNS, but in this case it's some sort of authentication issue, and if I can't authenticate, I can't use the service. So this speaks to the whole idea of a service delivery chain and understanding all of its components, and, as you said, Brian, those external outliers that can have a direct impact even though you don't necessarily consider them part of the chain.

BRIAN: Yeah, it's interesting. One example I've seen in the past was a customer experiencing an outage who had their authentication tied to a third-party single sign-on provider. They weren't considering that: their application was having no problems at all, but because of a third-party outage they couldn't reach the single sign-on service and weren't able to log in. It was a component they hadn't considered, but it wouldn't let them into the application. So to your point, understanding the whole chain and which dependencies or third parties are involved is really important when you're asking whether your application is up or not, and why.

MIKE: Yeah, absolutely. Another interesting part of this is that, as I said, it was localized, apparently impacting the U.K. The other thing I want to emphasize is that it wasn't necessarily an outage; it was a delay. If we look at what the status codes were saying and the chatter that came out, it was actually a degradation, so the service wasn't necessarily unavailable. What that may also mean is there was a backup system in play: I go to authenticate against one system, it times out and I can't reach it, but there's some redundancy or an active-active setup, so maybe I fail over to authentication systems in a different region. That gives me a delay in authentication rather than stopping it entirely. So to your point about understanding all my components, if I understand where the components are, I can have backup or contingency plans to make sure I maintain that connectivity and that service.
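
To illustrate the kind of failover Mike is speculating about, here's a hypothetical sketch where a client tries a primary authentication region, times out, and falls back to a backup region, so the user sees a delay rather than a hard failure. The endpoints, timeout, and regions are assumptions for illustration only, not Apple's architecture.

```python
# Hypothetical regional failover for authentication; endpoints and timeouts are made up.
import time
import urllib.error
import urllib.request

AUTH_ENDPOINTS = [
    "https://auth.eu-west.example.com/token",  # assumed primary for U.K. users
    "https://auth.us-east.example.com/token",  # assumed cross-region backup
]

def authenticate():
    """Try each auth region in order; return (status, seconds elapsed) from the first that answers."""
    start = time.monotonic()
    for url in AUTH_ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                return resp.status, time.monotonic() - start
        except (urllib.error.URLError, TimeoutError):
            continue  # this region failed or timed out; fall through to the next one
    raise RuntimeError("all authentication regions failed")

# If the primary times out after ~3 seconds and the backup answers, sign-in still succeeds,
# but the user experiences the slowness described in the status advisory rather than an outage.
```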

BRIAN: Yeah. As odd as it sounds, sometimes it's easier when the network is just totally down, because you know exactly what's happening. But when you have a component failing somewhere along the way, it's a lot harder to track down. Or, like you're saying, if it's recovering via backup systems, that can be super hard to track down too. So having it all mapped out is really important.

MIKE: Absolutely, yeah. Well, thanks, Brian. It's been an absolute pleasure. Always great to have you on the podcast.

BRIAN: I appreciate it. Happy to be on.

MIKE: So that's our show. Please remember to like, subscribe, and follow us on Twitter @ThousandEyes. And of course, if you have any questions, feedback, or guests you'd like to see on the show, please send us a note at internetreport@thousandeyes.com. If you want to connect with me in person, I'll be at the Cisco Live conference in Las Vegas from June 4-8. I'd love to have you stop by the ThousandEyes booth just to say hi. I'm happy to chat more about Internet health and, of course, always happy to talk about outage trends, numbers, and networking in general. We've included the link to register in the description box below, so definitely check it out. Until next time, thank you and goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com