Redundancy in the Cloud Era: Two Case Studies | Pulse Update

MIKE HICKS: Hi, I'm Mike Hicks. I'm the Principal Solutions Analyst here at ThousandEyes. And welcome back to the Internet Report’s biweekly Pulse Update. This is a podcast where we keep our finger on the pulse of how the Internet is holding up week over week.

This episode, I want to talk about redundancy and two incidents, one at Google Cloud and one at Microsoft 365, that reinforce its importance and help us talk about the evolving strategies for achieving it.

Before we get into that, let's start with “The Download,” which is a quick summarization—a TLDR, if you will—of what happened in the Internet in the past two weeks.

When it comes to technology strategy, it's a good idea to have more than one way to access every resource, just in case something happens. As IT environments have changed, so has the thinking around how to achieve this. If we go back in time, redundancy and disaster recovery plans were built around brick-and-mortar data centers. You could have dual data centers with diverse cabling coming into each site, so traffic could be split away if one went down. Redundancy meant having separate data centers with separate power supplies. And if one site was lost, an organization could either switch over to the other or run them in parallel, so you had some capacity until the primary site was restored.

Now, if you think about where we are today, redundancy means architecting applications and workloads to run across multiple availability zones. But more than that, what we've seen change is the proliferation of SaaS applications, distributed architectures, and distributed users. So there's more than one aspect to consider. Where an application was once hosted in a single data center, we now have to contend with a really distributed environment, which means we need all these different contingency plans and a real understanding of what's happening across it.

So the key point is that we live in a 24/7 world, and what you want is continuous service without interruption. You want to keep downtime to a minimum, you've got to have plans and workarounds ready, and obviously visibility is key.

In recent weeks, we've seen two incidents that reinforce this importance and the need for evolving strategies. They really brought home how we need to think about disaster recovery planning and redundancy. The two we're talking about: one was in Google Cloud, the other in Microsoft 365.

We'll dive in and explore what happened in each of these outages. As always, there are chapter links in the description box below, so you can skip ahead to the sections most interesting to you. Hit like and subscribe, and as always, email us at internetreport@thousandeyes.com. We always welcome your feedback, questions, and suggestions.

Before we get into this, I want to take an overall look at the outage numbers and trends this week. I want to introduce my good friend Kemal Sanjta. It's a great pleasure to have you on board, mate.

KEMAL SANJTA: Thanks Mike, it’s awesome to be here.

MIKE: So let's take a look at the numbers for this week. This is actually my favorite part of the podcast, where we go through the patterns and what we're seeing. If we look at global outages, they initially dipped, dropping slightly from 239 to 213, an 11% decrease compared to the week of April 17-23. What followed was a significant rise: outages jumped from 213 to 310, a 46% increase over that previous week. This pattern was reflected in the U.S., with outages initially dropping 13%, from 109 to 95, before rising significantly again, this time an 84% increase in U.S. outages.
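For anyone who wants to sanity-check those week-over-week figures, here is a tiny sketch of the percentage-change arithmetic in Python, using the numbers quoted in this episode. It's just generic arithmetic, not a ThousandEyes tool.

```python
# Week-over-week percentage change for the outage figures quoted above.
def pct_change(old: int, new: int) -> float:
    """Percentage change from old to new (negative means a decrease)."""
    return (new - old) / old * 100.0

print(f"Global, week 1: {pct_change(239, 213):+.0f}%")  # roughly -11%
print(f"Global, week 2: {pct_change(213, 310):+.0f}%")  # roughly +46%
print(f"U.S., week 1:   {pct_change(109, 95):+.0f}%")   # roughly -13%
```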

Diving in to see what that reflected, U.S.-centric outages accounted for 52% of all observed outages. That's larger than the percentage we saw in the previous episode, covering April 10-23, when they accounted for only 44% of all the outages we observed. And this is the second consecutive fortnight where U.S.-centric outages have accounted for more than 40% of all observed outages. We hadn't previously seen this percentage rise this high in 2023, so it'll be interesting to see how this continues.

If I delve into those, and I did, I can't help myself, 93% of these outages occurred outside of U.S. business hours, which we base on Eastern Daylight Time, 9 AM to 6 PM. That's probably why we didn't see too much in terms of user impact. So if we look at this—

KEMAL: That also indicates it may actually be due to maintenance-related work during off hours, but the timing is quite interesting. I'm just trying to think about why the start of May, right? Potentially providers are ramping up capacity for the holiday season, doing regular maintenance work, and things like that.

MIKE: Yeah, absolutely. I find the numbers interesting anyway, but, like you say, the question is what drove this one. If I go back and look at previous years, that percentage wasn't as big, but the actual numbers resonate the same. If I look at 2021 and 2022 for the corresponding months, the outage numbers look about the same, give or take a few, which is kind of interesting.

The other thing is that this period takes us to the end of April. So if you look at the April outages as a whole, we actually had a drop: total outages went from 1,077 in March to 1,026 in April, a 5% decrease. U.S.-centric outages, on the other hand, rose from 369 to 451, a 22% increase. Again, that comes back to your point about maintenance-type activity, with those outages occurring outside business hours.

Now, an interesting thing, and we'll probably dive into this in the next episode because I don't have time here: I said these U.S.-centric outages had minimal impact, or that we hadn't seen too much user impact, but we're also starting to see them spread out globally. Previously, the domino effect, as it were, was restricted to a particular area, which is why U.S.-centric outages figured lowly in terms of how many users they impacted. Now we're starting to see some of that domino effect spread. Like I say, it's an interesting pattern, and we'll dive into it as we go further into the year.

Okay, so with that, let's discuss some of the outages in the past couple of weeks as we go under the hood.

So, driving home the importance of redundancy, the first outage I want to discuss was caused by a water intrusion incident in a Paris data center that led to the shutdown of multiple zones in Google Cloud's europe-west9 region. This occurred on April 25. According to Google, a water leak in one of their data centers led to a fire in a battery room, and Google subsequently experienced an infrastructure failure that affected the europe-west9 cloud region, impacting multiple cloud services.

What happened was that the water leak initially impacted a portion of europe-west9-a, but the subsequent fire required europe-west9-a, europe-west9-b, and a portion of europe-west9-c to be temporarily powered down while they dealt with the fire. So many regional services were affected while europe-west9-c was partially unavailable. As those regional services were restored, once europe-west9-c and europe-west9-b were back online, europe-west9-a was still suffering, but workloads could be shifted elsewhere.

So we saw different levels of impact. If my workloads were situated in that specific data center, or that part of the data center, then the impact was on those regional services. If I had other availability zones, other geographic regions, or workloads that could be shifted, then I could pick things back up. So the impact varied: some of it was local, some of it spread out more globally. So Kemal, do you want to briefly show us what we actually saw?

KEMAL: So what really happened here is that, from the ThousandEyes perspective, we observed a lights-off event. For the audio-only listeners, what we're looking at is a timeline, and the timeline indicates that everything was working fine until about the 26th of April, 2023, at around 2:45 UTC, when we started observing 100% packet loss. Essentially, this is an agent-to-agent test, as part of which we test traffic bidirectionally between the agents probing the target agent; the target agent in this particular case was an enterprise agent hosted in the europe-west9-a GCP availability zone.

And as you can clearly see, at that particular timestamp we started observing 100% packet loss, which goes hand in hand with what you just explained, Mike. If we go back to the latency, you'll see that everything had been working fine; we were seeing an average latency of approximately 155 milliseconds. Bear in mind that this is an average latency across the different agents probing this particular target. And then comes the lights-off event, at which point no more data was collected.
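For readers who want a concrete picture of the two metrics being discussed here, below is a minimal, illustrative probe sketch in Python. It is not how ThousandEyes agents work internally, and the target host is a hypothetical placeholder; it simply shows probe loss and average latency being measured, and why a total outage produces no latency samples at all.

```python
# Minimal sketch of a loss/latency probe, loosely in the spirit of an
# agent-to-agent test. Illustrative only; the target below is hypothetical.
import socket
import time

TARGET = ("probe-target.example.internal", 443)  # hypothetical target agent
PROBES = 20
TIMEOUT_S = 2.0

rtts = []
lost = 0
for _ in range(PROBES):
    start = time.monotonic()
    try:
        # A TCP connect round trip stands in for a probe packet here.
        with socket.create_connection(TARGET, timeout=TIMEOUT_S):
            rtts.append((time.monotonic() - start) * 1000.0)  # milliseconds
    except OSError:
        lost += 1  # timed out or unreachable: counts as loss
    time.sleep(0.5)

loss_pct = 100.0 * lost / PROBES
# Only successful probes contribute latency samples, so a "lights off"
# event shows up as 100% loss and missing latency data, not zero latency.
if rtts:
    print(f"loss: {loss_pct:.0f}%, avg latency: {sum(rtts) / len(rtts):.1f} ms")
else:
    print(f"loss: {loss_pct:.0f}%, no latency samples collected")
```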

MIKE: This is kind of interesting on that point, and it might seem obvious, but we saw the loss rate, as you said, a lights-on, lights-off situation. If I were looking at this without that context, I'd go, “Jeez, look at that, my latency's improved. I've gone from 150 milliseconds to zero latency. Everything's improved.” So this is what we're talking about: looking at things in context. I've got a loss rate, and that's why the latency has dropped away; it's causation rather than just correlation.

KEMAL: Exactly. And very often we speak about these things from the perspective of large companies having the issues. Well, you might be a smaller enterprise, or a retailer, or whatever business you're in. And it's a very legitimate question to ask yourself whether the issue is on your side or whether it's actually something external to your control.

And having visibility and observability, such as ThousandEyes in this particular case, helps tremendously, because with this kind of test you can see straight away where the demarcation point of the issue is.

MIKE: That's really interesting, and a nice segue. We're talking about redundancy, and when I come to this concept of redundancy and disaster recovery planning, what I'm really interested in when something happens is identifying who's responsible. That's not just a question of passing blame or cost; it helps on two counts.

It helps from a planning perspective, understanding how my traffic flows, but also how quickly I can react. If I go back again, and this is going to really age me, my disaster recovery plan in a previous life was basically: unload the tapes off the machine, get in the car, drive down to the port, go across on the ferry, then drive to a data center somewhere in Europe. It took us two days to get everything back up and running. But it didn't really impact the business we were running, because we weren't dependent on those systems in the same way. Now we are.
So a couple of things I want to delve into or ask about here. The first point is: okay, this was an issue that occurred. They identified a water intrusion that generated a fire that caused them to shut things down. But those three availability zones were all essentially reasonably co-located, because this one fire took out, or meant they had to power down, all of them. Admittedly, two of them came back up reasonably quickly, so workloads could start to be moved across. But if I was dependent on that one zone, or had everything in those same availability zones, I'd want to know about it, and I'd also want to know what decisions I can make to work around it.

KEMAL: Looking at this particular issue, I'm pretty sure companies are going to take a look and reconsider how they approach availability zones from the perspective of interdependencies: whether power plays a role from A to B, from B to C, from A to C, and so on, right? But beyond that point, to what you said previously, the nature of applications and how we work has changed significantly. There was a paradigm shift that happened approximately 10 to 15 years ago, as part of which software as a service is now the new application stack. Applications are no longer single-hosted in an on-prem data center with a clear disaster recovery plan.

Also, the way we build applications has changed significantly. If you think about it, with microservices and everything, it's much more spread out, and the way these applications work is significantly different from how we historically did things, right?

So that's the first point: in this particular case, and in general, all the cloud providers, regardless of which one, have this narrative that redundancy is your responsibility, you know? And as a customer you should think about putting your eggs, so to say, into multiple baskets, the baskets in this case being availability zones. But an event like this one can also get us thinking about hybrid cloud deployments, where you might consider keeping something on-prem, or a disaster recovery plan on-prem, with your main application deployed completely in the cloud.

Or potentially, you should consider going beyond Inter-AZ availability, right? Yes, Inter-AZ brings a lot of value. Based on the cloud report we published in 2022, Inter-AZ latency was approximately two milliseconds, which is astonishing if you think about it, that they're able to provide that level of latency. However, latency doesn't really come into consideration once you have a full outage. It doesn't matter that it's two milliseconds when it's not working.

So the other thing is, customers should really think about disaster recovery plans in terms of: should we go beyond Inter-AZ? Should we have something on-prem, something in a different region, or potentially something in a different cloud, right? Multi-cloud deployments and multi-regional deployments, beyond what's typically done with multi-AZ, could be one of the answers here.
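To make that idea concrete, here is a minimal sketch of client-side failover across tiers (a second region, another cloud, or an on-prem fallback). The endpoint URLs are hypothetical placeholders; in practice this would more likely be handled by DNS failover, a global load balancer, or infrastructure-as-code policy rather than application code.

```python
# Illustrative tiered failover: try the primary region first, then a
# secondary region, then an on-prem fallback. Hostnames are hypothetical.
import urllib.request
import urllib.error

ENDPOINTS = [
    "https://app.europe-west9.example.com/healthz",  # primary region (hypothetical)
    "https://app.europe-west1.example.com/healthz",  # secondary region (hypothetical)
    "https://dr.onprem.example.com/healthz",         # on-prem DR site (hypothetical)
]

def first_healthy(endpoints, timeout_s=3):
    """Return the first endpoint that answers its health check, else None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # unreachable or unhealthy; fall through to the next tier
    return None

active = first_healthy(ENDPOINTS)
print("routing traffic to:", active or "no healthy endpoint, page the on-call")
```

Whatever the mechanism, the design point is the one Kemal is making: at least one fallback should sit outside the failure domain of the primary, whether that's another zone, another region, another cloud, or on-prem.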

MIKE: Yeah, there are many interesting points you've raised there, Kemal, but there are a couple I want to go back to, and one I want to dive into a little further in a minute. When you talk about the cloud providers and hybrid, I 100% agree with you; that's exactly what you want to be doing.

But you also have to look at that in the context of performance. What's my trade-off? Obviously there are costs involved. In an ideal world, I'd have availability zones in different countries and everything would be protected. But if it's not going to impact my performance too badly, even a manual process where I push an instance somewhere when I need it might be preferable to running up the costs of a complete standby system. By having that holistic visibility, you can start to make those decisions.

And the second point, and I want to hold this thought because I'm going to dive into it with the second outage, is what you mentioned in terms of, we'll call it complexity, this diversity that's now occurring. We have so many different moving parts involved that simply saying I'm going to have a hybrid cloud environment doesn't necessarily mean I'm going to have an effective disaster recovery plan. I need to take all these other considerations into account. What's the path coming into those environments? Do I have an aggregating point? What does the application look like? How is it architected? What are the other dependencies? So, as I said, don't answer that now, because I want to hold it as we dive into the next outage.

On April 20, some users of Microsoft 365 were unable to access certain apps from the central Microsoft 365 login page. This is the main landing page where you go in and choose which application you want. Some users were locked out from this login page completely, while others could reach some of the applications. But when they got to those applications, there were functional issues within them: parts of the app weren't rendering, search didn't work, those types of things.

So again, talking about redundancy in the context of this outage, if we consider what type of redundancy would have made sense in this situation, we come back to aspects of process. We're not necessarily looking for some automated action; it could be a workaround or an alternative approach to accessing the apps.

In this particular case, you could actually access the apps directly. You could put in the URL and go straight to the app: outlook.office.com for Outlook, microsoft365.com/word/launch to launch Word, and so on. So if you had prior knowledge of where you were going, you could get to the main part of the application. There were parts that weren't necessarily functioning when you got there, but on the whole the user could work, and there wasn't too much impact.

This incident had similarities with the one in March involving Okta, where we had a single sign-on portal through which many users access their enterprise application suites. So in this case, Kemal, we actually had reachability across the network. We could get there, and if we looked at the network, everything appeared to be connected. From that, we could conclude it wasn't a network outage; instead, we were looking at something within the application infrastructure itself.
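A simplified way to picture that layered conclusion is to check transport reachability and the application response separately. The sketch below is illustrative only, not a ThousandEyes test; it uses one of the direct URLs mentioned above (outlook.office.com) purely as an example host.

```python
# Layered check: is the network path up, and is the application answering?
import socket
import urllib.request
import urllib.error

HOST = "outlook.office.com"  # example host from the workaround above

# Layer 1: network/transport reachability (can we complete a TCP handshake?)
try:
    with socket.create_connection((HOST, 443), timeout=3):
        network_ok = True
except OSError:
    network_ok = False

# Layer 2: application response (does the service answer HTTP sensibly?)
try:
    with urllib.request.urlopen(f"https://{HOST}/", timeout=5) as resp:
        app_ok = resp.status < 500
except urllib.error.HTTPError as exc:
    app_ok = exc.code < 500  # a 4xx still proves the application answered
except (urllib.error.URLError, OSError):
    app_ok = False

if network_ok and not app_ok:
    print("Network reachable but application failing: likely a provider-side app issue")
elif not network_ok:
    print("Network unreachable: look at connectivity and routing first")
else:
    print("Both layers look healthy from this vantage point")
```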

KEMAL: Yeah, while you were explaining what happened, I was thinking about how hard it would be to figure that out as an end user, or as the IT administrator who's on the receiving end of the many complaints coming their way about the Microsoft suite not working properly, right?

This just paints a picture of how multilayered observability can help you quite a bit in these kinds of situations. You can say, “Okay, it's not me, it's them,” fairly quickly, and then go back to your internal stakeholders with something along the lines of, “Yes, we are aware this is happening. We are speaking to the service provider about it.”

The other thing I'm thinking about is how fascinating it is that by 2023 the nature of work has changed so much. It feels like yesterday that everything was hosted on your own computer; you did your work and then figured out how to transmit the results. Now everything happens in a SaaS environment. So going forward, I think monitoring and having deep visibility into the performance stack of these applications is going to play a key role.

MIKE: Absolutely. And if you go to Microsoft's explanation, the root cause of this outage appeared to be something I'm going to call middleware functionality, whatever the actual component was. They reported seeing high CPU utilization on the components handling this back-end navigation.

So, as I said, we could get to the front end; we had the context that the network was there and I could reach it. The question was then how my request for an application was directed in the back end, and, once I went to the application, how it was accessed and requested from there. The issue was impacting that navigation feature and the APIs it communicates with.

What turned out was that, when they identified this high CPU, Microsoft reported that they had recently done a service update; they reverted that service update and things started to fall back into place. But the point you just raised is incredibly important, I think, because it brings us full circle on this disaster recovery planning, or contingency planning for want of a better term. If I don't understand how things are connected together or how they're working, and remembering exactly what you said, things have changed. Everything used to be on the computer in front of you; everything was on punch cards when I started, then everything was on your own desktop. You could work completely disconnected and still keep things going.

Now I have different APIs, my workforce is distributed, my workloads are distributed, and I'm using SaaS applications. Underpinning all of this is the Internet, which is dynamic and self-healing in places, and all these things combine to make my digital experience seamless. But at the same time, I've got to understand, if there is an issue in there, to your point, “is it me or is it you?” And then I want my playbook to kick in from there.

So in this particular case, it's “Okay, we understand where the issue is; connectivity is fine.” Once the help desk calls start coming in, they can communicate the workaround: this is the URL to go to directly. The process doesn't always need to be an automated function, but understanding what the problem is and where it sits is what allows you to implement it.

And I'm going to pause for a breath in a minute here, but when we talk about disaster recovery plans, we said they've changed dramatically over this period. Even just a couple of years ago, there probably wasn't a disaster recovery plan that considered the number of SaaS applications that would be in use and relied upon, and probably not one that really considered this distributed workforce either. So—

KEMAL: I agree, I agree. What you said earlier really resonated with me. There was this paradigm shift: the cloud is your new data center; SaaS is your new application stack; underpinning all of that is the Internet, which is your new network; and obviously, home is your new office, right? These four pillars of the paradigm shift have changed how we work, how we live, how we have fun. Both operators and providers are still in the process of adapting to this change.

MIKE: Absolutely, absolutely. So on that point, I think we'll leave that for today. So thanks, Kemal. As always, mate, it's been an absolute pleasure. Always great to have you on the podcast and hope to get you back soon.

KEMAL: Thank you so much. It's been my pleasure recording this session with you today, Mike.

MIKE: So that's our show. Please like and subscribe, and follow us on Twitter @ThousandEyes. If you have any questions, feedback, or guest suggestions, please feel free to send us a note at internetreport@thousandeyes.com.

And if you want to connect with Kemal and me in person, we'll both be at Cisco Live in Las Vegas, which runs June 4-8. We'd love to have you stop by the ThousandEyes booth to say hi and chat more about Internet health. I'm always happy to talk about SLAs, outage trends, networking, anything you want.

Also, I know Kemal is going to be leading a fascinating breakout session on rethinking network monitoring. The session will be about how to be proactive rather than reactive, and we'll include a link to register in the description box below. Please definitely check it out; I think there are a few places still available.

So until next time, that's our show. Thank you, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com