SharePoint Outage and Security Certificate Considerations | Pulse Update

MIKE HICKS: Hi everyone, and welcome back to The Internet Report’s Pulse Update, the biweekly podcast where we discuss what's up, what's down, what's working and not working, and generally keep our finger on the pulse of how the Internet is holding up week over week.

This week we're chatting about security certificates, manual changes, and an outage at SharePoint, as well as disruptions at Slack, Starbucks, and NASA.

And joining me to discuss it all this week is Joe Dougherty, a Product Solutions Architect at ThousandEyes.

Joe, it's really great to have you on board. Welcome to the Pulse Update.

JOE DOUGHERTY: Thanks Mike, I’m glad to be here. What are we getting into today?

MIKE: That's great. So let's first of all start with The Download, my TL;DR summary of what to know about the Internet this week in two minutes or less.

On July 24, Microsoft experienced a global issue that impacted connectivity to SharePoint Online and OneDrive for Business services. We're going to dig into this in more depth later in the episode, but in short, the issue appeared to be due to an erroneous change to an SSL certificate that prevented the establishment of a secure connection to the services.

Slack also experienced a system-wide issue on July 27 that left some users unable to send or receive messages for just under an hour. Again, we're going to explore this a little more later in the episode, but in short, the issue was identified as being caused by a change to a service that manages Slack's internal systems communication. This resulted in degradation of Slack functionality until the change was reverted, which resolved the issue for all users.

Alrighty, so slightly off our normal orbit, NASA experienced a communications outage between Houston Mission Control and the International Space Station on July 25 that impacted command, telemetry, and voice communications. The root cause was reportedly a power outage stemming from upgrade work in a building housing Mission Control at NASA's Johnson Space Center in Houston. The outage appeared to impact communications only, which essentially means this was a ground-only issue, and NASA confirmed as much as they worked through it. The point is that whether you're serving users in space or a bit closer to home, it's vital to understand the what and the why of any issue that occurs so you can quickly take the most appropriate action. In this case, it wasn't that NASA didn't have backup systems or even a backup mission control; rather, they were able to quickly assess what was going on, understand the root cause, and based on that, determine that the most appropriate course of action was simply to wait for power to be restored. And that's actually a really good process to undertake and go through.

The final one: I like to think Australia is the coffee capital of the world, or at least home to the best coffee, which is why this outage caught my eye. On July 20, Starbucks sent a push notification from its app notifying all customers that their order was ready, whether they had actually ordered a coffee or not. Now, while this may have caused some confusion, it may also have inadvertently prompted some users to think, "you know what, I actually need a coffee now." And I'll say, I probably need a coffee right now.

This mass push notification coincided with a partial app outage that affected one specific portion of the Starbucks app's functionality: the order ahead and pay feature. However, it's unclear if the outage and the mistaken push notification are actually related. Certainly, you'd expect the push notification to suggest that there was some active work on the messaging portion of the app. You'd also expect that kind of acknowledgement notification to be a function of the order ahead and pay feature, which was the part impacted by the outage itself. And as Starbucks attempted to fix the outage, it's possible that a change or a test message was mistakenly pushed to production, which may have been the erroneous "order is ready" notification we saw.

As always, we've included chapter links in the description box below so you can skip ahead to the sections that are most interesting to you. And if you haven't subscribed yet, we'd love it if you'd take a minute to hit the subscribe button right now. It really helps us out and also makes sure that you're the first to know when new episodes drop. Please feel free to email us at internetreport@thousandeyes.com; we welcome your feedback, questions, and any suggestions you've got for the show going forward.

And now let's take a look at the overall outage trends we've been seeing over the past couple of weeks.

So as the Northern Hemisphere summer continues, we continue to see this reflected in seasonal outage numbers, which means we have quite a flat trend across the past couple of weeks. Global outages trended downwards over this two-week period, initially dropping from 192 to 186, a slight 3% decrease when compared to July 10-16. This was followed by another drop from 186 to 156, a 16% decrease compared to that previous week.

This pattern was reflected in the U.S., where outages initially dropped from 96 to 74, a 23% decrease when compared to July 10-16. U.S. outage numbers then dropped again from 74 to 60 the next week, a 19% decrease.

U.S.-centric outages accounted for 42% of all observed outages from July 17 to July 30, that two-week period, which is somewhat smaller than the percentage observed between July 3 and 16, when they accounted for 51% of all observed outages. So while this was a drop, it's interesting that the trend we've observed since April, in which U.S.-centric outages have accounted for at least 40% of all observed outages, carries on. That's different from what we saw in 2022, when by this time of year they had started to drop below that 40% mark. So it's going to be really interesting to see how that continues throughout the year.

As we've talked about in previous podcasts, though, even when the raw outage numbers rise, and even though these U.S.-centric outages keep occupying a large portion, greater than 40%, of the overall outages we see, we're not actually seeing global impacts from them. We're not seeing a domino effect spreading out from there.

So now with that, let's discuss some of the outages from the last few weeks as we go under the hood.

Shortly after 19:05 UTC on July 24, ThousandEyes observed global users unable to reach SharePoint Online and OneDrive for Business sites due to what turned out to be an erroneous change to an SSL certificate that prevented the establishment of a secure connection to the service. This was quickly identified by Microsoft, but not before users globally appeared to be impacted. So with that, Joe, would you like to take us through what we were able to see in ThousandEyes?

JOE: Absolutely, thanks Mike. For all of you listening on the audio-only podcast, what I'm showing here on the screen is the SharePoint outage visualized in ThousandEyes. And don't worry, I'll explain everything that we're viewing on the screen as we go so we don't leave any of you audio-only folks behind.

For those of you who aren't familiar with ThousandEyes itself, it's a platform that gives you a view into your entire digital supply chain. ThousandEyes has agents deployed across the Internet and around the world to get visibility into network and application disruptions. That's very useful because if and when problems occur, you have the information you need to identify where the problem is, who's responsible for fixing it, and what you can do to either remediate it yourself or proactively inform your users and stakeholders.

What I'm showing here on the screen right now is the SharePoint Online outage as seen through ThousandEyes Internet Insights, which is a kind of macro-scale view within ThousandEyes. And what I'm showing here is a timeline indicating the number of affected servers as part of the SharePoint outage. We can see on the timeline that at seven o'clock PM UTC there were no issues. But if we fast forward five minutes to that 19:05 UTC time that Mike just mentioned, we can immediately see on this map view that there are hundreds of servers affected, which means that when users tried to reach SharePoint Online or OneDrive for Business, they were unable to do so due to some underlying error. Now the map view is great for giving you that geographic scope and a quick visualization of the scale of this outage.

But we also have a topology view, which I'll switch to now. In the topology view, on the left-hand side of the screen we have the source agents, and on the right-hand side we have those application servers. So again, you quickly get that geographic scope, but you also get the quantity, the number of agents affected in each location. Now what's interesting here is we can see that when this outage first started, it was not small but not massive: about 200 servers affected, in locations around the globe, but mostly isolated to the United States. But if we move forward another five minutes, we can see the scale increases significantly, and we see a few things change in our topology view. On that right-hand side, we can see there are now 1,700 SharePoint servers being affected. We can see there are a thousand agents in the United States being affected by this, and an increasing number of agents in other countries as well. So we've identified the scale and the scope of the outage, but now we want to try to determine what the underlying problem or cause is.

And again, we can start to do that from this Internet Insights view we're in by hovering over either the affected servers or the affected agents, which indicates the type of outage this is. We can quickly see that there's something going on with an SSL handshake or SSL negotiation here. We can drill in a little deeper to see whether there were any other errors encountered; maybe there's more than one problem occurring here. But when we do that, we can quickly see that 99% of these errors are due to SSL.

MIKE: So just on that, Joe, you made some good points there. First of all, you said that we now see this as a global issue. And for some of those follow-on errors, is it safe to say that when we're seeing things like response-time errors alongside the SSL negotiation errors, they could almost be chained impacts as well? As in, it timed out because it actually failed to make a connection, for example?

JOE: Yes, I would say that's generally possible. I would say that with the way we gather and render this data, that wouldn't particularly happen here; we'll identify the fact that there was an SSL error as its own distinct error, separate from that response time. But that said, the changes being made by Microsoft at this point in time involved configuration updates and distribution of a configuration to global infrastructure; you have to imagine things like web servers being rebooted and DNS records updating. And so if you have requests in flight in the middle of some of these config updates or changes or pushes, that's when you might see some of those things like an HTTP response timeout or similar.

MIKE: Got it. Thanks, mate.

JOE: Awesome. So let me switch into a different view here, moving out of this macro or broad global scale view into a more micro view. And let me just reset this a little bit. So what I'm showing here on the screen, again, it looks similar. We have a timeline at the top and what it's showing us is the availability of a given web server. In this case, for this test, you can see that URL here, acmehero-my.sharepoint.com. So this is a OneDrive for Business URL. And again, if we look at that timeline, we can see leading up to 19:00 UTC, things are pretty stable, maybe a small blip here or there, which when you're testing from 20 locations around the globe every two minutes is not uncommon. But what we're really looking for is that same kind of dip we saw or drop in availability that correlates with the outage we saw on Internet Insights.

But what I'll do now is jump ahead to that 19:05, or in this case 19:06 interval here. And what we can see here is that first initial blip, right? We can see an agent in Chicago. And what do we see when we hover over this? SSL, no alternative certificate subject names. So that's indicating to us again, some SSL issues are happening here. I'm going to jump ahead just a little bit. And again, we'll see the growth of the scale of this outage, right? It went from one or two agents now to, let's see here, 17 of them or so that are now failing as they're trying to get responses from OneDrive. So I'm going to switch out of this view I've been showing you here with the status by phase and the map of the agents to go into our table where we can drill into even more detail. And again, we can quickly see here all of these agents are reporting SSL errors. Now with ThousandEyes, we're sending the synthetic traffic, we're sending HTTP requests and measuring HTTP response time and validating status codes, but on top of that, we're also checking for certificate validity. So when we see an SSL error like this, we can drill into it to understand what's going on.
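For those following along in text, here's a minimal sketch in Python of what this kind of synthetic check does conceptually: time an HTTP request, record the status code, and surface TLS failures as their own error type. This is only an illustration, not how ThousandEyes' agents are implemented; the target URL is the demo hostname mentioned above, and the requests library is a third-party dependency.

import time
import requests  # third-party: pip install requests

def synthetic_check(url: str, timeout: float = 10.0) -> dict:
    """Toy synthetic test: measure response time, record the HTTP status,
    and report TLS/certificate failures as a distinct error category."""
    started = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)  # verifies the TLS cert by default
        return {
            "ok": resp.ok,
            "status": resp.status_code,
            "response_ms": round((time.monotonic() - started) * 1000),
        }
    except requests.exceptions.SSLError as err:
        # Certificate problems (expired, untrusted, hostname mismatch) land here.
        return {"ok": False, "error": f"SSL error: {err}"}
    except requests.exceptions.RequestException as err:
        # Timeouts, DNS failures, connection resets, and so on.
        return {"ok": False, "error": str(err)}

# Run this every couple of minutes from many locations and you have the
# bare bones of an availability timeline.
print(synthetic_check("https://acmehero-my.sharepoint.com"))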

And let me just scroll up a little bit so we can do a quick comparison. We can see the entire certificate chain, right? Because while it's important to have a valid leaf certificate, it also needs to be signed by some trusted root certificate authority. So we're able to trace through that entire certificate chain. But what's interesting in this case is that the common name and the subject alternative name for this certificate is a wildcard subdomain of sharepoint.de, which of course is the German top-level domain. But that's not the target we're monitoring here, right? We're monitoring sharepoint.com. So that's the underlying issue: at 19:06, initially just on some servers, and then by 19:10 on significantly more, all of these SharePoint servers were responding with a certificate that was otherwise valid. It wasn't expired, it was signed by a trusted root CA, but the host name in that cert was incorrect. And so that's where the underlying issue was here.
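To make the "valid certificate, wrong hostname" failure concrete, here's a small sketch using Python's standard ssl module. It skips the hostname check (while still requiring a chain signed by a trusted root) so you can see the DNS names the server actually presents and compare them to the target you intended to reach. The hostname is the demo target from the test above; this is an illustration, not a reproduction of the incident.

import socket
import ssl

def names_in_presented_cert(target: str, port: int = 443) -> list[str]:
    """Connect to the target, require a trusted certificate chain, but skip
    the hostname check so we can inspect the names the server presents."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # don't fail on a mismatch; we want to look at the cert
    # ctx.verify_mode stays CERT_REQUIRED, so the chain must still be
    # signed by a trusted root CA, just as it was during this incident.
    with socket.create_connection((target, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=target) as tls:
            cert = tls.getpeercert()
    return [name for kind, name in cert.get("subjectAltName", ()) if kind == "DNS"]

target = "acmehero-my.sharepoint.com"
print("Target:", target)
print("DNS names in certificate:", names_in_presented_cert(target))
# During the outage, a request to a sharepoint.com host reportedly got back a
# cert whose names covered *.sharepoint.de instead: trusted and unexpired,
# but not valid for the host being contacted.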

MIKE: Yeah, and I think that's an important point here. We talk about checking for cert validity, but everything also has to be in context; everything's got to be right. As you showed there, the certificate wasn't actually pointing to the right domain; it was for the German top-level domain. So for all intents and purposes it was a valid certificate, but it didn't completely match what we were connecting to.

JOE: That's exactly right. And that's the challenge with so many of these problems is there's a lot of complexity. There's layers throughout the stack and there's complexity in each one of those, in each area within those layers, right? SSL on its own is kind of a beast and there's a lot that can go wrong. It was a valid cert, it just wasn't put in the right place or maybe the right one was taken out of the right place or something to that effect.

MIKE: So we're saying again, there are all these different things, and like you said, with TLS or SSL it's like an eight-way handshake before we even get started. All of these add complexity, but they've all got to be working in unison for you to be able to gain access to the application or the critical service you're trying to reach. So not only have they all got to work, but you've really got to be able to understand how they're all working together. You've got to see right across that service delivery chain, as we call it.

JOE: Yep, that's exactly right. Yeah, in this case this is an application-layer issue, right? We could drill into the network, we could look at end-to-end network metrics and hop-by-hop path visualizations and so on, and that all looks fine. That's not the problem here. The problem is at the application, with that wrong certificate.

MIKE: Yeah, and that's another really interesting point, because being able to see everything together means I can know the network's not the issue here, but then where is it? The users know they've got a problem because they can't access the service. So what is actually causing my problem? And the only way I can answer that easily is to take a holistic view of that delivery as it goes through.

JOE: That's exactly right. Yeah, end to end and depth through the stack, yep.

MIKE: Cool, just one other thing on this. We've seen the change that went through here, and while we don't know what was going on in the background, can we assume this was a manual change? We talk about certificates being a critical part of the delivery chain, so how would we normally go about changing a certificate? They obviously expire; it's not like my password that I set and leave for 30 years. Certificates have an expiration date, and we saw that when you expanded the TLS certificate there. So I guess what I'm asking is: this was a manual change, or at least appeared to be. What can we do to make sure these things don't expire on us, given they're critical parts of the delivery chain?

JOE: Yeah, absolutely. I mean certificate management has been a challenge for organizations big and small for as long as certificates have been around. You know, this is not the first outage due to certificate issues that we've seen, not by a long shot. It is hard.

Now, if we go back maybe five-plus years ago, getting a certificate was not trivial. This all works through what's called PKI, or public key infrastructure, and we won't get into all the weeds of that today, but it comes down to this: if you want a certificate, you need somebody to sign that certificate. They need to vouch that the certificate is valid and accurate, that you are who you say you are. And so this used to be a very manual process. You might have to upload files to your certificate authority, and they may need to review things. If it's an EV cert, or extended validation, which includes things like organization names, addresses, and so on, there may be extra identity checks and things of that nature.

And so all this is to say that because it was an involved, tedious process, more so than it needed to be, we used to generate certificates with long lifetimes: usually at least a year, sometimes two years, sometimes longer, so that we wouldn't have to go renew these certs all the time. The first problem with that is that if the private key for our certificate leaks somehow, that allows an attacker to man-in-the-middle or intercept traffic, or maybe spin up a spoofed site or a clone site, things of that nature. So there's a concept with certificates called certificate revocation, the idea being that if I have a key that leaked and I know my cert is potentially compromised, I can revoke it. Now that was great in theory, but it didn't really work out in practice.

What we've done since then is increasingly move toward certificates with shorter lifetimes. We primarily did this to handle that leaked-key, certificate-revocation kind of case. But the other benefit is that if you have a certificate that's only valid for somewhere between 30 and 90 days, and you have to renew that cert somewhere between once every month and once every three months, it's not the kind of thing you're going to want to do manually. You're going to want automation to handle that. There are a number of different ways to do that: some folks run scripts on cron jobs, some folks have monitoring that can trigger automation. But I would say that certificates are far less commonly manually generated today than they were a number of years ago.
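As a concrete example of the kind of scheduled check that can trigger renewal automation, here's a short Python sketch that measures how many days are left on a server's certificate. The hostname and the 30-day threshold are arbitrary placeholders; in practice the renewal itself would typically be handled by an ACME client or similar tooling.

import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days before the server's leaf certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter looks like: 'Jun  1 12:00:00 2026 GMT'
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    not_after = not_after.replace(tzinfo=timezone.utc)
    return (not_after - datetime.now(timezone.utc)).days

RENEWAL_THRESHOLD_DAYS = 30  # renew well before expiry, not at the last minute

remaining = days_until_expiry("example.com")
if remaining < RENEWAL_THRESHOLD_DAYS:
    print(f"Only {remaining} days left: time to trigger renewal automation.")
else:
    print(f"{remaining} days remaining; nothing to do yet.")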

The only thing I want to add to that, as you said earlier, is that this is likely something that was triggered by a manual process. I would say there's still automation at play here, right? It's not like one engineer or one SRE deployed a bad cert to 2,000 servers all by themselves. It's very likely they were running something like an Ansible playbook. For those of you who aren't familiar with Ansible playbooks, they're a way to automate certain tasks against servers and other resources. Maybe they had a playbook to go deploy the cert, but with Ansible, for example, you can limit what they call the inventory, the hosts or servers that you're going to run that playbook against. Maybe this was a manual trigger of automation and someone forgot to set that filter, so instead of deploying only to the hosts in Germany, it was deployed to the entire global infrastructure.
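Here's a deliberately simplified Python sketch of the failure mode Joe is describing: a deployment routine where the host filter (analogous to Ansible's --limit option) defaults to the entire inventory when nobody sets it. The inventory, hostnames, and file name are all hypothetical; this isn't Microsoft's tooling or actual Ansible behavior, just an illustration of how a missing filter can turn a regional change into a global one.

# Hypothetical inventory of cert-serving hosts, grouped by region.
INVENTORY = {
    "germany": ["de-web-01", "de-web-02"],
    "us": ["us-web-01", "us-web-02", "us-web-03"],
    "apac": ["ap-web-01"],
}

def deploy_certificate(cert_path: str, limit: str | None = None) -> list[str]:
    """Return the hosts the certificate would be pushed to."""
    if limit is None:
        # The dangerous default: no filter means every host in every region.
        targets = [host for hosts in INVENTORY.values() for host in hosts]
    else:
        targets = INVENTORY.get(limit, [])
    print(f"Deploying {cert_path} to {len(targets)} host(s): {targets}")
    return targets

# Intended run: only the German hosts get the *.sharepoint.de certificate.
deploy_certificate("wildcard-sharepoint-de.pem", limit="germany")

# The mistake: the same automation triggered without the limit set.
deploy_certificate("wildcard-sharepoint-de.pem")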

MIKE: Excellent, that's really good. And just to add to that, to Microsoft's credit, they did recognize this really quickly and reverted it, and then everything went through again. So maybe they kicked off another automated process, then went back, took the lessons learned, and refined the playbook in this case.

JOE: Yep, yep, absolutely. Yeah, this was roughly 10 or 12 minutes of outage, which is pretty quick. I mean, I'm sure they started detecting problems before that deploy was even finished, right? We were looking earlier at the relative scale; we were seeing things like 40 or 50%, but not 90 or 100% outages. So it certainly seems like the kind of thing where, yep, they made a mistake, they noticed it pretty quickly, and they immediately started canceling it, rolling it back, or what have you. But it's just a testament to how challenging this is, right? There's a reason we measure availability in nines, whether it's five nines, six nines, what have you. No one's got 100% uptime. Just can't do it.
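For anyone who wants the arithmetic behind "measuring availability in nines," here's a quick back-of-the-envelope calculation, in Python, of how much downtime per year each level allows.

# Downtime budget per year for each "nines" availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (3, 4, 5, 6):
    availability = 1 - 10 ** -nines          # e.g. five nines = 0.99999
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines = {availability:.6f} availability "
          f"=> about {downtime_minutes:.2f} minutes of downtime per year")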

MIKE: Yeah, with maintenance windows and everything else going on across there. But again, this comes back to one of our overarching, repeated themes, a barrow I'm pushing constantly through the podcast: it doesn't necessarily matter what your plan is, things are going to go wrong. Let's understand what's going wrong and then implement a plan or process to mitigate it.

JOE: Yep, certainly.

MIKE: So thanks, Joe, that was really good. With that, let's now take a look at the second outage we want to dive into, a similar sort of thing, which occurred at Slack. Slack experienced a system-wide issue on July 27 that left some of its users unable to send or receive messages for just under an hour. A brief post-incident report notes that an issue was identified “after a change was made to a service that manages our internal systems communication. This resulted in degradation of Slack functionality until the change was reverted which resolved the issue for all users.”

So a couple of points here, Joe. The first is that Slack came out and said what was going on. Similar to what we were just talking about with Microsoft, this was the result of a change, and they rolled the change back, reverting it to fix the issue. The other thing is that this issue occurred very early in the North American morning, just after 2 AM Pacific Daylight Time, which suggests this was some sort of maintenance work intended to happen outside of U.S. working hours. But because the change had an impact they weren't expecting, it had a flow-on effect that affected global users. And from a time zone perspective, for people in Europe this was right in the middle of their business day.

And the other thing I want to note, I said a couple but I've actually got a few: this was broken functionality. They're talking very specifically here about broken functionality, not an outage per se. But if I can't use something, it's not functioning, and to me that's still effectively an outage. What appeared to happen was a breakdown in communication between the front end and the back end. Everything was up and working: if I'm looking at status pages, everything's green; if I'm looking at my online status in Slack, everything's green. It just means I couldn't communicate with the backend system. So everything looked right. The lights in the house were on, but no one was home, essentially.

JOE: Yep, absolutely. And this is another thing we see, I don't want to say all the time, but maybe more commonly than you'd expect. Part of this has to do with the way web application architectures have changed. We are moving away from monoliths toward distributed, microservices-based systems. If you think back ten-plus years ago, you kind of had your web tier, maybe a backend or an app running on Tomcat or what have you, and maybe a DB server, and that was kind of the extent of your average application. These days you've got application and/or network load balancers in cloud regions all over the globe, and those are talking to dozens of different services behind the scenes. So just by the fact that this architecture, this paradigm, has become so popular, it's almost created a new potential for problems: this front door to backend communication.

MIKE: Yeah, yeah. And just on that, these changes have happened for good reasons. This is a case where demand and user expectations have driven it. These are global products, global services. So me down here in Australia, connected by my bit of wet string, I want the same performance as someone sitting in North America. The only way to do that, or the most efficient way, is to have a distributed architecture. So it's a real need, but as you say, it has introduced the potential for other points of failure, or points of contention, throughout that system.

JOE: Yep, definitely.

MIKE: All right, so Joe, I've really enjoyed having you on board today. That's been fantastic, and I'd love to get you back on very soon.

JOE: Definitely, I'd look forward to it, Mike. Thank you so much for having me on.

MIKE: So that's our show. Please like and subscribe. We really appreciate it, and it's valuable to us. As I mentioned at the top of the show, not only does this ensure that you're notified as soon as a new episode drops, but it also helps us shape the show for you. Please follow us on X, formerly known as Twitter, at @ThousandEyes. Any questions, feedback, or guest suggestions, please send us a note at internetreport@thousandeyes.com. So with that, until next time, goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com