Understanding the UK Virgin Media Outages on April 4 | Outage Deep Dive
ANGELIQUE MEDINA: Welcome to the Internet Report, where we uncover what's working and what's breaking on the Internet and why. Today we're going to cover the Virgin Media outage that happened on April 4, earlier this week, in the UK.
So this was actually a couple of outages that impacted the reachability of Virgin Media's network and services hosted within its network. And the two outages shared pretty similar characteristics: both stretched over hours, both included the withdrawal of routes to its network, and both showed significant traffic loss as well as intermittent periods of service recovery.
To unpack all of this in today's episode, I am joined by our Principal Internet Analyst here at ThousandEyes, Kemal. Welcome, Kemal.
KEMAL SANJTA: Thank you so much, Angelique. It's awesome to be here with you, and I'm looking forward to unpacking this pretty significant outage that affected many customers in the UK.
ANGELIQUE: So there's quite a lot to go through with what we observed during these outages, in particular the significant amount of BGP activity that occurred throughout the incidents. But before we do that, I'm just going to give a quick summary of the incident.
About 30 minutes after midnight, we saw Virgin Media, the network service provider, start withdrawing routes to its network, that is, withdrawing its prefixes. And without a route to its network, Internet and transit providers all around the world would have simply dropped traffic destined for its network and the endpoints within it. Users would have experienced this as drops, as prolonged connection attempts that eventually failed.
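To make that mechanic concrete, here's a toy sketch of longest-prefix-match forwarding: when a destination's prefix has been withdrawn, there is no matching route left, and the packet is simply discarded. The routing table, prefixes, and next-hop names below are hypothetical, not Virgin Media's actual configuration.

```python
# Toy illustration of why a withdrawn prefix means dropped traffic: routers
# forward by longest-prefix match, and a destination with no matching route
# is discarded. All prefixes and next hops here are hypothetical.
import ipaddress

routing_table = {
    ipaddress.ip_network("10.0.0.0/8"): "transit-provider-A",
    ipaddress.ip_network("192.0.2.0/24"): "peer-B",
    # The withdrawn prefix is simply absent; there is nothing to match.
}

def next_hop(dst: str):
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routing_table if addr in net]
    if not matches:
        return None  # no route: the packet is dropped
    return routing_table[max(matches, key=lambda n: n.prefixlen)]

print(next_hop("10.1.2.3"))     # -> "transit-provider-A"
print(next_hop("203.0.113.7"))  # -> None, i.e. traffic to this destination is dropped
```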
And while a lot of the loss we saw throughout the incident was due to the fact that there was no route available, we also saw that there was still some advertisement of routes during the incident. So there were still some active routes, and in those instances we saw traffic get all the way to the edge of Virgin Media's network and then get dropped. So it seemed like there was maybe something going on beyond the withdrawal of routes. We saw periodic advertisements and withdrawals throughout the incident, which lasted about seven hours.
It started about 30 minutes after midnight local time in the UK, and then it resolved at about 7:20 UTC, so early in the morning local time. There were instances throughout that seven-hour period where there was availability, so it was somewhat intermittent. But then later in the day, around the middle of the afternoon, the outage resumed, and in that case it lasted about two hours.
So there's quite a lot to unpack here, as I mentioned, and lots of stuff related to BGP advertisements, packet loss, service availability. So to go through all of this, I'm going to turn this over to Kemal and he's going to show us quite a lot of interesting visuals that we were able to capture during this incident.
KEMAL: Thank you, Angelique. Yes, to your point, what's really striking about this incident is its duration, right? First of all, it looks like it started 30 minutes after midnight, which should have been a period of low activity. So the pattern of the outage itself really indicates that this might be related to maintenance that was going on within the network.
However, the second instance happened during busy business hours in the UK, which is really uncommon, quite apart from the fact that it lasted as long as it did.
So when we are looking at this particular incident, what's really important is that we saw this outage inside out. First of all, if you click on the BGP route visualization, we're going to see the pattern of the path changes. Here you're going to select one of these /16 prefixes. They are advertising a /14, /16s, and /17s, but for the time being we're just going to focus on one that had complete visibility from the perspective of all of our different BGP collectors, which are essentially devices that listen for BGP updates and other events on the Internet; we then use that information to build the visualization.
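As a rough sketch of what a collector pipeline does with the updates it hears, the example below replays a stream of announce/withdraw messages per peer and tracks how many peers still hold a route at each moment. The record format and the data are made up for illustration; AS 6830 is LGI-UPC as named in the discussion, while 64496 and 64511 are documentation ASNs standing in for the origin and an upstream.

```python
# Replay a hypothetical stream of BGP updates and track, per collector peer,
# whether a route to the prefix is currently held.
from collections import defaultdict

# (timestamp, collector peer, message type, AS path or None for withdrawals)
updates = [
    ("00:00:00", "collector-london", "announce", [64511, 6830, 64496]),
    ("00:00:00", "collector-sydney", "announce", [64511, 6830, 64496]),
    ("00:51:02", "collector-london", "withdraw", None),
    ("00:51:05", "collector-sydney", "withdraw", None),
    ("02:49:23", "collector-london", "announce", [6830, 64496]),
]

has_route = defaultdict(bool)  # peer -> does it currently hold a route?

for ts, peer, msg, as_path in updates:
    has_route[peer] = (msg == "announce")
    visible = sum(has_route.values())
    print(f"{ts}  {msg:<8} from {peer}: {visible} peer(s) still see the prefix")
```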
And as visible here, we can quite clearly see that Virgin is predominantly advertising everything through its primary upstream provider, LGI-UPC (Autonomous System 6830), and everything is working fine. Even before the first path change, if we click on the updates, we can actually see some small activity; on this particular monitor it starts around 15 minutes after midnight, but that was still minor. Then, at approximately 18 minutes and 48 seconds after midnight, the upstream provider started to change. Essentially, the autonomous system path had been Virgin to LGI-UPC, and they started changing one of the upstream providers.
This was still very minor; however, the really big change happened at 0:51, as part of which we can see that the provider tried to advertise the prefix to two different providers. Essentially, the initial path was Virgin, then LGI-UPC, then the autonomous system where our collector was hosted. We can see that change in favor of Zayo and then ultimately Cogent before the prefix itself got completely withdrawn. Unfortunately, we can see a lot of these red colors here, indicating a lot of activity. At the same time, we can see that Level 3 started showing up in the picture. So essentially, what was happening is that they were withdrawing the prefixes from LGI-UPC. And we can see that certain collectors, most of them located in the UK, actually saw Level 3 in the path.
Now, if I move forward with this event, one of the interesting things we observe is that a lot of the collectors listed on the left-hand side all of a sudden show complete unavailability of the prefix; essentially, reachability of the prefix from their perspective went to zero. However, as I mentioned, some of the collectors in the UK still saw connectivity through Level 3. So that's the first instance. The other important thing is to see how this reflected on the service itself. Looking here, we can see that as soon as the event started, approximately 30 minutes after midnight, when we observed Virgin Media's first withdrawal of the prefixes from LGI-UPC, we saw a significant spike in loss.
And here, bear in mind that what we are looking at is average loss, right? So even on average, this was significant. We saw 100% loss, and on the path visualization we see that the traffic is getting dropped, sometimes at the edge of the network, sometimes even before that. Essentially, what that indicates is that a lot of these networks had no reachability to Virgin Media's network at all. They did not have a path to reach it, as a result of which the traffic shows 100% loss.
The other striking thing is the timeline. As we said, the first spike in loss happened just 30 minutes after midnight, and the really big event happened at 50 minutes after midnight. Then we can see that it lasted for a pretty long time, until 2:20 UTC in the morning, before it resolved and everything looked clean.
If we click on the BGP route visualization, we can see that during that time there was some activity from the BGP perspective. Namely, if I click here, we can see that these monitors around the world started receiving advertisements that actually indicate things are getting better. So there was a significant improvement at 2:30 AM UTC, as part of which all the connectivity was working the way it's supposed to. There was this short period of time where everything worked fine. Unfortunately, it did not last for long. If I focus on this particular collector, we can see that just three minutes after the fix there was another prefix withdrawal, which we know is going to have a negative effect on packet loss.
And if we go back to path visualization, what we're going to see is a significant spike in loss. We can see that traffic is going toward the destination; however, it is being dropped. What we are looking at here is GTT receiving traffic destined for Virgin Media, and that traffic is being dropped. Then we can see that at 2:50 UTC there was an improvement.
At 2:50, what we can see is that the Stockholm, Sweden agents are still seeing packet loss, but this time the traffic is actually getting to Virgin's network, where it is being dropped. So this time it's not only that the traffic isn't making it because of a problem with BGP advertisements; it looks like something else might have been contributing to the loss within Virgin's network as well.
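A rough way to separate those two failure modes, assuming you have traceroute-style path data from each agent, is to check whether the trace ever enters the destination's autonomous system before it dies. The hop data below is hypothetical; the transit ASNs are real networks used only as examples, and 64496 is a documentation ASN standing in for the destination.

```python
# Classify a failed trace: did it die before ever reaching the destination AS
# (consistent with "no route"), or inside the destination network itself?
DEST_ASN = 64496  # documentation ASN standing in for the destination network

def classify(trace_hop_asns, dest_reached):
    """trace_hop_asns: ASNs observed along the path, None for unresponsive hops."""
    if dest_reached:
        return "reachable"
    if DEST_ASN in [asn for asn in trace_hop_asns if asn is not None]:
        return "dropped inside the destination network"
    return "dropped before the destination network (likely no route)"

# A trace that dies in transit, never entering the destination AS:
print(classify([3356, 3257, None, None], dest_reached=False))
# A trace that reaches the destination's edge and then dies:
print(classify([1299, 6830, DEST_ASN, None], dest_reached=False))
```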
The other point that's really important to make here is the pattern of these events up to this point. As we said, the event started 30 minutes past midnight, and, you know, that would be a typical time for regular maintenance; maintenance windows are usually scheduled during low-traffic hours for the potentially affected customers. However, given the length of this outage, what that tells us is that even if this started as planned maintenance, it's something that did not go according to plan.
Now that we've seen what it looks like from the BGP route visualization, if we move here toward the clear event, we can see that the majority of the agents during this particular time, at 0:50, started having normal reachability, so the issue seems to have been resolved. And if we go to the path visualization for this period: looking at the path visualization timeline, we know the event started at 00:30, 30 minutes past midnight, where we observed the first spikes in packet loss that closely correlate with the advertisements we saw. We can see that a lot of packet loss persisted throughout this event, essentially indicating that all of the agents on the left-hand side that were trying to send traffic toward virginmedia.com on the right-hand side had a problem reaching the prefix. And just looking at this, what we are seeing is that they had no route toward the service itself. We can quite clearly see that on the path visualization: if we click on the correct tests that we were looking at previously, we can see that the majority of the agents executing them did not have any visibility, and that issue persisted.
And here on the left-hand side, we can now see that there was some activity, and if you select one of the agents, we can see that at approximately 2:49:23 UTC, Virgin Media advertised their prefixes to LGI-UPC, ultimately reaching the autonomous system where our collector was. Now, if I go back to path visualization, we see a certain period of stability here: a drop in packet loss, followed by the service working the way it's supposed to.
Now, looking at this timestamp at 3:20, we can see that loss on average spiked again. And even though there were no path changes during this time, we can see that there were some BGP updates happening, which we can explain by convergence-related issues and the like. If we focus on the right prefixes, you're going to see more of that. Essentially, what we are looking at here is prefixes being advertised by Virgin Media to LGI-UPC, and then the prefixes propagating to the rest of the Internet.
Going back to path visualization, we can see that a period of stability started at approximately 3:40 and lasted until 4:20 AM. And if you look at that particular timestamp, around 4:15 or 4:20 AM, we can see that there were certain changes. If I select the prefix again, we can see a lot of activity coming again from LGI-UPC. In this particular case, we can see that they were changing the primary transit provider at the LGI-UPC level, essentially NTT, Level 3, Cogent, and then at 4:16:20 they withdrew the prefix altogether.
ANGELIQUE: Right. So, Kemal, what that's indicating, as you pointed out, is that there is a direct correlation between the withdrawal and changing of routes and the intermittent packet loss that we saw throughout the incident, including the long stretches of loss where there may not have been a route available to the service.
KEMAL: That's exactly correct. What's happening here is that we see 100% correlation between the BGP-related activity and the packet loss. But what's really interesting here, Angelique, is the fact that even though these withdrawals happened, there were certain activities as part of which prefixes were advertised to Level 3, for example.
So if I click on one of these events where there were changes, take the initial event itself, for example. Looking at that event, we can see that withdrawals happened from LGI-UPC and the traffic went to Level 3. But even though it went to Level 3, it looks like these advertisements, the way they were crafted, were severely limited in scope. We don't see this prefix propagating globally; in fact, what we see is that it only made it to certain locations in the UK. And even then, they were failing, because we saw 100% loss. So I think what we are potentially seeing is an issue that's twofold. The first part is the result of the prefix withdrawals, as part of which the prefix disappeared for the majority of the global collectors. However, there might also have been something happening within Virgin Media's network, as part of which even locations that had reachability to Virgin, as we saw from the Stockholm agent's perspective in one of the data points, for example, had their traffic dropped within Virgin's network, which is quite an interesting point.
So what I'm trying to say here is that, you know, even though you might have your disaster recovery plan outlined, one of the things that operators should really do is test their assumptions, right? For example, as part of improving operational excellence, they might want to advertise a prefix that does not carry significant customer traffic and then observe how it actually propagates on the Internet. Those kinds of things are quite important for operators.
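One lightweight way to test that kind of assumption is to check how many public route-collector peers actually see the test prefix once it has been advertised. The sketch below assumes RIPEstat's routing-status endpoint and its response fields, both of which are assumptions to verify; adapt it to whatever looking glass or collector feed you actually use, and substitute your own test prefix for the documentation prefix shown here.

```python
# Query a public route-collector service for how widely a prefix is visible.
# Endpoint and field names are assumed (RIPEstat "routing-status"); verify
# against the service's documentation before relying on them.
import json
import urllib.request

def visibility(prefix: str) -> None:
    url = f"https://stat.ripe.net/data/routing-status/data.json?resource={prefix}"
    with urllib.request.urlopen(url, timeout=15) as resp:
        data = json.load(resp).get("data", {})
    v4 = data.get("visibility", {}).get("v4", {})
    seeing = v4.get("ris_peers_seeing")
    total = v4.get("total_ris_peers")
    print(f"{prefix}: seen by {seeing} of {total} route-collector peers")

# Example with a documentation prefix; use the test prefix you actually advertise.
visibility("192.0.2.0/24")
```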
It looks like in this particular case the prefix was going from LGI-UPC to Level 3; however, because of the limited scope, it did not really make a significant difference. And going back to this particular event, if you are the operator, you could be quite misled by that sequence of events. First of all, you have a clear outage, you execute your disaster recovery plan, and you assume that the issue should be resolved, at least for the customers that were originally affected. However, what's really happening is that you did not resolve anything, which seems to be the case here. So again, this issue seems to be twofold, as part of which even those collectors that had connectivity seem to have had their traffic dropped at the edge of the network itself.
ANGELIQUE: On the Virgin side, yeah.
KEMAL: Exactly.
ANGELIQUE: And I think another thing that you brought up is also very interesting: even when there were periods where the routes had been re-advertised and were active or reinstated at some of the service providers where we had our BGP collectors monitoring, it wasn't broad. It seemed like it was regionally limited, which would have meant that even in the periods like we see here, where traffic was successfully flowing to the service and we don't see loss, that would likely only have applied to service providers that were in region, versus, say, if you were in the United States connected to a service provider trying to reach Virgin Media, or in Australia or other locations around the globe. Is that reasonable? Is that what we see here?
KEMAL: Exactly, exactly. So looking at this particular Sydney-7 BGP collector, once the event happened at 0:50 UTC on the 4th, what we are looking at is a prefix withdrawal. And then for the majority of the event, it's quite obvious that this particular BGP collector did not have any connectivity, or did not have the prefix. We can see that it appeared again, withdrawals happened, re-advertisements happened, and during some of those periods it had the route. But when this particular collector was suffering from route loss, or not having a route for that matter, the route was supposed to come via Level 3's advertisements, and we know those advertisements were severely limited in scope.
So we can see this event unfolding in intermittent packet loss spikes. They lasted for a very long time; the loss would recover and then exactly the same pattern would happen again. And the last instance of this was around 7:00 or 7:20, to your point from earlier, Angelique, at which point the issue got resolved.
Another really important thing to mention here is the negative effect this had on services. So let's look at the services provided by Virgin in this particular case, or the services that were reachable through Virgin, because sometimes they are selling transit as well, right?
What was happening here is that we can see availability drop, and availability is nothing more than the ability of our agents to execute these five or six steps: DNS, meaning DNS resolution of the domain name to an IP address; connect, the TCP three-way handshake; SSL, the ability to successfully negotiate TLS; send and receive, which are essentially timers to the first byte and the last byte of the response; and HTTP, which is what we receive from the HTTP server itself. And we can quite clearly see that during this time, when there was BGP activity followed by significant spikes in packet loss, the service itself suffered really significantly: the agents were not able to open the page, and nothing was working the way it was supposed to.
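For reference, here is a minimal sketch of that kind of phase-by-phase check using only the Python standard library. It is not ThousandEyes' implementation; the target URL and timeout are placeholders.

```python
# Time each phase of an HTTP(S) fetch separately: DNS, TCP connect, TLS,
# send, wait for first byte, and receive until the last byte.
import socket, ssl, time
from urllib.parse import urlparse

def check_availability(url="https://www.virginmedia.com/", timeout=10):
    parsed = urlparse(url)
    host, port = parsed.hostname, parsed.port or 443
    timings = {}

    # DNS: resolve the domain name to an IP address.
    t = time.monotonic()
    ip = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
    timings["dns"] = time.monotonic() - t

    # Connect: TCP three-way handshake.
    t = time.monotonic()
    sock = socket.create_connection((ip, port), timeout=timeout)
    timings["connect"] = time.monotonic() - t

    # SSL: negotiate TLS on top of the TCP connection.
    t = time.monotonic()
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)
    timings["ssl"] = time.monotonic() - t

    # Send: write the request; then time to the first byte of the reply.
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    t = time.monotonic()
    tls.sendall(request.encode())
    timings["send"] = time.monotonic() - t
    t = time.monotonic()
    body = tls.recv(4096)
    timings["wait_first_byte"] = time.monotonic() - t

    # Receive: keep reading until the last byte arrives.
    t = time.monotonic()
    while chunk := tls.recv(4096):
        body += chunk
    timings["receive"] = time.monotonic() - t
    tls.close()

    # HTTP: the status line returned by the server.
    status = body.split(b"\r\n", 1)[0].decode(errors="replace")
    return status, timings

if __name__ == "__main__":
    print(check_availability())
```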
And then at 7:20 we see that everything started working the way it's supposed to, and there was no packet loss. Quite interestingly, in the afternoon, at approximately 15:20 UTC, we see a significant spike of packet loss again, essentially repeating the pattern, and that lasted until 17:30 UTC, when the issue was resolved for good.
But what this outlines is the real importance of having visibility. In this particular case, we can see how BGP updates, potentially unwanted BGP activity in this case, even though it started at 30 minutes past midnight, which indicates it might have been maintenance, unfolded into a multi-hour event. We see the effect of that unwanted BGP activity on the path visualization in the form of an extremely large amount of packet loss, and we know that packet loss has a negative consequence on the user experience because it degrades throughput. And then we saw this impact quite clearly on the HTTP server, as part of which we saw, inside out, the negative effects of these events.
So: BGP route visualization to see what was happening; path visualization to see how it reflected on the network, where we can see latency, packet loss, and jitter; and then ultimately the HTTP server view, where we see how it unfolded from the service perspective.
ANGELIQUE: Right. So anytime a BGP change is made, whether a withdrawal or a re-announcement, understanding the impact on the data plane matters: 1) how the routes are propagated out to service providers globally, and 2) how that traffic is then able to reach your service, so you can verify the change had the intended effect. And then, in turn, there's the impact on the service itself. Understanding all those things in parallel is really important.
All right, that's all we have time for today. Thanks again, Kemal, for joining us. And if you like our content, please like and subscribe to our podcast here. We do these updates periodically as incidents come up, but if you're looking for a regular update on the health of the Internet, our colleague Mike Hicks has a biweekly podcast called the Internet Report: Pulse Update that we highly encourage you to check out and subscribe to.
So with that, thanks again for joining us.