Unpacking the Dec. 12 Quad9 BGP Route Leak | Outage Deep Dive
(gentle music)
- [Mike] This is Internet Report,
where we uncover what's working
and what's breaking on
the internet and why.
I'm Mike Hicks,
Principal Solution
Analyst with ThousandEyes.
I'm joined today by Kemal,
Principal Internet
Analyst at ThousandEyes.
Welcome Kemal, how you going?
- [Kemal] Thank you Mike.
It's good to be back
on the Internet Report.
Thanks for the invite.
- [Mike] It's great to have
you mate, always good to chat.
Today we're going to unpack the recent Quad9 BGP incident. Quad9 is an open recursive DNS service; it replaces your default ISP or enterprise DNS configuration. And why is this important? When your computer performs any transaction that uses DNS, and most of them actually do, Quad9 effectively looks up and blocks some malicious host names. So it's obviously quite important if we have an incident affecting that service.
So with that, let's dive in.
(gentle music)
- So yes, Mike, what
really happened is that
we got
reports of
large spikes in packet loss
when it comes to Quad9's service, right.
And it looks like it affected
several different large scale ISPs,
or tier one providers for
that matter, such as Verizon.
And as you can see on the shared screen, we have the ThousandEyes path visualization, which essentially shows three different agents: Dallas, Texas from the perspective of Verizon; Chicago, Illinois, also from the perspective of Verizon; and Los Angeles, California, again from the perspective of Verizon. Now Verizon is a tier one provider, they're quite big, but you can quite clearly see that somewhere around 12:13 UTC on Monday, December the 12th, there was a significant spike in packet loss. It actually went on for quite an extended period of time and ended at approximately 13:35 the same day.
However, as you can see, when I hover over the spike of packet loss on the path visualization timeline, we are showing average values across quite a large number of agents that are assigned to this particular test. However, if we focus, for example, on the Chicago agent, we will see that the impact was actually quite significant. So from the agent dropdown list, I'll type in Chicago.
- So just on that point, what we're saying there, Kemal, is that when we're looking at this view here, we're looking at a percentage that's averaged out across all the agents, so it's essentially showing a low percentage. But now that you've gone into Chicago, we're seeing it sitting up at 100%, so yeah.
- Exactly, you can quite clearly see that it was going to 100%, 90-something percent, and it was pretty significant. Now with the dark red color, we are seeing the Chicago, Illinois agent actually having this large spike, and the same pattern goes for the other two agents in Dallas and Los Angeles too.
However, just before the event started, so we are looking at the data point at around 12:10 UTC on December the 12th, we can quite clearly see that the path we were taking goes from Verizon, and down the line the traffic is still with Verizon, and ultimately somewhere down the path they actually hand the traffic over to Telia, which is again one of the tier one providers, autonomous system 1299, before the traffic reaches 9.9.9.9, right. On the other side, when it comes to the Chicago agent, we can see exactly the same pattern except for the fact that traffic actually went to Level 3, again a tier one provider, with autonomous system number 3356, right.
So when the event started, if we navigate to the loss on the path visualization, you can quite clearly see that even at the very start of the event, packet loss spiked to 92%, which is pretty significant.
- Yeah.
- If you look at the...
- And I think that's also quite significant because it was almost like a light switch. We go from zero all the way up to that 92, 95 percent, like being switched off.
- Exactly. Exactly, it's going to have a profound impact on whatever the service on the other side is. In this case we know that it's a DNS service, but as you can quite clearly see, essentially what really happened is that there was a lot of impact, and looking down the path on the path visualization, you can quite clearly see that there were certain changes in the path. All of a sudden you can see that traffic from all three of these agents, combined, is going through France Telecom, or Orange, and even further down the line you can see that autonomous system 30844, which is Liquid Telecom, essentially starts dropping pretty much all of the traffic. Now it's quite interesting to see that Liquid Telecom, which is based in the Congo, actually attracted this traffic somehow, right.
Now, even before we start speaking about that, there are two things that I want to touch on. The first one is this purple line. Mike, can you tell us a little bit about what these purple lines indicate on the path visualization?
- Yeah, that's an interesting point. So when we see the purple lines, what that's telling us is that we're looking at a network outage, but this is coming to us from our collective intelligence system. So this is where we're putting the tests together and picking it up within Internet Insights, to say that there is some sort of outage, detected from a global perspective, impacting multiple customers.
- Oh, okay, thank you for providing more context on that. So I think the next interesting thing is actually to see what effect this event, this spike in packet loss, had on the DNS server itself. So if I click on the DNS Server view while the Chicago, Illinois agent from Verizon's perspective is selected, we can quite clearly see availability dip from 100%. Before the event, availability was essentially 100%, and during the event, the core function of what the test is doing goes to 0%. There was an intermittent recovery in availability, but for essentially the complete duration of this event, the test was not able to do its job. And what it's doing is trying to resolve the target domain of example.com, querying the A record over UDP, against Quad9's DNS server, which you already described.
So the next question, I guess, Mike...
- Yeah, sorry, before you dive into that, can you just explain a little bit there, or expand a little bit? We're talking here about a DNS server test, and I think you've covered it just briefly there, but a little bit on why this is important, and on reachability. So if we were looking at an HTTP server test, we might see a DNS failure, but this specifically tells us we're looking at the DNS server. Is that right?
- Yes, that's correct. So essentially, why is this important? DNS stands for Domain Name System, or Domain Name Service, right, and essentially the function of this particular service is to translate domain names to IP addresses. So what happens is, whenever you type google.com into your web browser, or whenever you type in any domain name for that matter to view a page or get some resource, one of the first things that's going to happen is the translation of that domain name into an IP address. You know, our computers unfortunately don't understand the concept of domain names; google.com, thousandeyes.com, Cloudflare or Quad9 for that matter do not really mean anything to our computers. But the IP addresses that are behind these names mean everything, right. So in order for a computer to establish a connection to that resource, it needs to translate it, which is essentially what DNS does.
- Good, it's like the zip code that my computer understands. I shout loudly at it and it understands.
(laughing)
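To make the DNS server test described above concrete, here is a minimal sketch of the same lookup in Python: a UDP query for the A record of example.com sent directly to Quad9's resolver at 9.9.9.9. It assumes the third-party dnspython package and is only an illustration of the operation the test performs, not the test's actual implementation.

```python
# Minimal sketch: resolve example.com's A record directly against Quad9 (9.9.9.9)
# over UDP, roughly the operation the DNS server test in the episode performs.
# Assumes the third-party dnspython package is installed (pip install dnspython).
import dns.exception
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

QUAD9 = "9.9.9.9"

def resolve_a_record(domain, server=QUAD9, timeout=3.0):
    query = dns.message.make_query(domain, "A")
    try:
        response = dns.query.udp(query, server, timeout=timeout)
    except dns.exception.Timeout:
        # During the Dec. 12 event, queries toward 9.9.9.9 from affected networks
        # would simply time out, which is what the availability drop reflects.
        return None
    if response.rcode() != dns.rcode.NOERROR:
        return None
    return [rr.address
            for rrset in response.answer
            for rr in rrset
            if rr.rdtype == dns.rdatatype.A]

if __name__ == "__main__":
    print(resolve_a_record("example.com"))
```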
- Fantastic. So now, we saw on the path visualization that there was this pretty profound impact in terms of packet loss, right. Packet loss really did spike quite a lot, and it caused the fundamental functionality of the test itself to completely fail. Now, for that matter, essentially all the people who were using Quad9 as their DNS provider during this particular time were affected by this. So, for example, all of these users in Verizon would have a problem translating their domain names to IP addresses, as part of which they would essentially not be able to use the internet, right.
So, but the real question here, Mike, is essentially: what happened here? If we click here on the BGP route visualization, we can see that there were two different spikes in path changes. There was the one that happened here at approximately 12 o'clock, and we are going to get some more details about it, and then there was the one that probably happened just at the end of the event itself. But let's dive into what actually happened here. On the left-hand side here we have a collection of what we call BGP collectors, which are essentially machines deployed all around the world that are listening to all the updates from the BGP perspective, right?
And BGP, the Border Gateway Protocol, is the protocol that makes the internet possible. It's essentially the mechanism that network equipment uses to exchange information about the companies and the prefixes that they own.
So if we now look at this particular collector that was receiving certain messages, and we click on "show only in this monitor" to view only what was happening from that perspective, we can see that there were certain events. Would you mind taking us through this event, Mike?
- Yeah, so what we can see going on through here, as you said, is some sort of change, some instigation, and effectively what BGP is doing for us, as you said, is exchanging information: tell me how I get to it. So if you think of the internet as this collection of autonomous system networks, easy for me to say, this is where we're starting to see that information being exchanged. So as we look at this path, where we see this change coming through, what we're starting to see in the advertisements is that where we were going straight into this downstream provider, from a WoodyNet perspective, we're now being diverted. We're being told that the best possible route for us to get through is via this Democratic Republic of the Congo based ISP, who is actually advertising the route out. So now we're taking this route; it's changed here, and this is now what we're saying is the best way to get from this location, or from this network I should say, into the Quad9 environment.
- That's absolutely correct, and if you look at this, it's actually quite easy to spot what was happening here. With the dotted red line, we can see that there was a withdrawal, as part of which the connectivity between WoodyNet, AS 42, and Backspace Technologies on the left-hand side essentially got withdrawn. Instead, what's happening is that different paths are being installed, as part of which this particular network starts advertising, starts actually being an upstream provider for WoodyNet, ultimately leading towards the Liquid autonomous system, 30844. As part of which, we are speaking about the kind of event that we call a route leak.
Now, Mike, there are two different types of events that we can speak about when it comes to this. The first one is essentially route leaks, and then there are hijacks, right. And the easy distinction between route leaks and hijacks is essentially the intent behind them. When it's malicious, when someone really tries to take over someone else's traffic, or someone else's prefixes, when they are maliciously advertising them towards the internet, then we are speaking about a hijack. However, when it happens as a result of human error, or a configuration error, or automation-related errors, we are speaking about route leaks. And while it's really hard to distinguish what really happened here, we suspect that this was just an error, as part of which we classified this event as a route leak.
- Yeah, I mean, it's easy to see the characteristics of the change there, but as you said, it's the malicious intent that marks a hijack. And typically what you see within a hijack is that the path is maintained. Here we saw a complete loss of connectivity; we couldn't actually get to the service. What we see in a hijack is a diversion, exactly what it says on the label: it hijacks the traffic and sends it through a different network, with malicious intent.
- Exactly, and the other thing that I wanted to point out is that if you hover over the collector on the left-hand side and you click on view details of the path changes, you can quite literally see...
- Love this view.
- Exactly, you can actually see what the initial path was: the originating ASN, then the intermediate ASNs, and then the ASN where our collectors are essentially located. And then you can quite clearly see that at 12:13:01 UTC, so what you're getting is the exact timestamp of when the event happened, a longer path, which is quite interesting as well, gets installed.
You know, if you think about it, BGP as a protocol prefers shorter AS paths, but in this particular case you can see that as a result of this route leak, due to the way the prefixes were advertised towards the rest of the internet, longer paths got installed and traffic got diverted, right.
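As a rough illustration of that selection behavior, here is a simplified sketch in Python. The AS paths are illustrative stand-ins (AS 3356 for Level 3, AS 5511 for Orange, AS 30844 for Liquid Telecom, AS 42 for WoodyNet), not the exact paths from the incident, and real BGP best-path selection weighs local preference, origin, MED and more before path length. The point is simply that a longer, leaked path only becomes best once the legitimate shorter path is gone.

```python
# Simplified sketch of BGP best-path selection by AS-path length only.
# Real BGP weighs local preference, origin, MED and more before path length.
# The routes below are illustrative stand-ins, not the actual paths from the incident.
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    as_path: list  # leftmost = neighbor AS, rightmost = origin AS

def best_path(routes):
    """Pick the route with the shortest AS path (ignoring all other tie-breakers)."""
    return min(routes, key=lambda r: len(r.as_path)) if routes else None

# Hypothetical routes toward Quad9's 9.9.9.0/24, originated by WoodyNet (AS 42).
legit = Route("9.9.9.0/24", [3356, 42])                # short path via Level 3
leaked = Route("9.9.9.0/24", [3356, 5511, 30844, 42])  # longer path via Orange and Liquid

rib = [legit, leaked]
print(best_path(rib).as_path)  # [3356, 42]: the shorter, legitimate path wins

rib.remove(legit)              # the legitimate path gets withdrawn...
print(best_path(rib).as_path)  # ...and the longer, leaked path becomes best
```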
So now we see what the contributing reason was for the 100% packet loss that we were seeing in the path visualization for Chicago, Los Angeles and some other agents, right. And then we saw how all of that contributed to the service impact from the DNS server perspective. BGP was initially not designed with security as its guiding principle, which is unfortunate, right. And there are certain things that, over the years, certain companies and individuals have tried to push as best practices. One of the things that we need to speak about is RPKI, or Resource Public Key Infrastructure. Would you mind telling us a little bit more about it, Mike?
- So yeah, RPKI, in very simple terms, is really about asking: do I trust these routes coming out of there? So I'm signing my routes, saying these are my networks, my prefixes, so they're authenticated as coming from me. And then you're also checking those authentications on the routes coming in. So these are trusted routes, trusted advertisements; you are allowed to advertise these. That's really it in a nutshell.
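As a sketch of what that check looks like, here is minimal route origin validation logic in Python. The ROA below (authorizing AS 42 to originate 9.9.9.0/24 at up to /24) is an assumption for the example rather than data fetched from the RPKI, and production validators such as Routinator or rpki-client do much more; the core question is simply whether the announcement's origin AS and prefix length match a published ROA. Note that origin validation checks only the origin AS and prefix length, so strict prefix filtering, discussed next, remains important alongside it.

```python
# Minimal sketch of RPKI route-origin validation (ROV).
# A ROA says: "this ASN may originate this prefix, up to this max length."
# The ROA values below are illustrative assumptions, not data fetched from the RPKI.
from dataclasses import dataclass
from ipaddress import ip_network

@dataclass
class ROA:
    prefix: str
    max_length: int
    origin_asn: int

def validate(announced_prefix, origin_asn, roas):
    announced = ip_network(announced_prefix)
    covering = [r for r in roas if announced.subnet_of(ip_network(r.prefix))]
    if not covering:
        return "not-found"      # no ROA covers this prefix
    for roa in covering:
        if roa.origin_asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    return "invalid"            # covered by a ROA, but wrong origin or too specific

roas = [ROA("9.9.9.0/24", 24, 42)]  # hypothetical ROA for WoodyNet (AS 42)

print(validate("9.9.9.0/24", 42, roas))     # valid: the legitimate origin
print(validate("9.9.9.0/24", 30844, roas))  # invalid: a wrong-origin announcement
```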
- Exactly, and before we started speaking about this event, I actually checked, and Quad9 as a company signs all of their prefixes. Which means that if the companies, the upstream providers that actually let this happen, had been enforcing RPKI validation or filtering, then if you are advertising something you haven't signed, they would have dropped you. And even though Quad9 had the prefixes signed, the networks that propagated this further, in this particular case the Liquid ASN, unfortunately did not filter the advertisement, as a result of which we see what we see.
Now, you know, it's kind of funny. During the preparation for this show, I was actually checking some of the prefixes from the Liquid ASN, and it's really interesting that all of their own prefixes are also signed with RPKI. Which means that they want to get the benefits of RPKI, however they're not filtering prefixes themselves. So they are saying: we see the benefit in using RPKI, but we still haven't decided to take that step of filtering prefixes that belong to someone else, right. It's like the last missing step, right.
The other thing is, this goes to speak about other issues, such as improper filtering. So for example, many companies, even before you end up peering with them, are going to have really strict requirements for you to have all of your prefixes in a peering database, for example, and things like that, right. As part of which, if a prefix is not there, registered with the correct objects and so on, there are certain rules, and they're not going to even accept your advertisements. So if a prefix is not explicitly on the list of prefixes that you should be advertising, they're actually going to filter you. Which goes to say that operational practices could be improved.
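Here is a minimal sketch of that kind of per-customer prefix-list filtering in Python. The registered prefix is a documentation example (203.0.113.0/24), not a real customer's; in practice the allow list would be built from routing registry or peering database objects, but the idea is the same: if an announced prefix is not explicitly on the customer's list, it gets dropped.

```python
# Minimal sketch of per-customer prefix-list filtering on a BGP session.
# The registered prefix below is hypothetical; in practice the list would be
# built from routing registry / peering database data for that customer.
from ipaddress import ip_network

# Prefixes the upstream has registered for this customer session.
CUSTOMER_PREFIX_LIST = {
    ip_network("203.0.113.0/24"),  # documentation prefix, standing in for the customer's own space
}

def accept_announcement(prefix):
    """Accept only prefixes that are explicitly on the customer's list."""
    announced = ip_network(prefix)
    return any(announced == allowed or announced.subnet_of(allowed)
               for allowed in CUSTOMER_PREFIX_LIST)

print(accept_announcement("203.0.113.0/24"))  # True: registered for this customer
print(accept_announcement("9.9.9.0/24"))      # False: someone else's prefix, filtered
```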
And thankfully, there is this thing called Mutually Agreed Norms for Routing Security, or MANRS for short, which lists all of these best practices and operational practices that companies should follow. And companies such as Cloudflare and others on the internet are doing a really good job of making sure that participation gets better year over year and that more companies join. But I think there's still a long way to go.
- Yeah, absolutely. And on that point, I mean, Quad9 did a pretty good job, I think, of actually letting us know what was going on. They were quite transparent about what was happening, and I think, to your point, they're also pushing the MANRS effort and saying this is what we want to do. But it's a bit of an uphill battle, I guess, in trying to get everybody on board to do that.
- Yeah, it's going to take time. The thing is, as we said, when BGP was designed, security wasn't its guiding principle, right. So having security as an afterthought, which we've seen even in coding practices and things like that, always tends to be a really hard thing to deal with, right. So because it's opt-in, we kind of get to see the negative effects of things such as route leaks and hijacks from time to time.
Now, if I go back here and click on the path visualization, Mike, we briefly spoke about these purple lines on the timeline, right?
- Yeah.
- So our Internet Insights actually also captured this event. Could you take us through what was happening here?
- Yeah, this is really cool. I like this view here. So as we said up front, when we're looking at those purple swim lanes, they're identifying, in this case, a network outage. So although we're looking at a smaller duration, you can see it started at the same sort of time. What we're looking at here is a subset of agents, and you can see in the middle again where we have that autonomous system; this is where we're having the outage. And from an Internet Insights perspective, we've been able to determine 100% packet loss across a number of tests, which, like I said, has an impact in itself. So if you drill into that, what we can actually do is go down and start to see specifically which interfaces were dropping traffic, where we start to see that forwarding loss occur. And we can see all of these are occurring within Liquid Telecom's environment. And if you look across to the right-hand side, we can then start to see the impacted customers. So we can see the likes of Quad9 coming down from there, and then their downstream provider. I love this view; I spend my life in Internet Insights. So essentially, as I said up front, that purple swim lane indicates we had a network outage. What we're looking at there is the actual interfaces impacted, so that red block up there tells us which interface is impacted, and if we drill into there, we can actually see the location. So we're seeing here where the forwarding loss was occurring, down to the interface name. We're seeing that in Liquid Telecom's environment, and we've actually got the interface names; this is all public domain information. You're also seeing the impacted downstream customers, in this case Quad9 and WoodyNet, who happens to be their downstream peer.
- That's pretty awesome. The other thing that's quite important to note is that actually having real visibility into this, whether it's from the Internet Insights perspective or from the path visualization, BGP alerts and things like that, is a crucial thing to have, right. And not only,
- Absolutely.
- you know, is the importance of visibility kind of self-evident in this particular case, but also having the capability to alert in a timely manner on these events, and having the ability for your NRE (network reliability engineering) teams, or DevOps teams, or SREs, depending on how your organization is run, or just a regular operations team, to have dashboard views of these events, in my opinion, is really important.
- Absolutely. You can't overstress the importance of that visibility. You know, a picture paints a thousand words; I've said this on this podcast a number of times. I'm a very simple person, I like to look at a picture of what's going on, and we can see very clearly what was happening here, from that macro view down into the micro view, to see how it's impacting me. And all of that, you know, we talked about it briefly, the concept of the route leak versus the route hijack. Understanding that path matters: it's not simply do I have a high loss rate; it might even just be a latency increase because I'm now going through 17 different hops, because of the way my network's been advertised. And by visualizing that straight away, you can see where I'm going. So I can do two things. I can start to mitigate the problem by taking other action, re-advertising my prefix or whatever it happens to be, or I can plan for the future to get around it, so that it's not going to happen to me again, by splitting my prefixes or whatever.
- And that's actually quite a good point. In this particular case, if you look at the BGP route visualization, and if you focus, for example, on this particular monitor, you can actually see that the operator was advertising a slash 24, right.
- A slash 24, yep.
- Exactly, a slash 24, which is really important, right. If you think about it, that's the smallest prefix that is going to be accepted on the public internet, which means that Quad9's hands were tied when it comes to what they could do from the traffic engineering perspective, right.
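To see why the /24 matters, recall that routers forward on the longest matching prefix, so the usual way to win traffic back after a leak is to announce more-specifics. Anything longer than a /24 is generally filtered on the public internet, so a /24 like Quad9's cannot be split into /25s that would propagate. Here is a minimal longest-prefix-match sketch in Python, with illustrative forwarding entries rather than real routing table data:

```python
# Minimal sketch of longest-prefix-match forwarding: the reason announcing
# more-specifics is the usual counter to a leak, and why it is unavailable
# for a /24, since most networks filter anything more specific than /24.
from ipaddress import ip_address, ip_network

# Illustrative forwarding entries: (prefix, next-hop label).
routes = [
    (ip_network("9.9.9.0/24"), "leaked path via AS 30844"),
    # A more-specific such as 9.9.9.0/25 would win here, but announcements
    # longer than /24 are generally filtered and never propagate this far.
]

def lookup(destination):
    dst = ip_address(destination)
    matches = [(pfx, hop) for pfx, hop in routes if dst in pfx]
    if not matches:
        return None
    # Longest prefix (largest prefixlen) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("9.9.9.9"))  # traffic follows the leaked /24; there is nothing more specific to prefer
```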
So having visibility, which it seems that they had, to be perfectly honest, is crucial, right. For example, in this particular case, given that your hands are tied from the traffic engineering perspective, the only thing you can potentially do is pick up your phone, call the provider and tell them: you are doing this to me, and it's affecting my reputation, my revenue, and everything that goes with it. So this again stresses the importance of having visibility. You already outlined that point, and I could not agree more.
- Yeah, and that's visibility in real time. We've seen how quickly this happened; we talked about the light switch moment when it goes on and off. If I'm relying on an update from Twitter, or an update on a status page, I'm essentially going to be behind the eight ball. I'm going to be searching for answers, whereas I could have seen it immediately. And the benefit of that is, for future occurrences, I can have some automated processes kick in, as we've seen across a number of customers over a number of years.
- Exactly. Mike, it's been my pleasure speaking to you about this event. I think we uncovered what really happened here. Thank you so much.
- No, it's always my pleasure, mate. Again, we could talk for hours, and we can see people trying to wrap us up, but that's good.
So that's our show.
Don't forget to like and
subscribe and if you do subscribe,
we'll send you a free t-shirt.
Just drop a note to
internetreport@thousandeyes.com
with your address and t-shirt size
and we'll get that straight over to you.
Thanks very much.
(gentle music)