Unpacking the Dec. 12 Quad9 BGP Route Leak | Outage Deep Dive

On December 12, 2022, an ISP in the Democratic Republic of Congo leaked a route belonging to the Quad9 DNS service, causing some traffic, including Verizon US customer traffic, to get routed to Africa for ~90 minutes. Learn more about the incident as it was detected in the ThousandEyes platform.

(gentle music)

- [Mike] This is the Internet Report, where we uncover what's working and what's breaking on the internet, and why. I'm Mike Hicks, Principal Solution Analyst with ThousandEyes, and I'm joined today by Kemal, Principal Internet Analyst at ThousandEyes. Welcome, Kemal, how are you going?

- [Kemal] Thank you, Mike. It's good to be back on the Internet Report. Thanks for the invite.

- [Mike] It's great to have you, mate, always good to chat. Today we're going to unpack the recent Quad9 BGP incident. Quad9 is an open recursive DNS service; it replaces your default ISP or enterprise DNS configuration. And why is this important? When your computer performs any transaction that uses DNS, and most of them do, Quad9 performs the lookup and also blocks some malicious host names. So it's obviously quite important if there's an incident affecting it. With that, let's dive in.

(gentle music)

- So yes, Mike, what really happened is that we got reports of large spikes in packet loss when it comes to Quad9's service. And it looks like it affected several large-scale ISPs, tier one providers for that matter, such as Verizon.

And as you can see on the shared screen, we have the ThousandEyes path visualization, which shows three different agents: Dallas, Texas; Chicago, Illinois; and Los Angeles, California, all from the perspective of Verizon.

Now, Verizon is a tier one provider, they're quite big, but you can quite clearly see that somewhere around 12:15 UTC on Monday, December the 12th, there was a significant spike in packet loss. It went on for quite an extended period of time and ended at approximately 13:35 to 13:40 the same day.

However, when I hover over the packet loss spike on the timeline, we are showing average values across quite a large number of agents assigned to this particular test. If we focus on, for example, the Chicago agent, we'll see that the impact was actually quite significant. So from the agent dropdown list, I'll type in Chicago.

- So just on that point, Kemal, when we're looking at this view here, we're looking at a percentage averaged out across all the agents, so it's essentially showing a low percentage. But now that you've gone into Chicago, we're seeing it sitting up at 100%.

- Exactly, you can quite clearly see that it was going to 100%, 90-something percent, so it was pretty significant. Now, with the dark red color, we're seeing the Chicago, Illinois agent having this large spike, and the same pattern goes for the other two agents in Dallas and Los Angeles too.

However, just before the event started, looking at the data point at around 12:10 UTC on December the 12th, we can quite clearly see that the path we were taking goes through Verizon, and down the line traffic is still with Verizon, until somewhere down the path they hand the traffic over to Telia, again one of the tier one providers, autonomous system 1299, before the traffic reaches 9.9.9.9.

On the other side, when it comes to the Chicago agent, we can see exactly the same pattern, except that the traffic went to Level 3, again a tier one provider, with autonomous system number 3356.

So when the event started, if we navigate to the loss on the path visualization, you can quite clearly see that even at the very start of the event, packet loss spiked to 92%, which is pretty significant.

- Yeah.

- And I think that's also quite significant, because it was almost like a light switch. We go from zero right up to that 92, 95 percent; it just switches off.

- Exactly.

It's going to have a profound impact on whatever the service on the other side is. In this case we know it's a DNS service. But as you can quite clearly see, there was a lot of impact, and looking down the path on the path visualization, there were certain changes in the path.

All of a sudden, you can see the traffic, combined from all three of these agents, going through France Telecom, or Orange. And even further down the line you can see that autonomous system 30844, which is Liquid Telecom, essentially starts dropping pretty much all of the traffic. Now, it's quite interesting that Liquid Telecom, which is based in Congo, attracted this traffic somehow.

Now, even before we start speaking about that, there are two things I want to touch on. The first one is this purple line. Mike, can you tell us a little bit about what these purple lines indicate on the path visualization?

- Yeah, that's an interesting point. When we see the purple lines, what that's telling us is that we're looking at a network outage, but this is coming to us from our collective intelligence system. This is where we put the tests together and pick the event up within Internet Insights, to say there's some sort of outage detected from a global perspective, impacting multiple customers.

- Oh, okay, thank you for providing more context on that. So I think the next interesting thing is to see what effect this spike in packet loss had on the DNS server itself. If I click on the DNS server view, while the Chicago, Illinois agent from Verizon's perspective is selected, we can quite clearly see availability dip from 100%. Before the event, availability was essentially 100%, and during the event, the core function of what the test is doing goes to 0%.

There was an intermittent spike in availability, and it eventually recovered, but for the complete duration of this event the test was essentially not able to do its job. And what the test is doing is trying to resolve the target domain example.com, using UDP, for the A record, against Quad9's DNS server.

So the next question, I guess, Mike, is really...

- Yeah, sorry, before you dive into that, can you just expand a little bit there? We're talking here about a DNS server test, and I think you've covered it briefly, but a little bit on why this is important, rather than just reachability. If we were looking at an HTTP server test, we might see a DNS failure, but this specifically tells us we're looking at a DNS server. Is that right?

- Yes, that's correct. So essentially, why is this important? DNS stands for Domain Name System, or Domain Name Service, and the function of this particular service is to translate domain names to IP addresses. Whenever you type google.com, or any domain name for that matter, into your web browser to view a page or get a resource, one of the first things that happens is the translation of that domain name into an IP address. Our computers unfortunately don't understand the concept of domain names; google.com, thousandeyes.com, Cloudflare, or Quad9 for that matter don't really mean anything to our computers. But the IP addresses behind those names mean everything. So in order for a computer to establish a connection to that resource, it needs to translate the name, which is essentially what DNS does.

- Got it, it's like a zip code my computer understands. I shout loudly at it and it understands.

(laughing)
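
For illustration, the DNS server test described above can be approximated in a few lines of Python. This is a minimal sketch, assuming the third-party dnspython library; 9.9.9.9 is Quad9's public resolver address, and during an outage like this one the query would simply time out.

```python
import dns.resolver  # third-party: pip install dnspython

# Point a stub resolver directly at Quad9's public recursive service.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["9.9.9.9"]

# Ask for the A record of example.com; dnspython queries over UDP by
# default, matching the test described above.
answer = resolver.resolve("example.com", "A")
for record in answer:
    print(record.address)
```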

- Fantastic. So, we saw on the path visualization that there was this pretty profound impact on packet loss. Packet loss really did spike quite a lot, and it caused the fundamental functionality of the test itself to completely fail.

Essentially, all the people using Quad9 as their DNS provider during this particular time were affected by this. So all of these users in Verizon, for example, would have a problem translating their domain names to IP addresses, as part of which they would essentially not be able to use the internet.

But the real question here, Mike, is: what actually happened? If we click onto the BGP route visualization, we can see that there were two different spikes in path changes. There was one that happened here at approximately 12 o'clock, and we're going to get some more details about it. And then there was one that probably came right at the end of the event itself. But let's dive into what actually happened here.

On the left hand side
here we have a collection

of what we called BGP collectors.

Which are essentially machines

deployed all around the
world that are listening.

All the updates from the
BGP perspective, right?

And the BGP is this protocol,

Border Gateway Protocol is, you know,

the protocol that makes internet possible.

It's essentially mechanism

that, you know,

network equipment uses to
actually exchange information

about the companies and
the prefixes that they own.

So if we now look at
this particular collector

that was receiving certain messages

and we click on show only in
this monitor to, you know,

kind of view only, what was
happening from that perspective.

We can see that you were certain events.

Would you mind taking us
through this event, Mike?

- Yeah, so what we can see going on here, as you said, is some sort of change, and effectively what BGP is doing for us, as you said, is exchanging information: tell me how I get to you. So if you think of the internet as this collection of autonomous system networks, easy for me to say, this is where we start to see them exchange that information. As we look at this path where we see the change coming through, what we're starting to see in the advertisements is that where we were going straight into this downstream provider, from a WoodyNet perspective, we're now being diverted. The best possible route being advertised for us to get through is now via this Democratic Republic of Congo-based ISP, who's actually advertising the route out. So we're taking this route; it's changed here, and this is now what we're saying is the best way to get from this network into the Quad9 environment.

- That's absolutely correct, and if you look at this, it's actually quite easy to spot what was happening with the dotted red line. We can see that the connectivity between WoodyNet, AS42, and Backspace Technologies on the left hand side essentially got withdrawn, and instead we're installing different paths, as part of which this particular ISP starts advertising, starts actually being an upstream provider for WoodyNet, ultimately leading towards the Liquid autonomous system 30844. So we are speaking about the event that we call a route leak.

Mike, there are two
different types of events

that we can speak when it comes to this.

The first one is,

the first one is essentially route leaks

and there are hijacks, right.

And easy distinction when
it comes to route leaks

and hijacks are essentially,

what's the intent behind that, right.

When it's malicious, when someone
really tries to, you know,

take Uber someone, traffic
force, someone's prefaces

once they are maliciously
advertising traffic towards the,

towards the internet, right.

Then we are speaking
about hijack, however,

when it happens as a result of human error

or configuration error or mechanisms.

So automation related errors,

we are speaking about truth leaks

and while it's really hard to distinguish

what really happened here.

We are suspecting that this
was just, you know, an error

as part of which we classified
this event as a root leak.

- Yeah, I mean, it is easy to say, but there's the characteristic of the change there. You mentioned the malicious intent behind a hijack, and typically what you see within a hijack is that the path is maintained. Here we saw a complete loss of connectivity; we couldn't actually get to the service. What you see in a hijack is a diversion, exactly what it says on the label: it hijacks the traffic and routes it through a different network, for that malicious intent.

- Exactly. And the other thing I wanted to point out is that if you hover over the collector on the left hand side and click on "view details of the path changes", you can quite literally see it here.

- Love this view.

- Exactly, you can actually see what the initial path was: the originating ASN, the intermediate ASNs, and the ASN where our collectors are essentially located. And then you can quite clearly see that at 12:13:01 UTC, so what you're getting is the exact timestamp of the event, a longer path, which is quite interesting, gets installed. If you think about it, BGP as a protocol prefers shorter AS paths, but in this particular case you can see that as a result of this route leak, due to the way the prefixes were advertised towards the rest of the internet, longer paths got installed and traffic got diverted.
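
As a rough illustration of that preference, here's a toy comparison of AS path lengths. The paths are hypothetical stand-ins, not the exact routes from the incident; in the real event the original route was withdrawn, so the longer leaked path was all that remained to install.

```python
# Among otherwise-equal routes for the same prefix, BGP best-path
# selection prefers the shorter AS path.
routes = {
    "pre-leak": [3356, 42],               # e.g., Level 3 -> WoodyNet (AS42)
    "leaked":   [3356, 5511, 30844, 42],  # hypothetical detour via Orange, Liquid
}
best = min(routes, key=lambda name: len(routes[name]))
print(f"preferred: {best} (AS path length {len(routes[best])})")
```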

So now we see what the contributing reason was for the 100% packet loss we were seeing in the path visualization for Chicago, Los Angeles, and some other agents. And then we saw how all of that contributed to the service impact from the DNS server perspective.

BGP was initially not designed with security as its guiding principle, which is unfortunate. And there are certain things that, over the years, certain companies and individuals have tried to push as best practices. One of the things we need to speak about is RPKI, or Resource Public Key Infrastructure. Would you mind telling us a little bit more about that, Mike?

- So yeah, RPKI, in very simple terms, is really: do I trust these routes coming out of there? I'm signing my routes, saying these are my networks, my prefixes, so they're authenticated as coming from me. And then you are also checking those authentications on the routes coming in: these are trusted routes, these are trusted advertisements, you are allowed to advertise these. That's it in a nutshell.
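
In code terms, RPKI route origin validation boils down to a coverage check. The sketch below follows the RFC 6811 semantics with placeholder documentation values, not Quad9's actual ROAs: a ROA authorizes an origin ASN to announce a prefix up to a maximum length.

```python
from ipaddress import ip_network

# (authorized prefix, max length, authorized origin ASN) -- placeholders
roas = [(ip_network("192.0.2.0/24"), 24, 64500)]

def rov_state(prefix: str, origin_asn: int) -> str:
    """Classify an announcement as valid, invalid, or not-found."""
    net = ip_network(prefix)
    covering = [r for r in roas if net.subnet_of(r[0])]
    if not covering:
        return "not-found"  # no ROA covers this prefix
    for roa_net, max_len, roa_asn in covering:
        if origin_asn == roa_asn and net.prefixlen <= max_len:
            return "valid"
    return "invalid"        # covered, but wrong origin or too specific

print(rov_state("192.0.2.0/24", 64500))  # valid
print(rov_state("192.0.2.0/24", 64501))  # invalid -- an enforcing upstream drops it
```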

- Exactly, and before we started speaking about this event, I actually checked, and Quad9 as a company signs all of their prefixes. Which means that if the companies, or their upstream providers, that let this happen had been enforcing RPKI verification or filtering, then anyone advertising something they hadn't signed would have been dropped. But even though Quad9 had their prefixes signed, the upstream companies, in this particular case the Liquid ASN, who propagated this further, unfortunately did not filter the advertisement, as a result of which we see what we see.

Now, it's kind of funny: during the preparation for this show, I was actually checking some of the prefixes from the Liquid ASN. And it's really interesting that all of their own prefixes are also signed with RPKI. Which means they want to get the benefits of RPKI, but they're not filtering prefixes themselves. So they're saying, we see the benefit in using RPKI, but we still haven't decided to take the step of filtering out prefixes that belong to someone else. It's like the last missing step.

The other thing is, this speaks to other issues such as improper filtering. For example, many companies, even before you end up peering with them, are going to have really strict requirements to have all your prefixes in a peering database, registered with the correct objects and so on. There are certain rules there, and they're not going to accept your advertisements otherwise. So if a prefix is not explicitly on the list of prefixes that you should be advertising, they're going to filter you out, which goes to say that operational practices could be improved.
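
The allow-list filtering described here is conceptually simple; the sketch below uses placeholder prefixes rather than real registry data. In practice the list would be generated from the peer's registered routes in an IRR database.

```python
from ipaddress import ip_network

# Prefixes the peer has registered and is therefore allowed to announce.
allowed = {ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")}

def accept(prefix: str) -> bool:
    # Exact-match policy for simplicity; real filters also cap prefix length.
    return ip_network(prefix) in allowed

print(accept("192.0.2.0/24"))    # True  -- registered, accepted
print(accept("203.0.113.0/24"))  # False -- unregistered, filtered at the edge
```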

And thankfully there is this thing called Mutually Agreed Norms for Routing Security, or MANRS for short, which lists all of these best practices and operational practices that companies should follow. Companies such as Cloudflare and others on the internet are doing a really good job of making sure that participation gets better year over year and that more companies join. But I think there's still a long way to go.

- Yeah, absolutely. And on that point, I mean, Quad9 did a pretty good job of letting us know what was going on. They were quite transparent about what was happening, and I think, to your point, they're also pushing the MANRS effort and saying, this is what we want to do. But it's a bit of an uphill battle, I guess, in trying to get everybody on board to do that.

- Yeah, it's going to take time. The thing is, as we said, when BGP was designed, security wasn't its guiding principle. And having security as an afterthought, which we've seen even in coding practices and things like that, always tends to be a really hard thing to fix. So because it's opt-in, we get to see the negative effects of things such as route leaks and hijacks from time to time.

Now, if I go back here and click on the path visualization, Mike, we briefly spoke about these purple lines on the timeline, right?

- Yeah.

- So our Internet Insights also captured this event. Could you take us through what was happening here?

- Yeah, this is really cool. I like this view. As we said up front, when we're looking at those purple swim lanes, it's actually identifying, in this case, a network outage. So although we're looking at a smaller duration, you can see it started at the same sort of time. What we're looking at here is a subset of agents, and you can see in the middle again where we have that same autonomous system; this is where we're having the outage. And from an Internet Insights perspective, we can determine there's 100% packet loss across a number of tests, which, like I said, have an impact themselves.

So if you drill into that, we can actually go down and see specifically where those interfaces were dropping traffic, where we start to see that forwarding loss occur. And we can see all of these are occurring within the Liquid Telecom environment.

And if you look across to the right hand side, we can then start to see the impacted customers. So we can see the likes of Quad9 coming down from there, and then their downstream provider, WoodyNet. I love this view; I spend my life in Internet Insights.

So essentially, what we're looking at here, as I said up front, is that the purple swim lane indicates we had a network outage. What we're looking at there is the actual interfaces impacted, so that red block up there tells us the interface impacted, and if we drill into it we can actually see the location. We're seeing here where the forwarding loss was actually occurring, down to the interface name. So we're seeing that it's in Liquid Telecom's environment, and we've actually got the interface names; this is all public domain information. You're also seeing the impacted downstream customers, in this case Quad9 and WoodyNet, who happens to be their downstream peer.

- That's pretty awesome. The other thing that's quite important to note is that having real visibility into this, whether it's from the Internet Insights perspective or from the path visualization, BGP alerts, and so on, is a crucial thing to have.

- Absolutely.

- And not only is the importance of visibility self-evident in this particular case, but also having the capability to alert in a timely manner about these events, and having your NREs, your network reliability engineering teams, or DevOps teams, or SREs, depending on how your organization is run, or just a regular operations team, have dashboard views of these events is, in my opinion, really important.

- Absolutely. You can't overstress the visibility. You know, a picture paints a thousand words; I've said this on this podcast a number of times. I'm a very simple person. I like to look at a picture of what's going on, and we could see very clearly what was happening here, from that macro view down into the micro view, to see how it's impacting me. And all of that ties back to what we talked about briefly, the concept of the route leak versus the route hijack. Understanding that path matters: do I simply have a high loss rate, or might it even just be a latency increase, because I'm now going through 17 different hops because of the way my network's been advertised? By visualizing that straight away, you can see where the traffic is going. So I can do two things. I can mitigate the problem by taking action, re-advertising my prefix or whatever it happens to be, or I can plan for the future so it's not going to happen to me again, by splitting my prefixes or whatever.

- And that's actually quite a good point. In this particular case, if you look at the BGP route visualization and, for example, focus on this particular monitor, you can actually see that the operator was advertising a slash 94...

- So, 24. Slash 24.

- Exactly, slash 24, which is really important. If you think about it, that's the most specific prefix that's going to be accepted on the public internet, which means Quad9's hands were tied when it comes to what they could do from the traffic engineering perspective.
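
To spell out why the /24 mattered, here's a small sketch with a placeholder prefix. Routers forward on the longest matching prefix, so the classic counter to a leak is announcing a more specific route; but /24 is generally the most specific IPv4 prefix accepted on the public internet, so anything more specific would itself be filtered.

```python
from ipaddress import ip_network

prefix = ip_network("192.0.2.0/24")  # placeholder for the affected /24
more_specifics = list(prefix.subnets(prefixlen_diff=1))
print(more_specifics)  # [192.0.2.0/25, 192.0.2.128/25]

# Both halves are longer than /24, so most networks would reject them,
# leaving no more-specific escape hatch for traffic engineering.
print(all(net.prefixlen > 24 for net in more_specifics))  # True
```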

So having visibility, which it seems they had, to be perfectly honest, is crucial. In this particular case, given that your hands are tied from the traffic engineering perspective, the only thing you can potentially do is pick up your phone, call the provider, and tell them: you are doing this to me, and it's affecting my reputation, my revenue, and everything that goes with it. So this again stresses the importance of having visibility. You already outlined that point, and I could not agree more.

- Yeah, and it's visibility in real time. We saw this happen immediately; we talked about the light switch moment when it goes on and off. If I'm relying on an update from Twitter or an update on a status page, I'm essentially going to be behind the eight ball, whereas I could see it immediately. And the benefit of that, for future occurrences, is that I can have some automated processes kick in, as we've seen across a number of customers over a number of years.

- Exactly. Mike, it's been my pleasure speaking to you about this event. I think we uncovered what really went on here. Thank you so much.

- No, it's always my pleasure, mate. We could talk for hours, and I can see people trying to wrap us up, but that's good.

So that's our show. Don't forget to like and subscribe, and if you do subscribe, we'll send you a free t-shirt. Just drop a note to internetreport@thousandeyes.com with your address and t-shirt size, and we'll get that straight over to you. Thanks very much.

(gentle music)

We want to hear from you! Email us at internetreport@thousandeyes.com