Twitter in the Elon Era + Microsoft & AWS Outages | Pulse Update
(upbeat ethereal music plays)
This is the Internet Report's
bi-weekly Pulse Update,
where we'll keep our finger
on the pulse of the internet
and see how it's holding
up week after week.
We often hear that the
internet is held together
with chewing gum and string,
and quite frankly, the truth
is only slightly less concerning.
Every other week, I'll
be back here sharing
the latest outage numbers,
highlighting a few interesting outages,
trends, traits, and general
health of the internet.
This week, I'm joined by my
friend and colleague, Kemal.
How are you, mate?
How's it going?
- Hey Mike, good to be here.
Thanks for the invite.
- Before we get started,
I do actually wanna welcome everyone.
This is a new podcast
and a new flavor of the internet report.
So, about this podcast:
we've been producing the show
in blog form for nearly a year now,
so we really thought it was about time
we verbalized it
and started sharing our findings
in podcast form.
But don't worry,
we're still gonna be producing
the deep dive internet report
anytime anything major happens.
The idea of this is just
to keep something regular
within your podcast that you
could actually listen to,
to understand what's going on
from the health of the internet.
Now, just before we get started,
in terms of housekeeping,
we'd love for you to
hit Like and Subscribe.
So you can do that and keep your finger
on the pulse of the internet every week,
'cause as I said, we're gonna be
putting this out regularly.
But any questions you've got, any ideas,
anything you want us to look into,
reach out to us anytime
at the internetreport@thousandeyes.com,
and we're happy to potentially
address those questions
in future episodes.
All right, let's dive in.
And listen before we start,
we're gonna look at the numbers
to see what happened this week.
(upbeat music)
So what we're looking at here
is the numbers, and one of the
interesting things, Kemal, that we've
started to see is this decrease
as we come down.
This is obviously going back
to early October there,
and we can see it start to increase.
What we're looking at is
the global outages observed,
and then we break that down
to a U.S. perspective as well.
Just really interesting to see the trends.
And the reason we're
breaking out the U.S. outages
is that, as I've looked at these numbers
over time, what I've actually
started to see is that
U.S. outages typically
account for about 38%
of all the observed outages.
So it just makes sense
for us to track those
and see what's going on there.
But the interesting thing I
really wanna call out here is
what happens as we get into
the November timeframe.
So we're getting into an area there —
this is kind of seasonal,
we see a drop
coming off from there.
So we see this decrease from 352 to 231,
and then it dropped
dramatically here to 222, right?
That's roughly a 33% drop.
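For anyone who wants to make the arithmetic behind that trend concrete, here's a tiny sketch of how a week-over-week percentage drop like that is worked out. The inputs are just the weekly counts quoted above, so treat the output as illustrative rather than an official figure.

```python
# Week-over-week change in observed outages, as a percentage decrease.
# The inputs are simply the weekly counts quoted in this episode.
def pct_drop(previous: int, current: int) -> float:
    """Percentage decrease from one week's outage count to the next."""
    return (previous - current) / previous * 100

print(f"{pct_drop(352, 231):.0f}% drop")  # roughly a third fewer outages
```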
That's not necessarily unusual,
it's actually quite seasonal,
but it's interesting that it's occurring
around that week of November 21st to 27th,
which for us in the...
well, me in the southern hemisphere,
you in the northern hemisphere —
neither of us is in North America,
but over there, that's Thanksgiving week.
- [Kemal] Yeah, it's
actually quite interesting
to see this dip for the
week of November 21st
till November 27th, right?
As we know, it's Thanksgiving week,
people are probably
taking their vacations,
and interestingly enough,
as you pointed out,
this is the period of time
when things are kind of slowing down
in terms of changes and stuff like that.
And companies are moving
towards change freezes,
to be perfectly honest.
If you think about it,
the holiday season is
when companies are
gonna have fewer employees
on call and fewer engineers
looking at this stuff,
so it kind of makes sense
that this dip we see,
especially for that November
21st to 27th week, happened, right?
So hopefully people were
enjoying their holidays
and, you know, having a good time, right?
- [Mike] I give thanks down here
because it means my email
inbox is kind of quiet
for that week.
So I can see the same thing going on there.
But it is interesting, and you talk
about the freezes we see there —
we see this sort of thing coming
across, and like I said,
if I look back seasonally, you know,
we've got this data going back years,
and we can actually see that
it kind of follows these
patterns over time.
And maybe not for this episode,
but something to dig into later on:
I'm also now starting to see
different patterns happen.
So as we come out of November,
I'm seeing a bit
of a rush in some areas,
and then I expect it to drop
off again over the holiday period.
But while we're at that point,
let's take a look at what
the internet has served
up for us in terms of outages,
and let's go under the hood with a couple
of interesting events.
(upbeat music)
All right, Kemal, so
having promised you
we'd dive into outages,
there really isn't
that much that's thematic
about the set of incidents
and degradations that we
wanna talk about today,
but there were some really
interesting idiosyncrasies
in there,
and I think they warrant
some investigation.
So let's jump in.
The first one we wanna talk
about is a general look
at the Twitter landscape
since the sale to Elon Musk.
And while nothing major happened —
again, I'll probably use that phrase a lot —
I do find it fascinating,
some of these little footprints
and traits we've actually
started to see there.
Just a quick note again,
before we dive into this:
if you're curious
about what we're seeing
and what we're showing here on the screen —
for those of you listening along —
and you want to dive into
these views yourselves,
we'll have various links
in the show notes, or you can
go to the pulse update blog as well.
They'll have the screenshots
and the actual views,
so you can see for yourself
what it is we're
going on about here.
So if we go back in time,
we can start to see where these
outages are occurring.
They're quite sporadic, but
as we get into this area
where the sale took place,
what we actually start to
see is the intensity starting
to increase, and we see
these periods here where
quite a lot of activity is
happening from an outage perspective.
As I said, again, these
are all kind of sporadic.
And what was really interesting was
that none of these were
that devastating
in terms of really impacting end users.
People were reporting lagginess,
people were saying
things were failing to load.
But what's really interesting
from our perspective was
the way we were able to
visualize this, to be able
to see what was going on.
So to start with, I saw these
outages occurring —
these hits and degradations in service —
and then was able to drill
down into each level.
The other thing I wanna say
about that, really, is that a lot
of people, when the
sale took place, were
saying it's gonna fall
over within three weeks.
We haven't seen any of that.
Like I said, we've seen
these degradations occurring
in these small instances,
which to a degree —
and I think you and I
discussed this offline before —
is almost understandable:
if I'm going in and starting to look
at something to understand what's going on,
I might wanna just turn things
on and off to see
exactly what's happening.
- [Kemal] Yeah, and the
other thing is, you know,
there was a lot of discussion on what went
on, and there were some
unfortunate events such
as layoffs and, you know,
other things that happened
during this timeframe.
But it actually
speaks volumes, the fact
that Twitter did not buckle, right?
It speaks to the really good SRE
and architectural practices
that they were following, right?
If you think about it,
reducing the workforce was quite
significant — you know,
there was a lot of churn as
a result of people's dissatisfaction
over there, you know,
and I was actually quite curious
about what this was gonna look like.
And as you pointed out,
these are just sporadic
outages taking place.
And yes, while
they were affecting people
to a certain degree,
nothing major happened.
And when I was reading some blog posts
from the people who
unfortunately departed the company,
they were writing about SRE practices
and good engineering practices,
and I was quite impressed
by what they were doing at the company.
So actually, if you think about it,
it doesn't surprise me that, you know,
the architecture and the
platform held up so well.
The last thing that
actually took place, as far
as I can recall, was
that route leak that happened
back in March of 2022,
which we actually covered
on the Internet Report —
but the thing is, that
was outside of their control.
So the fact that the platform held up
through all of this that
happened to the company, you know,
it actually speaks quite
a lot about how
good the architecture and
infrastructure actually is.
- [Mike] Yeah, absolutely.
So what we saw within the
system itself, like I said,
were 503 service unavailables,
where things were dropping out.
They were all really short
durations — you know,
a minute, two minutes at a time.
The longest one I think
I saw was 11 minutes,
which was something around
authentication issues,
which to a degree have been documented,
where there was a failure to send out
some two-factor authentication
codes via email.
But really, like I said,
everything was really short,
and it was within the application itself.
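For anyone curious what a run of short 503 windows like that looks like from the outside, here's a rough sketch of a periodic HTTP probe that simply logs status codes and timings. The URL and interval are made up, and this is far cruder than what a real monitoring agent does — it's just to illustrate the idea of catching one- or two-minute error windows.

```python
# Rough sketch: poll an endpoint and note when it returns errors.
# The URL and interval are hypothetical; a real agent does far more
# (multiple vantage points, waterfalls, path data, and so on).
import time
import requests

URL = "https://example.com/home"  # placeholder target
INTERVAL = 60                     # seconds between probes

while True:
    start = time.time()
    try:
        status = requests.get(URL, timeout=10).status_code
    except requests.RequestException as exc:
        status = f"error: {exc.__class__.__name__}"
    # A short burst of 503s across one or two samples would look like
    # the brief service-unavailable windows described above.
    print(f"{time.strftime('%H:%M:%S')} status={status} "
          f"took={time.time() - start:.2f}s")
    time.sleep(INTERVAL)
```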
So one of the things I
found really interesting
was how this manifests itself
in very small areas
within the application itself.
So this is a really cool one.
As I think I've said
to you many times, I'm a simple man,
I like looking at pictures.
This one was really beautiful
as it came down.
So what we're looking at
here is an individual transaction
where we're actually going to the site.
And this is where we saw one
of these outages occurring —
one of these degradations
in service — and what we see down the bottom
in the waterfall is a beautiful staircase.
I could actually walk
down those steps quite nicely
into the garden there,
these steps
coming in from there.
But what it is is a whole
series of redirects.
So it's simply just redirecting to itself.
Basically it's looking
for something, it's redirecting,
it gets back to itself and it says,
no, I'm not the right place,
and sends the request straight
back to itself again.
So it gets stuck in this loop,
and eventually it times out —
in our case it times out
from a test perspective —
but also what happens is
you get too many redirects
and the system actually stops itself.
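To make that redirect-loop behaviour concrete, here's a minimal sketch of how an HTTP client bails out of a loop like this. It uses Python's requests library with a hypothetical URL — it's not how the ThousandEyes transaction test is implemented, just an illustration of the timeout and too-many-redirects failure modes described above.

```python
# Minimal sketch of a client giving up on a redirect loop.
# The URL is hypothetical and this is not ThousandEyes' test logic.
import requests

session = requests.Session()
session.max_redirects = 10  # cap how many 3xx hops we're willing to follow

try:
    resp = session.get("https://example.com/home", timeout=30)
    print(resp.status_code, resp.url)
except requests.exceptions.TooManyRedirects:
    # The server kept redirecting back to itself, so the client stops
    # rather than walking the "staircase" forever.
    print("Gave up: too many redirects")
except requests.exceptions.Timeout:
    # Or the request simply runs out of time, which is what the
    # transaction test did in the example above.
    print("Gave up: request timed out")
```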
But if I just quickly go
back to before that, just to show the
contrast between what happened,
this is immediately before,
and this is what we'd expect to see
from a page load — a
waterfall, I should say.
We actually see it
come down really nicely,
quite parallel, with those
processes happening in turn.
So these are just little
glitches. What you'd have seen
in this instance would've
been maybe some
laggy performance.
It would've just
delayed for a minute, and
then my session would've
connected from there.
So all these little bits of
fiddling around to
see what was happening —
I just find it kind of fascinating
to see what was there.
- [Kemal] Yeah, it's funny
that they call this view a waterfall,
you know, this particular
way of looking at the objects,
but this actually does look
like a waterfall to me, so. (laughs)
- [Mike] It does, doesn't it.
Nice waterfall coming
down through from there.
Okay, so having promised
you outages, Kemal,
now we're actually gonna get into one.
This is a Microsoft Office 365 outage —
or actually a Microsoft outage —
that occurred on December the 2nd
and impacted mostly people
in the APAC region,
which is why it's
close to my heart.
It actually was
quite significant:
about an hour and 20 minutes
that we saw
it occurring.
- [Kemal] So it's actually
quite interesting to see how
this event unfolded.
You know, if you look at the timeline view
from the application outages perspective
and you move forward,
you will see how the application
had an outage affecting
Tokyo users in Japan, right?
And as we progress throughout the event,
you will see that more and more users
within the Asia-Pacific region
actually start getting impacted.
And as you can see here,
we have quite a beautiful representation
of what we're just speaking about.
So on the right-hand side you
can see people in Singapore,
Hong Kong, Kuala Lumpur, and
Tokyo being affected, right?
And you know, to Microsoft's credit,
they actually publicly
announced what happened.
So there was
a root cause analysis
and stuff like that.
And it turns out that this was the result
of legacy code that was not able to
process requests in time.
And, you know, quite interestingly,
the solution for this particular
event was, you know,
just to reload, right?
- [Mike] That's right.
Yeah, yeah, absolutely.
That's really cool, and as you say,
it was the legacy system unable to
process things in time there,
and if we look at the
errors we were seeing,
they were really just timeouts.
So the system was failing.
And what I like about this
is the moment in time — so, you know,
this was occurring at 10 to 1 UTC,
which is actually 10 to 10
Japan Standard Time, which,
you know, perhaps explains why it was
affecting the infrastructure
in that region, and is also why
they noticed it:
it was in the middle of
their working day when
the outage happened.
- [Kemal] Exactly.
- [Mike] But the, yeah, go on, sorry.
- [Kemal] And we can quite clearly see
that it affected, you know,
a lot of users. We've said it multiple times:
there's this paradigm
shift as part of which
the cloud is our new data center, right?
And software-as-a-service
applications are, you know,
the way forward, as part of
which everyone's
either using software-as-a-service
applications
or they're, you know, moving
towards the infrastructure-as-a-service
model. And, you know,
this being a SaaS suite
of applications, the outage
probably was negatively felt
by the users in the region.
- [Mike] Yeah, absolutely.
And to your point there
about it being SaaS and
everyone relying on it,
which kind of throws me back
to my numbers at the start —
and I won't go back into the details,
but we talked about the change
freezes there.
This is also an indication
that there's now a huge reliance
on the internet, not
just for people at home but
also for businesses themselves.
You know, we had an issue
here with a SaaS application,
and this sort of dramatic effect from it.
And this moment in
time here is where the actual
turn-it-off-and-on-again
started to occur
and really started to
clear the queues down.
As you said, Microsoft were
really quite open about this.
They said that they had this issue
with legacy infrastructure —
a process, sorry,
a legacy process there
around token authentication —
and then they actually
did this reset to
effectively clear the queues,
move workloads off, and
shift things around.
And we can see again my
step process comes back in,
and we see it recover from there.
But as you say,
we saw the complete outage, and again
a beautiful picture is painted.
- [Kemal] Agreed.
- [Mike] Let's move on to our last outage
for this week. On December the 5th,
just after 2:30 PM Eastern Standard Time,
AWS's Ohio-based US-East-2 region
experienced connectivity issues.
What do we see, Kemal?
- [Kemal] Yeah, so
what we're actually looking
at is intermittent spikes
in packet loss that affected connectivity
towards the US-East-2 Amazon region,
which is quite a large
region, for that matter.
So if we look at the path visualization
for this particular test, at approximately
7:30 PM UTC on the 5th of December,
we see that multiple agents —
in fact, 18 agents that were assigned
to this test — are executing everything
correctly, you know,
everything is working fine.
However, at approximately
7:40 we can start seeing
certain red circles,
which actually means that
some of these hops
were experiencing what
we call forwarding loss.
So if I move forward
within the event timeline,
you can quite clearly see
that we have this chunk
of, you know, red circles
towards the right-hand side,
which essentially indicates
that traffic going towards the AWS region —
US-East-2 in this particular case —
was having some problems. Now,
if we zoom in a little bit on this event
and select one of the
agents, such as the Seattle agent
in CenturyLink,
from the agent dropdown menu,
we're gonna start seeing
values for that one.
And if I select that one
as the only agent for
that particular view,
we're gonna see how this event unfolded.
And as you can see, on the left-hand side
we see the Seattle agent deployed
in CenturyLink, and the traffic was going
towards the agent that's
deployed in AWS, in US-East-2.
And if I hover over this red circle,
which tells me where the
forwarding loss was experienced,
I can see quite an
interesting thing, in the form
of the fact that the
reverse DNS lookup tells me
that this was a peering
link between Level 3 —
which is a tier one provider,
autonomous system number 3356 —
and Amazon, AS 16509, right?
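The reverse DNS trick Kemal mentions here is easy to try yourself. Here's a small sketch of looking up a hop's PTR record to get a hint about which networks a link sits between — the IP address below is a documentation placeholder, not the actual hop from this event.

```python
# Sketch: reverse-DNS a traceroute hop to hint at who owns it.
# The address is a placeholder (TEST-NET), not the real hop from this event.
import socket

hop_ip = "192.0.2.1"

try:
    hostname, _, _ = socket.gethostbyaddr(hop_ip)
    # PTR records on peering links often embed both networks' names,
    # which is the kind of clue described above.
    print(f"{hop_ip} -> {hostname}")
except OSError:
    print(f"{hop_ip} has no PTR record")
```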
So it looks
like the event actually
unfolded at Amazon's edge,
which might mean that, you know,
something went wrong with
the control plane at the time —
maybe there was some, you
know, large event that
spiked the CPUs and stuff like that,
or there was some automated change
that actually
caused this, or there was
a typical engineering
change — you know,
maybe a configuration error
or something like that.
It's hard to know from this perspective,
but as we look into the event
in detail, we can quite clearly see
that it took
place at the Amazon edge.
Now, the percentages are not that high.
We're looking at, you know,
somewhere around a few
percent of packet loss,
which is probably the reason why not a lot
of people actually
complained about the region.
But I think that this particular
view is quite important
because it gives us, you know,
visibility into the
forwarding path, as part
of which you can see traffic going
to Amazon from the
agent on the left-hand side.
And, you know,
quite importantly, we see
the traffic flowing back
from Amazon —
from the US-East-2
region — towards the agent
that originated
the traffic in the first place.
Right? So why is that important?
It's important from the perspective
of the fact that the internet
is asymmetric, right?
The traffic that's going
in the forward direction
quite often is not going
to take the same path back, you know,
when the responses come back.
And in this particular
case we actually see that.
So here, if I just follow the arrows
and if I hover over these icons,
I can see that this loss was
again happening somewhere
down on, probably, the
Amazon backbone at this stage.
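Path asymmetry can be a little abstract, so here's a toy sketch comparing a forward and a reverse hop list to show how the two directions can diverge. The hop names are invented for the example, not taken from this event.

```python
# Toy illustration of path asymmetry: the reverse path doesn't have to
# mirror the forward path. Hop names are invented for this example.
forward_path = ["agent-seattle", "isp-edge", "transit-a", "aws-edge", "us-east-2"]
reverse_path = ["us-east-2", "aws-backbone-1", "aws-backbone-2",
                "transit-b", "isp-edge", "agent-seattle"]

shared = set(forward_path) & set(reverse_path)
only_on_return = [hop for hop in reverse_path if hop not in forward_path]

print("Hops seen in both directions:", sorted(shared))
print("Hops only on the return path:", only_on_return)
# Loss reported at a hop that only shows up in the reverse list (like the
# backbone hops here) tells you the problem sat on the way back.
```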
- [Mike] So we can see this
within the Amazon backbone
coming across there —
we're shown the
forward and reverse paths there.
So in terms of the impact: because
this is in US-East-2,
and locally it looks like it's at
the edge with the ISP, as
you've said a few times,
what type of impact
would we have experienced
if I'm hosting there, or,
let me just say,
what might a user have seen?
- [Kemal] Yeah, even
though the percentages
are quite small, we know
that packet loss tends to
cripple the throughput of flows,
right, so —
- [Mike] Yeah, absolutely.
- [Kemal] If you are
essentially hosting something
in US-East-2 at this
time and there's like 2%
packet loss or something
like that, you know,
your users might see
degraded performance
of your services, essentially, right?
Like, you know, they're trying to open
up certain webpages or
they're trying to transact
with the service that you're hosting
in this particular region,
and they're having a hard time
actually getting their experience.
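To put a rough number on why even a little loss hurts so much, here's the classic Mathis et al. rule of thumb for TCP throughput — throughput ≈ MSS / (RTT · √loss). This is a textbook approximation with illustrative inputs, not how ThousandEyes measures anything in this event.

```python
# Back-of-the-envelope TCP throughput per flow, using the Mathis et al.
# approximation: throughput ~= MSS / (RTT * sqrt(loss)).
# The MSS and RTT values below are illustrative, not from this event.
import math

MSS = 1460   # bytes, a typical maximum segment size
RTT = 0.070  # seconds, e.g. ~70 ms round trip

for loss in (0.0001, 0.001, 0.02):  # 0.01%, 0.1%, 2% packet loss
    bps = (MSS * 8) / (RTT * math.sqrt(loss))
    print(f"loss={loss * 100:5.2f}%  ->  ~{bps / 1e6:6.2f} Mbit/s per flow")
```

Even at 2% loss the per-flow ceiling collapses to a small fraction of what it is at negligible loss, which is why users "have a hard time" long before the loss percentage looks dramatic.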
Fortunately enough, this
was again, you know,
a small percentage of
packet loss, but still
significant enough, you
know, for our platform
to actually observe it
and notice it.
And then again, you
know, it goes without saying,
but visibility is key here.
You know, having this
kind of signal, coupled
with alerts and dashboards,
can be the difference
between that and
having to receive reports
about your problem
from the customers, which,
you know, at the end of
2022 is the worst possible way
of actually finding out about,
you know, performance
issues with your resources.
- End of 2022.
You're making me feel very old.
(both laugh)
It rushes right by, doesn't it?
Alright, that's really interesting.
And I love the way, again,
it shows up there —
keep it simple with the pictures.
I like this bidirectional
stuff we can actually see
there, and it really allows you
to definitively say this
is where the problem lies.
So it comes back to
identifying responsibility.
So thanks for that mate.
So that's our show.
Don't forget to subscribe
and follow us on Twitter. As
always, if you have questions or
feedback, we'll take all of it —
good, bad, or ugly — or any
guests you'd like to see.
As I said, anything you'd
like featured on the show,
please just send us a note
at the internetreport@thousandeyes.com
and that's also where new subscribers
can claim their free t-shirt.
Just send us your address and t-shirt size
and we'll get that right over to you.
So with that, thanks Kemal
really appreciate your time mate.
- Thank you so much Mike.
This was a lot of fun.
(upbeat techno music)