Twitter in the Elon Era + Microsoft & AWS Outages | Pulse Update
(upbeat ethereal music plays)
This is the Internet Report's
bi-weekly Pulse Update,
where we'll keep our finger
on the pulse of the internet
and see how it's holding
up week after week.
We often hear that the
internet is held together
with chewing gum and string,
and quite frankly, the truth
is only slightly less concerning.
Every other week, I'll
be back here sharing
the latest outage numbers,
highlighting a few interesting outages,
trends, traits, and general
health of the internet.
This week, I'm joined by my
friend and colleague, Kemal.
How are you, mate?
How's it going?
- Hey Mike, good to be here.
Thanks for the invite.
- Before we get started,
I do actually wanna welcome everyone.
This is a new podcast
and a new flavor of the internet report.
So, about this podcast:
we've been producing the show
in blog form for nearly a year now,
so we really thought it was about time
we verbalized it
and started sharing our findings
in podcast form.
But don't worry,
we're still gonna be producing
the deep dive internet report
anytime anything major happens.
The idea of this is just
to keep something regular
within your podcast that you
could actually listen to,
to understand what's going on
from the health of the internet.
Now, just before we get started,
in terms of housekeeping,
we'd love for you to
hit Like and Subscribe.
So you can do that and keep your finger
on the pulse of the internet every week,
'cause as I said, we're gonna be
putting this out regularly.
But any questions you've got, any ideas,
anything you want us to look into,
reach out to us anytime
at the internetreport@thousandeyes.com,
and we're happy to potentially
address those questions
in future episodes.
All right, let's dive in.
And listen before we start,
we're gonna look at the numbers
to see what happened this week.
(upbeat music)
So what we're looking at here
is the numbers, and one of the
interesting things, Kemal, that we've
started to see is this decrease
as we come down.
This is obviously going back
to early October there,
and we can see it start to increase.
What we're looking at is
the global outages observed,
and then we break that down
to a U.S. perspective as well.
Just really interesting to see the trends.
And the reason we're
breaking out the U.S. outages
is that, as I've looked at these numbers
over time, what I've actually
started to see is that
U.S. outages typically
account for about 38%
of all the observed outages.
So it just makes sense
for us to track those
and see what's going on there.
But the interesting thing I
really wanna call out here is
what happens as we get into
the November timeframe.
So we're getting into an area there —
this is kind of seasonal,
we see a drop
coming off from there.
So we see this decrease from 352 to 231,
and then it dropped
dramatically here to 222, right?
That's roughly a 33% drop.
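For anyone who wants to make the arithmetic behind that trend concrete, here's a tiny sketch of how a week-over-week percentage drop like that is worked out. The inputs are just the weekly counts quoted above, so treat the output as illustrative rather than an official figure.

```python
# Week-over-week change in observed outages, as a percentage decrease.
# The inputs are simply the weekly counts quoted in this episode.
def pct_drop(previous: int, current: int) -> float:
    """Percentage decrease from one week's outage count to the next."""
    return (previous - current) / previous * 100

print(f"{pct_drop(352, 231):.0f}% drop")  # roughly a third fewer outages
```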
That's not necessarily unusual,
it's actually quite seasonal,
but it's interesting that it's occurring
around that week of November 21st to 27th,
which for us in the...
well, me in the southern hemisphere,
you in the northern hemisphere —
neither of us is in North America,
but over there, that's Thanksgiving week.
- [Kemal] Yeah, it's
actually quite interesting
to see this dip for the
week of November 21st
till November 27th, right?
As we know, it's Thanksgiving week,
people are probably
taking their vacations,
and interestingly enough,
as you pointed out,
this is the period of time
when things are kind of slowing down
in terms of changes and stuff like that.
And companies are moving
towards change freezes,
to be perfectly honest.
If you think about it,
the holiday season is
when companies are
gonna have fewer employees
on call and fewer engineers
looking at this stuff,
so it kind of makes sense
that this dip we see,
especially for that November
21st to 27th week, happened, right?
So hopefully people were
enjoying their holidays
and, you know, having a good time, right?
- [Mike] I give thanks down here
because it means my email
inbox is kind of quiet
for that week.
So I can see the same thing going on there.
But it is interesting, and you talk
about the freezes we see there —
we see this sort of thing coming
across, and like I said,
if I look back seasonally, you know,
we've got this data going back years,
and we can actually see that
it kind of follows these
patterns over time.
And maybe not for this episode,
but something to dig into later on:
I'm also now starting to see
different patterns happen.
So as we come out of November,
I'm seeing a bit
of a rush in some areas,
and then I expect it to drop
off again over the holiday period.
But while we're at that point,
let's take a look at what
the internet has served
up for us in terms of outages,
and let's go under the hood with a couple
of interesting events.
(upbeat music)
All right, Kemal, so
having promised you
we'd dive into outages,
there really isn't
that much that's thematic
about the set of incidents
and degradations that we
wanna talk about today,
but there were some really
interesting idiosyncrasies
in there,
and I think they warrant
some investigation.
So let's jump in.
The first one we wanna talk
about is a general look
at the Twitter landscape
since the sale to Elon Musk.
And while nothing major happened —
again, I'll probably use that phrase a lot —
I do find it fascinating,
some of these little footprints
and traits we've actually
started to see there.
Just a quick note again,
before we dive into this:
if you're curious
about what we're seeing
and what we're showing here on the screen —
for those of you listening along —
and you want to dive into
these views yourselves,
we'll have various links
in the show notes, or you can
go to the pulse update blog as well.
They'll have the screenshots
and the actual views,
so you can see for yourself
what it is we're
going on about here.
So if we go back in time,
we can start to see where these
outages are occurring.
They're quite sporadic, but
as we get into this area
where the sale took place,
what we actually start to
see is the intensity starting
to increase, and we see
these periods here where
quite a lot of activity is
happening from an outage perspective.
As I said, again, these
are all kind of sporadic.
And what was really interesting was
that none of these were
that devastating
in terms of really impacting end users.
People were reporting lagginess,
people were saying
things were failing to load.
But what's really interesting
from our perspective was
the way we were able to
visualize this, to be able
to see what was going on.
So to start with, I saw these
outages occurring —
these hits and degradations in service —
and then was able to drill
down into each level.
The other thing I wanna say
about that, really, is that a lot
of people, when the
sale took place, were
saying it's gonna fall
over within three weeks.
We haven't seen any of that.
Like I said, we've seen
these degradations occurring
in these small instances,
which to a degree —
and I think you and I
discussed this offline before —
is almost understandable:
if I'm going in and starting to look
at something to understand what's going on,
I might wanna just turn things
on and off to see
exactly what's happening.
- [Kemal] Yeah, and the
other thing is, you know,
there was a lot of discussion on what went
on, and there were some
unfortunate events such
as layoffs and, you know,
other things that happened
during this timeframe.
But it actually
speaks volumes, the fact
that Twitter did not buckle, right?
It speaks to the really good SRE
and architectural practices
that they were following, right?
If you think about it,
reducing the workforce was quite
significant — you know,
there was a lot of churn as
a result of people's dissatisfaction
over there, you know,
and I was actually quite curious
about what this was gonna look like.
And as you pointed out,
these are just sporadic
outages taking place.
And yes, while
they were affecting people
to a certain degree,
nothing major happened.
And when I was reading some blog posts
from the people who
unfortunately departed the company,
they were writing about SRE practices
and good engineering practices,
and I was quite impressed
by what they were doing at the company.
So actually, if you think about it,
it doesn't surprise me that, you know,
the architecture and the
platform held up so well.
The last thing that
actually took place, as far
as I can recall, was
that route leak that happened
back in March of 2022,
which we actually covered
on the Internet Report —
but the thing is, that
was outside of their control.
So the fact that the platform held up
through all of this that
happened to the company, you know,
it actually speaks quite
a lot about how
good the architecture and
infrastructure actually is.
- [Mike] Yeah, absolutely.
So what we saw within the
system itself, like I said,
were 503 service unavailables,
where things were dropping out.
They were all really short
durations — you know,
a minute, two minutes at a time.
The longest one I think
I saw was 11 minutes,
which was something around
authentication issues,
which to a degree have been documented,
where there was a failure to send out
some two-factor authentication
codes via email.
But really, like I said,
everything was really short,
and it was within the application itself.
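For anyone curious what a run of short 503 windows like that looks like from the outside, here's a rough sketch of a periodic HTTP probe that simply logs status codes and timings. The URL and interval are made up, and this is far cruder than what a real monitoring agent does — it's just to illustrate the idea of catching one- or two-minute error windows.

```python
# Rough sketch: poll an endpoint and note when it returns errors.
# The URL and interval are hypothetical; a real agent does far more
# (multiple vantage points, waterfalls, path data, and so on).
import time
import requests

URL = "https://example.com/home"  # placeholder target
INTERVAL = 60                     # seconds between probes

while True:
    start = time.time()
    try:
        status = requests.get(URL, timeout=10).status_code
    except requests.RequestException as exc:
        status = f"error: {exc.__class__.__name__}"
    # A short burst of 503s across one or two samples would look like
    # the brief service-unavailable windows described above.
    print(f"{time.strftime('%H:%M:%S')} status={status} "
          f"took={time.time() - start:.2f}s")
    time.sleep(INTERVAL)
```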
So one of the things I
found really interesting
was how this manifests itself
in very small areas
within the application itself.
So this is a really cool one.
As I think I've said
to you many times, I'm a simple man,
I like looking at pictures.
This one was really beautiful
as it came down.
So what we're looking at
here is an individual transaction
where we're actually going to the site.
And this is where we saw one
of these outages occurring —
one of these degradations
in service — and what we see down the bottom
in the waterfall is a beautiful staircase.
I could actually walk
down those steps quite nicely
into the garden there,
these steps
coming in from there.
But what it is is a whole
series of redirects.
So it's simply just redirecting to itself.
Basically it's looking
for something, it's redirecting,
it gets back to itself and it says,
no, I'm not the right place,
and sends the request straight
back to itself again.
So it gets stuck in this loop,
and eventually it times out —
in our case it times out
from a test perspective —
but also what happens is
you get too many redirects
and the system actually stops itself.
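To make that redirect-loop behaviour concrete, here's a minimal sketch of how an HTTP client bails out of a loop like this. It uses Python's requests library with a hypothetical URL — it's not how the ThousandEyes transaction test is implemented, just an illustration of the timeout and too-many-redirects failure modes described above.

```python
# Minimal sketch of a client giving up on a redirect loop.
# The URL is hypothetical and this is not ThousandEyes' test logic.
import requests

session = requests.Session()
session.max_redirects = 10  # cap how many 3xx hops we're willing to follow

try:
    resp = session.get("https://example.com/home", timeout=30)
    print(resp.status_code, resp.url)
except requests.exceptions.TooManyRedirects:
    # The server kept redirecting back to itself, so the client stops
    # rather than walking the "staircase" forever.
    print("Gave up: too many redirects")
except requests.exceptions.Timeout:
    # Or the request simply runs out of time, which is what the
    # transaction test did in the example above.
    print("Gave up: request timed out")
```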
But if I just quickly go
back to before that, just to show the
contrast between what happened,
this is immediately before,
and this is what we'd expect to see
from a page load — a
waterfall, I should say.
We actually see it
come down really nicely,
quite parallel, with those
processes happening in turn.
So these are just little
glitches. What you'd have seen
in this instance would've
been maybe some
laggy performance.
It would've just
delayed for a minute, and
then my session would've
connected from there.
So all these little bits of
fiddling around to
see what was happening —
I just find it kind of fascinating
to see what was there.
- [Kemal] Yeah, it's funny
that they call this view a waterfall,
you know, this particular
way of looking at the objects,
but this actually does look
like a waterfall to me, so. (laughs)
- [Mike] It does, doesn't it.
Nice waterfall coming
down through from there.
Okay, so having promised
you outages, Kemal,
now we're actually gonna get into one.
This is a Microsoft Office 365 outage —
or actually a Microsoft outage —
that occurred on December the 2nd
and impacted mostly people
in the APAC region,
which is why it's
close to my heart.
It actually was
quite significant:
about an hour and 20 minutes
that we saw
it occurring.
- [Kemal] So it's actually
quite interesting to see how
this event unfolded.
You know, if you look at the timeline view
from the application outages perspective
and you move forward,
you will see how the application
had an outage affecting
Tokyo users in Japan, right?
And as we progress throughout the event,
you will see that more and more users
within the Asia-Pacific region
actually start getting impacted.
And as you can see here,
we have quite a beautiful representation
of what we're just speaking about.
So on the right-hand side you
can see people in Singapore,
Hong Kong, Kuala Lumpur, and
Tokyo being affected, right?
And you know, to Microsoft's credit,
they actually publicly
announced what happened.
So there was
a root cause analysis
and stuff like that.
And it turns out that this was the result
of legacy code that was not able to
process requests in time.
And, you know, quite interestingly,
the solution for this particular
event was, you know,
just to reload, right?
- [Mike] That's right.
Yeah, yeah, absolutely.
That's really cool, and as you say,
it was the legacy system unable to
process things in time there,
and if we look at the
errors we were seeing,
they were really just timeouts.
So the system was failing.
And what I like about this
is the moment in time — so, you know,
this was occurring at 10 to 1 UTC,
which is actually 10 to 10
Japan Standard Time, which,
you know, perhaps explains why it was
affecting the infrastructure
in that region, and is also why
they noticed it:
it was in the middle of
their working day when
the outage happened.
- [Kemal] Exactly.
- [Mike] But the, yeah, go on, sorry.
- [Kemal] And we can quite clearly see
that it affected, you know,
a lot of users. We've said it multiple times:
there's this paradigm
shift as part of which
the cloud is our new data center, right?
And software-as-a-service
applications are, you know,
the way forward, as part of
which everyone's
either using software-as-a-service
applications
or they're, you know, moving
towards the infrastructure-as-a-service
model. And, you know,
this being a SaaS suite
of applications, the outage
probably was negatively felt
by the users in the region.
- [Mike] Yeah, absolutely.
And to your point there
about it being SaaS and
everyone relying on it,
which kind of throws me back
to my numbers at the start —
and I won't go back into the details,
but we talked about the change
freezes there.
This is also an indication
that there's now a huge reliance
on the internet, not
just for people at home but
also for businesses themselves.
You know, we had an issue
here with a SaaS application,
and this sort of dramatic effect from it.
And this moment in
time here is where the actual
turn-it-off-and-on-again
started to occur
and really started to
clear the queues down.
As you said, Microsoft were
really quite open about this.
They said that they had this issue
with legacy infrastructure —
a process, sorry,
a legacy process there
around token authentication —
and then they actually
did this reset to
effectively clear the queues,
move workloads off, and
shift things around.
And we can see again my
step process comes back in,
and we see it recover from there.
But as you say,
we saw the complete outage, and again
a beautiful picture is painted.
- [Kemal] Agreed.
- [Mike] Let's move on to our last outage
for this week. On December the 5th,
just after 2:30 PM Eastern Standard Time,
AWS's Ohio-based US-East-2 region
experienced connectivity issues.
What do we see, Kemal?
- [Kemal] Yeah, so
what we're actually looking
at is intermittent spikes
in packet loss that affected connectivity
towards the US-East-2 Amazon region,
which is quite a large
region, for that matter.
So if we look at the path visualization
for this particular test, at approximately
7:30 PM UTC on the 5th of December,
we see that multiple agents —
in fact, 18 agents that were assigned
to this test — are executing everything
correctly, you know,
everything is working fine.
However, at approximately
7:40 we can start seeing
certain red circles,
which actually means that
some of these hops
were experiencing what
we call forwarding loss.
So if I move forward
within the event timeline,
you can quite clearly see
that we have this chunk
of, you know, red circles
towards the right-hand side,
which essentially indicates
that traffic going towards the AWS region —
US-East-2 in this particular case —
was having some problems. Now,
if we zoom in a little bit on this event
and select one of the
agents, such as the Seattle agent
in CenturyLink,
from the agent dropdown menu,
we're gonna start seeing
values for that one.
And if I select that one
as the only agent for
that particular view,
we're gonna see how this event unfolded.
And as you can see, on the left-hand side
we see the Seattle agent deployed
in CenturyLink, and the traffic was going
towards the agent that's
deployed in AWS, in US-East-2.
And if I hover over this red circle,
which tells me where the
forwarding loss was experienced,
I can see quite an
interesting thing, in the form
of the fact that the
reverse DNS lookup tells me
that this was a peering
link between Level 3 —
which is a tier one provider,
autonomous system number 3356 —
and Amazon, AS 16509, right?
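The reverse DNS trick Kemal mentions here is easy to try yourself. Here's a small sketch of looking up a hop's PTR record to get a hint about which networks a link sits between — the IP address below is a documentation placeholder, not the actual hop from this event.

```python
# Sketch: reverse-DNS a traceroute hop to hint at who owns it.
# The address is a placeholder (TEST-NET), not the real hop from this event.
import socket

hop_ip = "192.0.2.1"

try:
    hostname, _, _ = socket.gethostbyaddr(hop_ip)
    # PTR records on peering links often embed both networks' names,
    # which is the kind of clue described above.
    print(f"{hop_ip} -> {hostname}")
except OSError:
    print(f"{hop_ip} has no PTR record")
```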
So it looks
like the event actually
unfolded at Amazon's edge,
which might mean that, you know,
something went wrong with
the control plane at the time —
maybe there was some, you
know, large event that
spiked the CPUs and stuff like that,
or there was some automated change
that actually
caused this, or there was
a typical engineering
change — you know,
maybe a configuration error
or something like that.
It's hard to know from this perspective,
but as we look into the event
in detail, we can quite clearly see
that it took
place at the Amazon edge.
Now, the percentages are not that high.
We're looking at, you know,
somewhere around a few
percent of packet loss,
which is probably the reason why not a lot
of people actually
complained about the region.
But I think that this particular
view is quite important
because it gives us, you know,
visibility into the
forwarding path, as part
of which you can see traffic going
to Amazon from the
agent on the left-hand side.
And, you know,
quite importantly, we see
the traffic flowing back
from Amazon —
from the US-East-2
region — towards the agent
that originated
the traffic in the first place.
Right? So why is that important?
It's important from the perspective
of the fact that the internet
is asymmetric, right?
The traffic that's going
in the forward direction
quite often is not going
to take the same path back, you know,
when the responses come back.
And in this particular
case we actually see that.
So here, if I just follow the arrows
and if I hover over these icons,
I can see that this loss was
again happening somewhere
down on, probably, the
Amazon backbone at this stage.
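Path asymmetry can be a little abstract, so here's a toy sketch comparing a forward and a reverse hop list to show how the two directions can diverge. The hop names are invented for the example, not taken from this event.

```python
# Toy illustration of path asymmetry: the reverse path doesn't have to
# mirror the forward path. Hop names are invented for this example.
forward_path = ["agent-seattle", "isp-edge", "transit-a", "aws-edge", "us-east-2"]
reverse_path = ["us-east-2", "aws-backbone-1", "aws-backbone-2",
                "transit-b", "isp-edge", "agent-seattle"]

shared = set(forward_path) & set(reverse_path)
only_on_return = [hop for hop in reverse_path if hop not in forward_path]

print("Hops seen in both directions:", sorted(shared))
print("Hops only on the return path:", only_on_return)
# Loss reported at a hop that only shows up in the reverse list (like the
# backbone hops here) tells you the problem sat on the way back.
```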
- [Mike] So we can see this
within the Amazon backbone
coming across there —
we're shown the
forward and reverse paths there.
So in terms of the impact: because
this is in US-East-2,
and locally it looks like it's at
the edge with the ISP, as
you've said a few times,
what type of impact
would we have experienced
if I'm hosting there, or,
let me just say,
what might a user have seen?
- [Kemal] Yeah, even
though the percentages
are quite small, we know
that packet loss tends to
cripple the throughput of flows,
right, so —
- [Mike] Yeah, absolutely.
- [Kemal] If you are
essentially hosting something
in US-East-2 at this
time and there's like 2%
packet loss or something
like that, you know,
your users might see
degraded performance
of your services, essentially, right?
Like, you know, they're trying to open
up certain webpages or
they're trying to transact
with the service that you're hosting
in this particular region,
and they're having a hard time
actually getting their experience.
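To put a rough number on why even a little loss hurts so much, here's the classic Mathis et al. rule of thumb for TCP throughput — throughput ≈ MSS / (RTT · √loss). This is a textbook approximation with illustrative inputs, not how ThousandEyes measures anything in this event.

```python
# Back-of-the-envelope TCP throughput per flow, using the Mathis et al.
# approximation: throughput ~= MSS / (RTT * sqrt(loss)).
# The MSS and RTT values below are illustrative, not from this event.
import math

MSS = 1460   # bytes, a typical maximum segment size
RTT = 0.070  # seconds, e.g. ~70 ms round trip

for loss in (0.0001, 0.001, 0.02):  # 0.01%, 0.1%, 2% packet loss
    bps = (MSS * 8) / (RTT * math.sqrt(loss))
    print(f"loss={loss * 100:5.2f}%  ->  ~{bps / 1e6:6.2f} Mbit/s per flow")
```

Even at 2% loss the per-flow ceiling collapses to a small fraction of what it is at negligible loss, which is why users "have a hard time" long before the loss percentage looks dramatic.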
Fortunately enough, this
was again, you know,
a small percentage of
packet loss, but still
significant enough, you
know, for our platform
to actually observe it
and notice it.
And then again, you
know, it goes without saying,
but visibility is key here.
You know, having this
kind of signal, coupled
with alerts and dashboards,
can be the difference
between that and
having to receive reports
about your problem
from the customers, which,
you know, at the end of
2022 is the worst possible way
of actually finding out about,
you know, performance
issues with your resources.
- End of 2022.
You're making me feel very old.
(both laugh)
It rushes right by, doesn't it?
Alright, that's really interesting.
And I love the way, again,
it shows up there —
keep it simple with the pictures.
I like this bidirectional
stuff we can actually see
there, and it really allows you
to definitively say this
is where the problem lies.
So it comes back to
identifying responsibility.
So thanks for that mate.
So that's our show.
Don't forget to subscribe
and follow us on Twitter. As
always, if you have questions or
feedback, we'll take all of it —
good, bad, or ugly — or any
guests you'd like to see.
As I said, anything you'd
like featured on the show,
please just send us a note
at the internetreport@thousandeyes.com
and that's also where new subscribers
can claim their free t-shirt.
Just send us your address and t-shirt size
and we'll get that right over to you.
So with that, thanks Kemal
really appreciate your time mate.
- Thank you so much Mike.
This was a lot of fun.
(upbeat techno music)