Scaling To Meet the Black Friday Demand: Tips for IT Teams

BARRY COLLINS: Hi everyone, and welcome back to The Internet Report, where we uncover what's working and what's breaking on the Internet—and why. In the U.S., Thanksgiving and Black Friday are almost upon us, and so this week we're here with another special episode covering Black Friday best practices and tips for delivering great user experiences and minimizing downtime during this key period for the retail industry. 

We'll also cover helpful case studies from Black Fridays that have experienced hiccups in the past and what you can do to guard against similar disruptions. 

I'm Barry Collins and I'll be your host today along with the amazing Mike Hicks, Principal Solutions Analyst at ThousandEyes. How are you doing, Mike?

MIKE HICKS: I'm doing great, Barry, and as I said, I'll never get tired of that introduction.

BARRY: Well, let's get on with the show. As usual, we've got the chapter links in the description below, so if you want to jump to the section that interests you most, you can do that. And we'd also love you to hit the like and subscribe and always feel free to email us at internetreport@thousandeyes.com. We welcome your feedback and questions. 

So, Black Friday is probably the busiest day of the year for many Internet retailers. Tell us about the amount of planning that goes into keeping sites up and running during this time, Mike.

MIKE: Yeah, absolutely Barry. Obviously things have changed dramatically, and as you said, it's grown bigger and bigger, now flowing into Cyber Monday as well. It's become this whole journey, stretching back as far as early October. What we're now starting to see is retailers testing the waters, putting offers out early to see whether people come looking for prices.

So the planning, if we go back historically and look at the early days, was really just about the capacity issue. It wasn't restricted to the U.S., but that's where the main focus was, so you could have your content and everything centered in one location and serve from there.

And the primary focus really was: do I have enough capacity? Do I have enough access into that location? If I get hit by all these people at once, how is everyone going to get in? That's where the attention was concentrated, and really it became just a load exercise.

Then as that evolved, we started to have this more distributed architecture, and as the user base grew, the planning changed. We now have to consider a number of things. Part of the early testing is working out which products people are going to be interested in. It comes in seasons, and we'll get onto the global aspect of it in a little while. But what products will I need to sell? I have to get that inventory in and make sure I have enough stock so I can actually fulfill orders. We also have to consider the logistics of getting things out.

But the primary concern for eCommerce organizations is the speed of loading that page and getting the information to the user. And again, I'm very old, so going back to the early days, people talked about eight seconds: if a website took that long to load, users would go somewhere else. I can't remember the latest figures, but it's down to around three seconds. And because Black Friday is now so competitive, with so many players, you're talking about a window of roughly three seconds, and not just for that page to load. So it's no good looking only at availability, and this is a mistake people have made. They've got the availability, but the question is how quickly a shopper can get on, see the product they want, and have the site serve it, because that starts reaching into the backend.

So I then have to start considering load testing and things like chatbots, because I want to keep those people engaged. I don't want them to bounce, I don't want people to move away. So all of a sudden I'm adding another dependency on top, another bit of complexity. The testing that needs to be done has moved dramatically from essentially just load testing to whole-of-journey functional performance testing, because there are always going to be bottlenecks in those areas. And because I'm coming at it from a distributed perspective, what I'm really concerned with is degradation in performance, because that's going to dramatically cost me money. Yes, my site can go down, and I'm going to sound really silly now, but it's comparatively easy to build for resilience. The Internet itself is fairly resilient: if a path goes down, it may affect local users, but essentially it's not going to take out my eCommerce site. I can have redundant architecture. But it's putting all of that together, so the testing today has to cover that complete picture.

A bit of a war story on one of those: an organization I was working with didn't actually do that. They'd moved on from those early days, gone beyond load testing, and had the functional testing in place. But what they hadn't considered was a single point of aggregation in the backend of the system. When one particular item became very popular, they hadn't accounted for everybody trying to run the same search at the same time. Rather than having the item they wanted to promote sitting on the front page, serving it involved a database search on the inventory, an API call to an inventory server, to pull it back.

So all of a sudden they had this bottleneck, because they hadn't tested that workflow right through; all they'd done and verified was the front end. What happened was that people couldn't get the item, the search was just timing out. And this product wasn't unique, which meant shoppers could get it somewhere else. It might have been a few dollars more, but they could actually get it there, rather than sitting and staring at the spinning wheel of death.
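As a loose illustration of that single aggregation point, here's a minimal Python sketch. The inventory call is a hypothetical stand-in rather than the retailer's actual system; it simply shows why caching the lookup for a heavily promoted item keeps every search from hammering the backend at once.

```python
import time
from functools import lru_cache

def query_inventory_service(sku: str) -> dict:
    """Hypothetical stand-in for the backend inventory API; every uncached
    search pays the cost of this call."""
    time.sleep(0.5)  # simulated round trip to the inventory server
    return {"sku": sku, "in_stock": True}

@lru_cache(maxsize=1024)
def cached_stock_lookup(sku: str) -> bool:
    """Serve repeated lookups for a hot item from cache so a promoted product
    doesn't turn the inventory service into a single point of aggregation."""
    return query_inventory_service(sku)["in_stock"]

if __name__ == "__main__":
    start = time.perf_counter()
    for _ in range(100):  # 100 shoppers searching the same promoted item
        cached_stock_lookup("HOT-ITEM-001")
    elapsed = time.perf_counter() - start
    print(f"100 lookups took {elapsed:.2f}s (only one backend call was made)")
```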

BARRY: And you mentioned it a moment ago, but it's not just the front end of the sites that you need to worry about these days. You've got inventory, logistics, distribution, warehouses. So how do companies think about that big picture and making sure everything is connected and online?

MIKE: That's a great point, and there's something else in there too, because we're now dealing with people and processes as well. Think about logistics. I joke about it all the time: living out here in rural Western Australia, next-day delivery means the next day they're prepared to deliver it. But that's not the expectation for much of the rest of the world. It's got to be there straight away.

So I've got to have all these systems tied together. Think about the whole supply chain, starting with me purchasing that thing. People want instant gratification: I've ordered it and I want it there as soon as possible. You do have time to return it, so buyer's remorse and those types of things may kick in, but I want it sent out pretty much instantly.

So again, we've got to tie into the backend system for inventory, but we've also got to tie into logistics. Do I have a courier available to deliver to this area? These are all checks that have to happen up front: before I even take that order, am I going to be able to deliver it? Because some of these organizations, to try and beat the competition, may offer something like, and I'm making these figures up, “we'll deliver it within four hours or we'll give you a 10% discount on your next sale.” To be able to offer that, since it has financial implications, I have to make sure all these things are connected together.

Now, a lot of these organizations have their own logistics systems in place, but what I'm now introducing is another B2B aspect, another third-party system I have to interface with. So I've got to be able to verify their systems. How far do I go into their system to make sure they've got drivers available, those types of things? There might be SLAs involved, but it's this linking together. And it's interesting, because when we deal with retail, we're dealing with physical products. Software can be a part of it, but essentially the bulk of it is physical items. As soon as I do that, I'm involving more than just technology. I'm not just spinning up another web service or another microservice or another instance somewhere to spread the load. I've actually got to make sure I have people and processes linked together, so this whole chain has to work seamlessly.
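As a rough sketch of the pre-order checks described above, the chain might look something like the following. The endpoints, parameters, and field names are hypothetical illustrations, not any particular retailer's or courier partner's API.

```python
import requests

# Hypothetical endpoints, used only to illustrate the shape of the checks.
INVENTORY_API = "https://inventory.example.internal/api/v1/stock"
COURIER_API = "https://courier-partner.example.com/api/v1/coverage"

def can_fulfil(sku: str, postcode: str, timeout: float = 2.0) -> bool:
    """Before accepting the order: is the item actually in stock, and does the
    third-party courier have drivers covering this delivery area?"""
    stock = requests.get(f"{INVENTORY_API}/{sku}", timeout=timeout).json()
    if not stock.get("in_stock", False):
        return False
    coverage = requests.get(
        COURIER_API, params={"postcode": postcode}, timeout=timeout
    ).json()
    return coverage.get("driver_available", False)

# Example: only surface the "deliver within four hours" promise if both checks pass.
if can_fulfil("SKU-12345", "6000"):
    print("Offer express delivery at checkout")
```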

BARRY: And it might be relatively easy to load test your own sites, but load testing against third parties is potentially a trickier job.

MIKE: Yeah, absolutely. We talk about load testing, but it's really functional performance. Load is a factor in it, and it can grow exponentially, but to your point, I can't load test someone else's system, because I can't really load test it without it becoming destructive. If I start putting loads of queries or transactions through, I'm doing a couple of things.

One, I'm increasing the load and therefore getting in the way of their actual business during this proactive test, and I can't necessarily predict how much traffic they'll have at any one time.

But also, as I said, it's destructive: I'm placing an order on their systems. They might be taking something off the shelf, moving it down to the warehouse, and then having to return it to the shelf.

So what I have to be able to do is devise a test that exercises that functional performance without becoming destructive. It has to test the backend, and it has to be able to test that third party. And pretty much no web application exists today without some sort of API functionality. So what I'm actually doing at that point is making a call to an API, but it's not enough to ask whether that API is up and available. It has to work. This is where you can devise tests that look at, say, the inventory system, and that can be as simple as a lookup.

So there's a certain amount I can test in someone else's system without being destructive, and that's really just asking: do you have this item in stock? That doesn't change anything. It then lets me say, okay, I'm pretty confident that when we run this it's going to complete within two seconds. So for users located in this part of the world (and we are in a different world down here) that means I'm going to meet the requirements, I'm going to be able to transact, and I'm going to make money on it.
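To make that concrete, here's a minimal sketch of the kind of non-destructive synthetic check being described: a read-only stock lookup asserted against the roughly two-second target. The endpoint, SKU, and response fields are hypothetical stand-ins, not a real partner API.

```python
import time
import requests

STOCK_API = "https://partner.example.com/api/v1/stock"  # hypothetical third-party endpoint
THRESHOLD_SECONDS = 2.0  # the rough performance target discussed above

def synthetic_stock_check(sku: str) -> bool:
    """Read-only functional test: places no order and changes no state, but
    proves the third-party workflow responds correctly and fast enough."""
    start = time.perf_counter()
    resp = requests.get(f"{STOCK_API}/{sku}", timeout=THRESHOLD_SECONDS)
    elapsed = time.perf_counter() - start
    healthy = resp.status_code == 200 and "in_stock" in resp.json()
    return healthy and elapsed <= THRESHOLD_SECONDS

if __name__ == "__main__":
    print("PASS" if synthetic_stock_check("SKU-12345") else "FAIL")
```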

BARRY: You say you're in a different part of the world. I don't know about you down there in Perth, Mike, but certainly here in London, Black Friday is a big event now in a way that it wasn't 10, or even five, years ago.

MIKE: Yeah.

BARRY: How big a challenge is that for the retailers now that this has become a wide-scale global event really?

MIKE: It’s grown. Again, I might be dating you here, Barry, but if you go back to our day, we had the Boxing Day sales, the day after Christmas. People would queue up outside the store, outside Harrods or wherever it happened to be, and then rush in to get that £10 television set. That's largely gone away.

We've gone to this global thing; we can access it from anywhere, and that has made it more accessible to everybody. People start looking, as I said, in early October, sniffing things out, and the vendors themselves use that to work out what's going to be interesting.

But the other complexity that adds, as you mentioned, is the length of time. We go from October right through to Cyber Monday, so we have this whole stretch, a sort of Black Week as it were, where everyone is looking and focusing on this.

And then we have these global implications. It's summer down here in the Southern Hemisphere now, so we're not going to be buying jumpers and scarves and those types of things. We're going to be looking for different merchandise. So the content I push out has to be different, because remember, this is a hugely competitive period. I have to make sure the right content is served to the right region, that it's available, and that my page load and everything else comes in within those two seconds we arbitrarily talked about. The time of day matters too, because during this period, and I've seen this across a couple of vendors over the years, they're doing so many pushes to their website, and they reckon the millisecond-level improvements they're getting have a direct impact on the bottom line. They're making tens of thousands to millions of dollars based on this.

And in doing these pushes, they're not necessarily optimizing the code; they're shifting products around. They're moving this item to the front, making sure that person doesn't wait so long for the chatbot. This is happening all the time. Now, obviously I can't have these developers, these DevOps people, sitting online around the clock, so I have to have a dynamic workforce looking after this system for a distributed, dynamic world.

And then on top of that we have the time zones. I was speaking to a customer just this morning who raised a very interesting point. As we've discussed on the Pulse Update, a lot of the maintenance outages, a lot of the outages we see, occur during the middle of the night, because that's traditionally a quiet time for the customer base, so the ISPs can make a change with minimal disruption.

What this customer was telling me was that their user base operates overnight: they start hitting the application between 12 and 4 AM, and that's when they get their heaviest loads. Now bring that back to an eCommerce situation where I have a global audience. I'm GMT+8 at the moment, so I effectively have a 12-hour time difference with part of my market. During my middle of the night, when I might want to do a maintenance update or I've scaled back the pushes on the assumption that people are offline, there's a user base on the other side of the world that's wide awake. So I could be missing out on a whole bunch of sales because I don't understand the demographics. Or if something dramatic happens and we want to push a product up onto that page, I'll miss the opportunity unless I'm constantly watching and tracking these different trends. It's not a one-size-fits-all.

And then you have this other aspect I've talked about: digital equity. What I mean by digital equity is that if I'm sitting down here in Perth, Western Australia, I expect access to the same applications, with the same performance, as if I were sitting at headquarters in San Jose. It's a similar thing we're facing with eCommerce over Black Friday and Cyber Monday: users expect the same performance everywhere. In a workplace, the impact of not getting it is that I might be a bit disgruntled, or I might have to work later into the evening.

On an eCommerce site, the impact is that I've lost the sale; the shopper will go somewhere else. So maintaining that performance is everything. Where I'm heading with this long ramble is that I need to look at this proactively. I can do as much testing as I want ahead of time, again without being destructive, and get a pretty good idea of what's going to happen. But I'm dealing with all these unknowns and this dynamic environment, so I need a way of monitoring it proactively, right across from the application I'm building and the trends in how people use it, down to the underlying network performance, and getting an overall view that brings everything together.

BARRY: And just finally for this section, Mike, you mentioned the issue of huge demand on everyone's servers at different times of the day. And we know that the big Internet retailers rely on geographically distributed load and backups to ensure that if there's a problem in one region, it can be covered in the next. How big a challenge is that on Black Friday, when there's huge demand everywhere across the world?

MIKE: Yeah, it's a challenge for everybody really, but you've made an important point: we're talking about a specific moment in time. I can make an estimate of what's going to happen, but I've got to respond dynamically. If I think about normal operations, we architect for resilience, we architect for redundancy, and we can shift workloads around. And as we've talked about on the Pulse Update, single points of aggregation can effectively cause this sort of degradation.

But in normal operations I'm effectively dealing with a static environment. What I mean by static is that I know I'm going to run payroll every month, I know the accounts close at the end of the year. The systems themselves are predictable; the only real variable is the carriage, the Internet in between. So I can plan with a better sense of what's coming, and I can replicate: if I think about how I distribute content through a CDN, there are geographic locations I should place it in, but those tend to be fixed in advance. All of a sudden, when I'm dealing with this period of the year, with Black Friday and Cyber Monday, I've got to add that dynamic nature in as well.

I can plan; I can predict that a certain item is going to be the best seller. Going back in history, the Cabbage Patch doll might have been the big-selling item. But then something comes out of left field, and I've structured all my CDNs, my databases, my replication to fit around the wrong assumption, and all of a sudden that changes. Now I've got to shift workloads around. I can do a certain amount of prefetching and push content out, but all of that has to change.

And then there's another variable that comes into play: the payment gateways. From a retail perspective, what if that changes? What if there are currency fluctuations or those sorts of things that come in overnight? Retailers will tend to fix their prices, but it's another variable, because this is such a dynamic part of the year. I'm dealing with protocols on top of protocols, with the unpredictability of user behavior as well as the unpredictability of the Internet carriage itself.

BARRY: So as we're bringing the show to a close, Mike, and it's Thanksgiving season, I wonder if you'd mind taking a minute or two to tell us a few things that you're thankful for from the networking landscape over the past year. Let's start with your first one.

MIKE: Yeah, absolutely. I'll probably cheat a bit and go further back than a year. But the first thing that came to mind when I started thinking about this was self-service applications. Think about banking applications: the way I can sit here at my desk when I need to move money, transfer funds, pay anybody, do all those types of things.

And this extends right across; it ties back a little to the eCommerce things we were talking about. I can sit here at my desk, decide I want a new monitor, order it, and tomorrow it rocks up. All of that I find really useful.

And then even when I go to the shops, there's the self-service checkout. I don't need to converse with anybody; I can just go through, get my items, and scan them myself. It makes me thankful in a couple of ways. One, I don't need to talk to other human life-forms; I can do everything from my desk. But on top of that, it's all the things in the backend that I just marvel at, the fact that all these different technologies have to be linked together. And why I'm thankful for it is that, if I go back far enough, it was really technology that drove our requirements. What was available dictated what we could deliver: we had a particular encapsulation, and I'm going back to X.25, frame relay, ISDN from a network perspective, and that constrained what we could do with it.

Now it's the user who has become king, and the user drives the technology, and cloud technology has allowed us to do that. Because we have cloud facilities, we have the ability to set things up quickly; we're able to be agile and work in a dynamic way. I can link all these businesses together. I don't have to be a financial organization; I can offload payments to a payment gateway and just get the receipts back. That has generated all these other industries, all these other services. Me as a user is a beneficiary of that, and me as an engineer just sits back and applauds what a fantastic infrastructure we have.

BARRY: Okay, Mike, what's number two?

MIKE: Number two is one that's very near and dear to me, and that's hybrid work. I've worked remotely for way too many years; if I said how many, it would probably give away my age. I'm actually younger than this. I've just had a very hard life.

BARRY: 25, aren’t you?

MIKE: Exactly, exactly. The concept of hybrid work has really moved on. A few years back, if I was working remotely, I'd dial into a VPN; going back further, I had a 56k dial-up modem. You had all those hoops to jump through to get in, and my performance would be hit or miss. And from the organization's perspective, the attitude was, "well, you chose to live there or be remote." They would allow me to do it, but you'd potentially miss out on opportunities.

But then this hybrid work situation came in. When we talk about hybrid work, I actually work from my office here, but I could be anywhere in the world doing this job. And I have, as I said before, that digital equity: access to the same applications, with the same performance, as if I were sitting in HQ. I don't know if we've mentioned this before, but a really interesting thing came out of this because of the bandwidth that's now available. Actually, I'll exclude myself from that, because I'm connected by a piece of wet string.

What they found when hybrid work started to take off was that people going back to the office were moving onto a shared medium, and they weren't getting the performance they expected from their remote connectivity. At home they had fiber to the curb and all those types of things, and they weren't on a shared link; they were literally a branch of one. All of a sudden they were back on a shared system. So everything has really kicked on and moved to this hybrid environment where you can do what you need from wherever you are. We're doing this podcast 12,000 miles apart and we can converse with very little delay. That's something I'm really grateful for: hybrid work is now socially acceptable, and the technology has caught up and made it genuinely possible.

BARRY: I guess if we want to bring it bang up to date for the past year, there are services like Starlink. We've had satellite broadband for many years, but now the level of bandwidth just keeps going up and up. That really does make it practical for people to work from pretty much anywhere on the planet.

MIKE: Yeah, absolutely. One of the good things with Starlink operating in LEO, low Earth orbit, is that it obviously drops the latency right down. I come from the world where we dealt with a lot of geostationary satellites, so you had that long loop up to the satellite and back down, and all the latency that goes with it. Bringing the satellites down much closer to Earth all of a sudden means the latency is far lower.
Plus, with the constellations of low Earth orbit satellites, it's absolutely become easier. Again, I'm out here in a horse paddock and I can still get connectivity wherever I am.

And it's interesting, to your point, because we're now dependent on it. If I don't have that connection anymore, all of a sudden it's like my arm's been cut off. I look at my daughter: if her phone isn't connected, it's like she almost can't breathe. We've become reliant on it, but again, the technology has facilitated that as well.

BARRY: Okay, Mike, so give us your third choice.

MIKE: Third choice, and this is where I'm cheating a little because it goes back in time, but the biggest thing I'm grateful for, the one I've been building up to, is the Internet, because it has made all of this possible. It underpins everything, and the reason I'm thankful for it is that it's a real demonstration of community. If you think about how the Internet is made up, it's all these autonomous systems joined together, passing traffic between each other, which means we have resilient, redundant paths: if I can't get one way, I can go another. It's linked by submarine cables. The technology and the way everything works seamlessly together means that if I'm looking something up, I can get it immediately. It's ruined pub quizzes, but I can instantly get the information. And this is all because of the Internet underpinning it. Obviously it's also my job, and I love looking at the patterns that come out of it, but that's another reason I'm thankful for the Internet: it underpins essentially everything we do.

BARRY: So that's our show. Don't forget to like and subscribe, and you can follow us on X @thousandeyes. As always, if you've got any questions or feedback, send us an email to internetreport@thousandeyes.com. 

We hope you enjoyed this special Black Friday episode. And if you're looking for a regular podcast to help keep your finger on the pulse of the health of the Internet, then check out our bi-weekly Internet Report: Pulse Update podcast series. We've included the link on the screen above and in the description box. 

Thanks again for tuning in today. Thanks again to Mike, and we'll catch you soon. Goodbye.

We want to hear from you! Email us at internetreport@thousandeyes.com