The Performance Beacon

The web performance, analytics, and optimization blog

A 7-part guide to sanity-checking the hysteria behind holiday website outages


7 things to remember about outages and web performance

At some point, every website goes down. The bigger the site, the more likely its outage is to make the news. The tech media loves outage-related headlines, and Thanksgiving weekend is a predictable feeding frenzy.

I have no interest in performance-shaming sites that experienced issues over the long weekend. I have a ton of respect for the people who work hard behind the scenes to keep sites running. Most of the folks I know work in healthy environments where the finger-pointing is kept at bay. For those who don’t, here are some thoughts to sanity-check the hysteria behind holiday outages.

1. Remember that every site fails eventually

There’s no such thing as 100% uptime. When a site goes down, it isn’t because someone just forgot to flip a switch. It’s because modern websites are complex mechanisms with countless moving parts. Any complex system will fail eventually.

A few years ago at Velocity, there was a fantastic keynote by Richard Cook from Stockholm’s Royal Institute of Technology. The title is How Complex Systems Fail, but Richard actually talks about the opposite idea: how complex systems manage to work most of the time. Best quote: “No one ever calls you up at three o’clock in the morning to tell you the system is running well and they’re very happy with things.”

2. Accept that you can’t performance test for every contingency

Speaking of quotes, I absolutely love this one from Gopal Brugalette (senior performance engineer at Nordstrom, a SOASTA customer), from a webinar we did together last summer:

“I am 100% confident that everything we tested will work just fine.”

It was something of a tongue-in-cheek joke that has a core of truth. Performance tests are a reliable way to guarantee that your site won’t go down… as long as it’s subjected to the same conditions defined within your test parameters. But you can’t test every single variation of every single parameter. When loads are different from what you modelled in your tests, you may have problems. (This is why real-time monitoring is crucial to get to the root cause of the issue as quickly as possible.)
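
To make the "same conditions" caveat concrete, here is a minimal sketch of comparing live load against the envelope a test actually covered. Everything in it is an assumption for illustration: the `LoadEnvelope` class, the `untested_dimensions` function, the two load dimensions, and the numbers are hypothetical and not taken from any particular testing or monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadEnvelope:
    """Upper bounds covered by a load test (hypothetical parameters)."""
    max_concurrent_users: int
    max_requests_per_second: float


def untested_dimensions(envelope, concurrent_users, requests_per_second):
    """Return the names of load dimensions that exceed what was tested."""
    exceeded = []
    if concurrent_users > envelope.max_concurrent_users:
        exceeded.append("concurrent_users")
    if requests_per_second > envelope.max_requests_per_second:
        exceeded.append("requests_per_second")
    return exceeded


# Made-up figures: the test covered 50,000 users and 4,000 requests/second.
envelope = LoadEnvelope(max_concurrent_users=50_000, max_requests_per_second=4_000.0)

# Live traffic has more users than the test modelled, so the test result
# says nothing reliable about this situation.
print(untested_dimensions(envelope, 62_000, 3_500.0))  # ['concurrent_users']
```

An empty list only means the load is inside what was tested on these two dimensions. It says nothing about the parameters the test never varied, which is the point of the quote.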

3. Know that the past is not a predictor of the future

Unexpected load is a culprit in some of the outages that happened over this past weekend. This comes as no surprise when you learn that, for the first Black Friday weekend ever, more people shopped online than in stores. Many retailers experienced utterly unprecedented load volumes, which meant that, from a performance perspective, they were way out on the ecommerce frontier.

Load patterns are unpredictable. Yes, you can and should take past load patterns into account when preparing your site, but this won’t cover you for every contingency. Just because you experienced one type of load pattern for one event, that doesn’t mean the pattern will be consistent for other events.

Over time — even very short periods of time — your site changes, your users change, and your users’ behaviour changes.

There are no constants. Surprises happen. Thanksgiving weekend is ground zero for traffic surprises.
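
One common way to take past load into account without trusting it blindly is to test above the largest peak you have seen. The sketch below assumes a simple multiplier; the `load_test_target` function, the 1.5x headroom factor, and the peak figures are all hypothetical, and the right margin for a real site is a judgment call.

```python
def load_test_target(past_peaks, headroom=1.5):
    """Scale the largest past peak by a headroom factor (an assumed safety margin)."""
    if not past_peaks:
        raise ValueError("need at least one past peak")
    if headroom < 1.0:
        raise ValueError("headroom should be at least 1.0")
    return max(past_peaks) * headroom


# Made-up peak requests/second from three earlier events.
past_peaks_rps = [1_800, 2_400, 3_100]

print(load_test_target(past_peaks_rps))  # 4650.0
```

A multiplier like this buys margin, not certainty: a first-ever shift in shopper behaviour can still blow past any target derived from history.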

4. Be prepared… but don’t be overprepared

When you’re performance testing your site, you can’t obsess over every conceivable worst-case scenario. You have limited resources, so you need to focus them on the most realistic cases versus the worst case.

For example, a worst-case scenario would be everyone in North America coming to your site on Black Friday, resulting in a spectacular crash. This isn’t likely to happen. Rather than falling down a rabbit hole of planning for wildly unrealistic scenarios, focus on what’s more likely to happen.

5. See failure as an opportunity

Outages suck. There’s no sugarcoating that. But if you have to experience one, then you should learn everything you can from it. If you’re in DevOps, then of course you’ll make it your mission to get to the root cause of the problem and develop new testing processes to prevent this issue from recurring. That goes without saying.

But the opportunity for education doesn’t end there.

For example, how did your marketing/social media teams respond to the outage? A talented marketing team can actually make your customers feel better than ever about your brand during an outage by replying quickly and sensitively to concerns, being as transparent as possible about the issue, and offering special promotions to compensate for the problem. Is there room for improvement in your PR processes?

(Again, this is why real-time monitoring of 100% of your user experiences is crucial — so you can be among the first to know that your site is experiencing performance issues, where the issues are happening, and which users are being affected. That way, you can make sure your PR efforts are hitting the right note with the right people.)

6. Embrace continuous improvement

The web is a dynamic space… which means none of us ever get to stand back, slap a layer of varnish on our work, dust off our hands, and exclaim, “There! It’s finished!”

Instead we build, we evolve, we fail (sometimes), we learn, we evolve some more, and so on. We value small evolutionary steps — adding new tools and processes gradually — versus huge overnight changes. We recognize that rigorous performance testing and monitoring don’t guarantee 100% uptime, but they do allow us to fail faster and iterate sooner.

The sites that experienced performance issues over the weekend will (most likely) successfully take steps to ensure that those specific issues will never occur again. Which isn’t to say they won’t experience new issues at some point in the future. Because they inevitably will. We all will.

7. Be aware that page slowdowns can cause as much — or more — damage to your business as outages

I talk about this a lot, but it bears repeating. Holiday outages are stressful, but they’re not the worst performance issue most sites face. If a site goes down, you’ll probably just try it again a few hours later. Most of us accept that these blips happen. But if a site is consistently slow over a longer period of time, you’ll eventually stop visiting altogether.

Over time, slow pages hurt your traffic, they hurt your revenue, they hurt your brand, and they hurt your customer satisfaction levels. If you care about the total customer experience, then you need to focus on more than outages. And you need to care all year round.

Takeaway

There’s no magic formula that will guarantee you a 100% seamless holiday shopping season (or any other special event). The best course of action, not surprisingly, is a customer-first philosophy, the right tools, real-time monitoring, year-round diligence, and the willingness to see failure as a learning opportunity.



Tammy Everts

About the Author


Tammy has spent the past two decades obsessed with the many factors that go into creating the best possible user experience. As senior researcher and evangelist at SOASTA, she explores the intersection between web performance, UX, and business metrics. Tammy is a frequent speaker at events including IRCE, Shop.org Summit, Velocity, and Smashing Conference. She is the author of 'Time Is Money: The Business Value of Web Performance' (O'Reilly, 2016).
