Taking the Business Failure Out of Website Crashes
November 20, 2013

Mehdi Daoudi

The headlines are filled with news of retail website failures and crashes, most recently the launch of Obamacare and the continuing healthcare.gov crashes under high visitor load. Some of this attention is due to the media's insatiable appetite for bad news, and some of it is fueled by massive user dissatisfaction, but for the most part websites are simply failing more.

Load-driven performance issues aside, the causes of most failures are unavoidable. Malicious attacks are getting more sophisticated, and natural disasters are taking out datacenters, as we saw with Sandy. Attaining perfection is impossible, so human error will always be a factor, and as we heard at Yahoo, sometimes even a single squirrel can bring business to a halt.

Quite often, however, sites go down because organizations are not sufficiently prepared to manage the risks created by the complexity that surrounds their sites. Most websites are intricate ecosystems of different services, tools and platforms. More players than ever are involved in creating a rich, engaging and profitable experience.

Operations must worry not only about the health of the infrastructure and applications they own and manage, but also about those of their vendors, their vendors’ vendors and so on. Just one broken component in the delivery chain of a website can take down the entire service: the classic single point of failure (SPoF).

With all of this in mind, companies need to accept that failure will happen and plan for it, so that they can minimize the damage to their business and brand. As Benjamin Franklin once said, "By failing to prepare, you are preparing to fail." Planning also leaves room to get creative, as the New York Times did when it took to social media to keep pushing the news when its site went down in August.

Prevention and Readiness

So, how to plan?

1. Identify every situation that can make your business fail - Dig through every part of your infrastructure and applications, identify who your vendors are, and determine how each one affects your service.

2. Monitor every aspect of your site's availability on a regular basis - Keep an eye on your partners’ servers as well as your own to truly understand the availability of your site.

3. Do capacity testing on all of your servers - Test load balancers, front end, back end, edge servers, vendors – everything.

4. Design your strategy for each case of failure - Ensure you have a capacity plan for the worst case scenario and build it into your release cycle. A capacity plan is especially important before an event or promotion when you expect a lot of traffic to come to your site. Smart companies will stagger promotions to prevent drastic spikes in traffic.
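As a rough illustration of steps 1 and 2, the sketch below probes a list of dependencies, both your own and your vendors', and reports which ones are failing. It is a minimal example under stated assumptions, not a replacement for a monitoring service: the names `http_probe`, `check_dependencies` and `failing` are hypothetical, the endpoint URLs are placeholders, and a single HTTP request is only a crude stand-in for real availability monitoring.

```python
import time
import urllib.request
from typing import Callable, Dict, List, NamedTuple


class ProbeResult(NamedTuple):
    name: str       # label for the dependency (hypothetical, chosen by you)
    ok: bool        # True if the probe considered the endpoint healthy
    seconds: float  # how long the probe took


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def check_dependencies(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> List[ProbeResult]:
    """Probe every endpoint once and record the outcome and the elapsed time."""
    results = []
    for name, url in endpoints.items():
        start = time.monotonic()
        ok = probe(url)
        results.append(ProbeResult(name, ok, time.monotonic() - start))
    return results


def failing(results: List[ProbeResult]) -> List[str]:
    """Names of the dependencies whose probe failed."""
    return [r.name for r in results if not r.ok]


if __name__ == "__main__":
    # Placeholder inventory: list your own tiers and every third party.
    inventory = {
        "origin": "https://www.example.com/",
        "cdn-vendor": "https://cdn.example.net/health",
    }
    print(failing(check_dependencies(inventory)))
```

The probe is passed in as a parameter so the same loop can be exercised with a fake probe in tests, or swapped for a check that better reflects what "available" means for a given vendor.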

As a backup plan, have a lightweight site ready and on hand if your business requires 100 percent uptime. Even if it's simply a bunch of Apache servers hosted in the cloud, have one ready. Include absolutely no third parties or personalization; keep it bare-bones so it can be turned on during any type of downtime.

Creative Response to Failures

When you do fail, make it fun and give what could be a frustrated user a chuckle. This leaves users with a happy memory of your page even though they were unable to access it, and makes them more likely to return.

A good error page is like a good airport bar. You are still stuck at the airport, but at least you are enjoying yourself.


If you do experience a site crash:

1. Offer some incentive for your customers to come back and revisit the site once it's back up - Offer a "failure discount" to keep a customer from immediately going to a competing site to purchase the power drill they originally intended to buy from you.

2. Collect data during the outage - Monitor and understand what is going on to determine the root cause and analyze the events leading up to the downtime.

3. Ask questions - Have we experienced this before? Was my infrastructure at fault? Could this have been avoided? Understanding the failure allows you to adjust your disaster plans accordingly.

4. Share your post-mortem analysis both internally and externally - Let everyone learn what you learned; sharing knowledge is the best way to make the web better, stronger and faster for everyone.
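For step 2, even a simple log of timestamped up/down samples can be turned into figures for a post-mortem. The sketch below is one possible way to do that, assuming the samples come from whatever monitoring you already run; the function names `outage_windows` and `total_downtime` are hypothetical, and an outage that is still ongoing at the last sample is reported as ending at that sample.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Sample = Tuple[datetime, bool]        # (time of check, site was up?)
Window = Tuple[datetime, datetime]    # (first failed check, first good check)


def outage_windows(samples: Iterable[Sample]) -> List[Window]:
    """Group consecutive failed samples into outage windows."""
    windows: List[Window] = []
    start = None
    last_ts = None
    for ts, ok in sorted(samples, key=lambda s: s[0]):
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends at first good sample
            start = None
        last_ts = ts
    if start is not None:
        # Still down at the end of the log: close the window at the last sample.
        windows.append((start, last_ts))
    return windows


def total_downtime(windows: List[Window]) -> timedelta:
    """Sum the length of all outage windows."""
    return sum((end - begin for begin, end in windows), timedelta())
```

The accuracy of these numbers depends on how often the site is sampled: an outage shorter than the sampling interval may not appear at all.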

Mehdi Daoudi is CEO and Co-Founder of Catchpoint