
Taking the Business Failure Out of Website Crashes

Mehdi Daoudi

The headlines are filled with news of retail website failures and crashes – most recently the launch of Obamacare and the continuing healthcare.gov crashes due to high visitor load. Some of this attention is due to the media's insatiable appetite for bad news, and some of it is fueled by massive user dissatisfaction, but for the most part websites are simply failing more.

Load-driven performance issues aside, the causes of most failures are unavoidable. Malicious attacks are getting more sophisticated, and natural disasters are taking out datacenters, as we saw with Sandy. Attaining perfection is impossible, so human error will always be a factor, and as we heard at Yahoo, sometimes even a single squirrel can bring business to a halt.

Quite often, however, sites go down because organizations are not sufficiently prepared to manage the risks created by the complexity that surrounds their sites. Most websites are intricate ecosystems of different services, tools and platforms. More players than ever are involved in creating a rich, engaging and profitable experience.

Operations must worry not only about the health of the infrastructure and applications they own and manage, but also about those of their vendors, their vendors’ vendors and so on. Just one broken component in the delivery chain of a website can take down the entire service: the classic single point of failure (SPoF).
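To make the delivery-chain risk concrete, here is a rough sketch of how availability compounds when every component has to be up at the same time. The component names and uptime figures are made up for illustration, and the calculation assumes failures are independent, which real dependencies often are not.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a service that needs every component up at once.

    Assumes component failures are independent (a simplification).
    """
    return prod(availabilities)


# Hypothetical components and uptime fractions, for illustration only.
components = {
    "origin servers": 0.999,
    "CDN": 0.999,
    "DNS provider": 0.999,
    "payment gateway": 0.995,
    "third-party tag": 0.99,
}

overall = chain_availability(components.values())
print(f"Overall availability: {overall:.4%}")
print(f"Weakest single component: {min(components.values()):.4%}")
```

Under these assumptions the overall figure is always lower than that of the weakest component, which is why each additional vendor in the chain deserves its own line in the risk inventory.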

So with all of this in mind, companies need to accept that failure will happen and plan for it in order to minimize its negative business and branding impacts. As Benjamin Franklin once said, "By failing to prepare, you are preparing to fail." By planning, you can get creative, as the New York Times did when it took to social media to keep pushing the news when its site went down in August.

Prevention and Readiness

So, how to plan?

1. Identify every situation that can make your business fail - Dig through every part of your infrastructure and applications, identify who your vendors are, and work out how each one affects your service.

2. Monitor every aspect of your site's availability on a regular basis - Keep an eye on your partners’ servers as well as your own to truly understand the availability of your site.

3. Do capacity testing on all of your servers - Test load balancers, front end, back end, edge servers, vendors – everything.

4. Design your strategy for each case of failure - Ensure you have a capacity plan for the worst case scenario and build it into your release cycle. A capacity plan is especially important before an event or promotion when you expect a lot of traffic to come to your site. Smart companies will stagger promotions to prevent drastic spikes in traffic.
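The monitoring step in the list above could be sketched roughly as follows. This is a minimal illustration, not a replacement for a monitoring service: the endpoint names and URLs are hypothetical placeholders, and the probe treats any 2xx or 3xx response within the timeout as "up", which is an assumption you may want to tighten.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, DNS failures, timeouts and malformed URLs all count as down.
        return False


def check_endpoints(endpoints, probe=http_probe):
    """Probe each named endpoint once and return {name: is_up}."""
    return {name: probe(url) for name, url in endpoints.items()}


def failing(results):
    """Names of the endpoints that did not respond, sorted for stable reporting."""
    return sorted(name for name, is_up in results.items() if not is_up)


# Hypothetical endpoints: list your own pages and your vendors' hosts here.
ENDPOINTS = {
    "home page": "https://www.example.com/",
    "vendor: payments": "https://payments.example.net/health",
}
```

Calling check_endpoints(ENDPOINTS) on a schedule, ideally from more than one location, gives a first approximation of what users and partners actually see. The probe argument is injectable so the logic can be exercised without touching the network.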

As a backup plan, have a lightweight site ready and on hand if your business requires 100 percent uptime. Even if it's simply a bunch of Apache servers hosted in the cloud, have one ready. Include absolutely no third parties or personalization; keep it bare-bones so it can be turned on during any and all types of downtime.
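One way to decide when to turn the lightweight site on is to count consecutive health-check results, so a single blip does not flip traffic back and forth. The sketch below is only an assumption about how such a switch might work; the class name and thresholds are hypothetical, and actually redirecting traffic (DNS, load balancer, CDN rule) is left out.

```python
class FallbackSwitch:
    """Decide when to serve the lightweight site from consecutive check results."""

    def __init__(self, fail_threshold=3, recover_threshold=5):
        self.fail_threshold = fail_threshold        # failures needed to switch over
        self.recover_threshold = recover_threshold  # successes needed to switch back
        self.fallback_active = False
        self._streak = 0  # consecutive results arguing for a state change

    def record(self, primary_ok):
        """Feed one health-check result; return True while the fallback should serve."""
        wants_change = primary_ok if self.fallback_active else not primary_ok
        self._streak = self._streak + 1 if wants_change else 0
        threshold = (
            self.recover_threshold if self.fallback_active else self.fail_threshold
        )
        if self._streak >= threshold:
            self.fallback_active = not self.fallback_active
            self._streak = 0
        return self.fallback_active
```

Requiring more successes to switch back than failures to switch over is a deliberate choice in this sketch: it keeps users on the bare-bones site until the primary has been stable for a while.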

Creative Response to Failures

When you do fail, make it fun and give what could be a frustrated user a chuckle. This leaves a happy memory of your page even though they were unable to access it, and improves the chance that they will return.

A good error page is like a good airport bar. You are still stuck at the airport, but at least you are enjoying yourself.

Recovery

If you do experience a site crash:

1. Offer some incentive for your customers to come back and revisit the site once it's back up - Offer a "failure discount" to keep a customer from immediately going to a competing site to purchase the power drill they originally intended to buy from you.

2. Collect data during the outage - Monitor and understand what is going on to determine the root cause and analyze the events leading up to the downtime.

3. Ask questions - Have we experienced this before? Was my infrastructure at fault? Could this have been avoided? Understanding the failure allows you to adjust your disaster plans accordingly.

4. Share your post-mortem analysis both internally and externally - Let everyone learn what you learned; sharing knowledge is the best way to make the web better, stronger and faster for everyone.
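For the data-collection step above, even a simple log of timestamped up/down checks can be turned into outage windows for the post-mortem. The following is a minimal sketch under that assumption; the sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def outage_windows(checks):
    """Turn time-ordered (timestamp, is_up) samples into (start, end) outage windows.

    An outage still in progress at the last sample has its end set to None.
    """
    windows = []
    start = None
    for timestamp, is_up in checks:
        if not is_up and start is None:
            start = timestamp  # first failed check opens a window
        elif is_up and start is not None:
            windows.append((start, timestamp))  # first good check closes it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Made-up sample: one check per minute, down for minutes 1 and 2.
first_check = datetime(2024, 1, 1, 12, 0)
sample = [
    (first_check + timedelta(minutes=i), is_up)
    for i, is_up in enumerate([True, False, False, True, True])
]

for start, end in outage_windows(sample):
    print(f"Down from {start} until {end}")
```

The windows are only as precise as the check interval, so the real start of an outage may be up to one interval earlier than reported.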
