Preventing Outages in 2023: What We Can Learn from Recent Failures

"What the recent failures from Internet giants demonstrate is that the question of the next outage is not if, but when," says Dritan Suljoti, Chief Product and Technology Officer of Catchpoint, referencing the company's new white paper, Preventing Outages in 2023: What We Learned from Recent Failures. "Moreover, the downstream effect of major outages to essential Internet infrastructure, such as cloud platforms, CDNs or DNS providers, means that no company is immune, no matter how well prepared they think they are. The white paper demonstrates why it's so important for all of us to be proactive to reduce Mean Time to Repair (MTTR) when the next outage occurs." 

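As a rough illustration of the MTTR metric mentioned above, the sketch below averages the time between detection and resolution across a set of incidents. The function name and the incident timestamps are hypothetical and are not taken from the white paper.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Return the average (resolved - detected) duration across incidents.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log, for illustration only.
incidents = [
    (datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 9, 45)),
    (datetime(2023, 2, 3, 14, 0), datetime(2023, 2, 3, 16, 15)),
]
mttr = mean_time_to_repair(incidents)
```

Tracking this figure over time is one way to see whether runbooks and recovery practice are actually shortening outages.
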
Key lessons from the past

■ Develop an Internet Performance Monitoring strategy that allows you to monitor precisely what customers, workforce, and other users expect and build an Experience Score. 

■ Monitor more than what is under your direct control: map your Internet stack to ensure you are monitoring every component you rely on to deliver your content (including DNS, CDN, ISP, BGP, TCP configuration, SSL, and other cloud services).

■ Automate intelligently: design and test automation to ensure there are no bugs hiding in the code.

■ Be prepared to take fast action to remediate outages as they occur, for example, switching to a backup solution or dropping the third party causing the issue. Develop runbooks and practice recovery.

■ Whenever change is scheduled, ensure your team is ready for any outages that may occur (intentionally or not) with a crisis call plan that includes a communication plan and templates, a plan to mitigate failures from third parties, and a best-practice monitoring and observability plan.
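
The lesson about monitoring every layer of the Internet stack can be sketched as a small check runner: one probe per layer, each timed and reported separately so a failure points at a specific component. This is a minimal sketch under stated assumptions, not Catchpoint's method. `check_dns`, `check_tcp`, `run_checks`, and `example_checks` are hypothetical names, and only two layers (DNS and TCP) are probed using the Python standard library.

```python
import socket
import time


def check_dns(hostname):
    """Raise an exception if the hostname does not resolve."""
    socket.getaddrinfo(hostname, 443)


def check_tcp(hostname, port=443, timeout=3.0):
    """Raise an exception if a TCP connection cannot be opened."""
    with socket.create_connection((hostname, port), timeout=timeout):
        pass


def run_checks(checks):
    """Run each named check and record success, duration, and error text.

    `checks` maps a layer name to a zero-argument callable that raises
    on failure.
    """
    results = {}
    for name, check in checks.items():
        start = time.monotonic()
        try:
            check()
            ok, error = True, None
        except Exception as exc:  # report any failure instead of aborting
            ok, error = False, str(exc)
        results[name] = {
            "ok": ok,
            "seconds": time.monotonic() - start,
            "error": error,
        }
    return results


def example_checks(hostname):
    """Build a hypothetical check set for one hostname (assumed layers)."""
    return {
        "dns": lambda: check_dns(hostname),
        "tcp": lambda: check_tcp(hostname),
    }
```

Calling `run_checks(example_checks("example.com"))` would return one entry per layer. Probes for the other layers named above, such as CDN, BGP, or TLS certificate checks, could be added to the same dictionary.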

"Given the impact of serious outages to the bottom line, not to mention the long-tail impact to brand and reputation, amidst a landscape of increased Internet reliance alongside ever-growing Internet fragility and greater and greater complexity, the need for community learnings from past failures to be shared and practical advice disseminated around stemming future major incidents and ensuring Internet Resilience is imperative," says Gerardo Dada, CMO at Catchpoint. "We believe this white paper offers an invaluable deep dive into recent outages and key lessons that all of us can learn from to prevent (or mitigate the consequences of) the next major outage."
