
The $600 Billion Wake Up Call

Patrick Lin
Splunk

In 2026, the cost of downtime or an outage is no longer just a technical inconvenience; it's a $600 billion wake-up call for global businesses. As our digital ecosystems become more interconnected, each touchpoint introduces new risks and multiplies the consequences when things go wrong. And the data is clear: aggregate downtime costs for Global 2000 companies have surged 50% since 2024, reaching a staggering $600 billion.

According to Splunk's Hidden Costs of Downtime report, organizations face an average of 60 service degradation incidents annually. That count reflects only the incidents organizations detected, so the true number is likely larger. ITOps-related human error is the number one cause of downtime, according to the report. As our IT estates become more complex and distributed, teams encounter more blind spots, increasing the likelihood of mistakes.

The negative impacts of an outage don't stop with the hard costs. Downtime also carries consequences that are harder to track, such as brand erosion, loss of customers and diminished shareholder value. Enterprises must also understand that hidden downtime costs don't just accrue in the heat of an outage; they continue well past the date of the incident. Customer frustration lingers, engineering teams can fall behind on the product roadmap, and marketing effort meant for promoting products is instead spent repairing a damaged reputation. It can take months of careful messaging and flawless service to rebuild the trust that was lost in mere moments.

The Causes of Downtime Vary

Mitigating the cost of downtime first calls for a fundamental understanding of how and why system outages occur. While human error is still the leading culprit for many organizations, phishing scams and malware are often the gateway to the most dangerous type of system downtime: ransomware attacks. Since 2024, ransomware payouts have nearly tripled according to the survey findings, reaching $40M.

After human error, software failure and third-party outages are the most common causes of application- or infrastructure-related downtime. Today's ITOps and engineering teams rely on external providers that have become a primary source of instability. That's why visibility into unowned networks and external dependencies, along with the applications themselves, is a prerequisite for digital resilience.

Harnessing AI and the Practice of Observability to Lower the Cost of Downtime

The reality is that system outages and disruptions are inevitable, and the most resilient organizations implement tools and practices that enable them to respond effectively under pressure. A comprehensive observability practice is a key supporting business function for a resilient organization. More than ever, today's organizations must be able to see, understand, and diagnose every issue within their tech stack, regardless of the type of environment, and as early as possible. This means visibility into any application or infrastructure, whether on-premises or cloud-delivered, along with the implications of its health and performance for business KPIs and user experience. 72% of ITOps and engineering leaders rank end-to-end observability as their top investment priority, ahead even of spending on the infrastructure itself.

More importantly, AI is arming today's threat actors with the ability to raise the cost of system downtime, so today's cyber defenders must also leverage AI. A powerful ally in the fight against downtime, AI can help teams accelerate insight and incident detection. Observability, powered by agentic AI, can independently diagnose issues, execute common fixes, perform code rollbacks, and escalate more important functions for human approval. It should be noted that these human-in-the-loop measures aren't just for safety; they form a governance framework for any AI use within an observability practice. This governance model ensures speed never comes at the expense of trust and accountability.
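The split between actions an agent runs on its own and actions it escalates can be sketched in a few lines of Python. This is a minimal illustration, not any product's API: the `Risk` levels, the `Remediation` type, the `route` function and the example action names are all hypothetical, and a real system would likely classify risk with far more context than a single flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    """Hypothetical two-level risk classification for agent actions."""
    LOW = 1   # common fixes the agent may run on its own
    HIGH = 2  # more important functions that need a human sign-off


@dataclass(frozen=True)
class Remediation:
    name: str
    risk: Risk


def route(action: Remediation, approved_by: Optional[str] = None) -> str:
    """Decide what happens to a remediation the agent has proposed."""
    if action.risk is Risk.LOW:
        # Low-risk actions run immediately, with no human in the loop.
        return f"auto-executed: {action.name}"
    if approved_by is None:
        # High-risk actions wait until a named person approves them.
        return f"escalated for approval: {action.name}"
    # The approver's name is recorded, which supports accountability.
    return f"executed with approval from {approved_by}: {action.name}"


if __name__ == "__main__":
    print(route(Remediation("restart cache node", Risk.LOW)))
    print(route(Remediation("fail over primary database", Risk.HIGH)))
    print(route(Remediation("fail over primary database", Risk.HIGH), "on-call lead"))
```

Keeping the approval decision in one small function means the policy can be reviewed and audited separately from whatever the agent does to diagnose the issue.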

Bouncing Back: A Blueprint for Resilience

Beyond observability, the most resilient organizations bounce back faster by following these four best practices:

1. Treat downtime as a business risk. With key decision-makers, translate technical metrics into business language: connect incidents to profit impact, recovery timelines, and customer trust. This will help get executive attention and support.

2. Design systems for humans. Complex systems can result in more human error. Standardize deployment practices to ensure consistency, accountability, and controlled execution across teams.

3. Make detection and root cause analysis a team sport. Reduce silos by leveraging platforms, tools and workspaces that provide shared data across the SecOps and DevOps teams to encourage collaboration and holistic visibility.

4. Use AI to accelerate insight. When deploying AI to speed up incident detection, root cause analysis, or prioritization, always pair AI's speed with expert human judgment and oversight.
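As one way to picture the first practice, the sketch below turns a technical metric (minutes of downtime) into a rough dollar figure. The formula, the function name `incident_cost` and every number in the example are illustrative assumptions rather than figures from the report, and the estimate covers only direct costs, not the harder-to-track effects such as brand erosion.

```python
def incident_cost(
    minutes_down: float,
    revenue_per_minute: float,
    recovery_cost: float = 0.0,
) -> float:
    """Rough direct cost of one incident: lost revenue plus recovery spend.

    All inputs are assumptions supplied by the caller.
    """
    if minutes_down < 0 or revenue_per_minute < 0 or recovery_cost < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_down * revenue_per_minute + recovery_cost


if __name__ == "__main__":
    # Hypothetical incident: 30 minutes down, $1,000 of revenue per minute,
    # and $5,000 of engineering time spent on recovery.
    print(f"${incident_cost(30, 1_000, 5_000):,.0f}")
```

A figure like this gives executives a concrete starting point, while the longer-term costs still need to be argued separately.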

These tenets, combined with an observability practice run as a core business function, can put a major dent in the cost of downtime while making operational resilience a reality.

Patrick Lin is SVP and GM, Observability at Splunk, a Cisco company

