
MELTDOWN: Single Software Update Causes Largest IT Outage in History

Pete Goldin
APMdigest

A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19, 2024. The impact reverberated through multiple industries around the world. Thousands of flights were canceled. TV stations went offline. Some 911 systems were down. Hospital operations were disrupted. Bank accounts were inaccessible. Many businesses and government services were unable to function.

The problem started with a bug in an automatic update for CrowdStrike's Falcon sensor — which is used to block online cyberattacks — and quickly escalated globally, causing Microsoft Windows systems to crash. CrowdStrike confirmed that the cause was a defect in a single content update for Windows hosts, not a security incident or cyberattack.


The Automation Challenge

"As companies transition to products with fully automated updates, they gain touchless update and patch remediation. However, automation is useless if it's supplied with bad content or configuration," said Kent Feid, Senior Director of Product Management at Quest.

"This event demonstrates that even the best companies can push out patches that cripple environments and, at times, entire essential service industries, and highlights the need for a balance between control and automation when it comes to software releases. While automation is necessary, it is the balanced approach that provides the best control and minimizes risk."

The issue also shines a spotlight on quality assurance. "A simple defect found in a single content update for Windows hosts was enough to cause havoc globally. The lesson to be learned is to integrate quality assurance into the software development lifecycle and to assure business outcomes not just technology," said Tom Reuner, Executive Research Leader, HFS Research.

Managing and Controlling Change

This massive outage shows how relying on outside services can cause major problems — something Catchpoint has been warning companies about for a long time.


"The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This is a reminder you need to manage and control change: Don't blindly update software or change configuration," Mehdi Daoudi, CEO of Catchpoint, said on Friday. "At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down."


Daoudi continued, "Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today's event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails."

Real-Time Observability

"The massive Microsoft outage, caused by a faulty CrowdStrike update, underscores the new reality companies face: globally distributed software platforms that drive business today are a complex web of interdependencies, not all of which are under any one actor's control," explained Antony Falco, VP at Hydrolix.

"A modest mistake can literally grind global business to a halt. The monitoring and observability solutions we rely on to spot these modest mistakes and critical issues have struggled to keep up, even with systems of smaller scale. Clearly we need a new approach to observability — one that is real-time and can simplify the management of tremendous volumes of data streaming in from myriad sources so events can be detected and mitigated before they spread."

Redundancy and Diversity

In addition, this type of event demonstrates that for critical services, redundancy and diversity are key, according to Olaf Kolkman, Principal - Internet Technology, Policy, and Advocacy, and Dan York, Director, Internet Technology, both from the Internet Society. "We need diversity across all aspects of tech, including the operating systems. For example, systems using Linux or Mac OS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down."

They added, "The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again."

The Cost of Downtime

As a final thought, several recent reports show that downtime is costly and affects companies in many ways. Catchpoint's Internet Resilience Report 2024 found that almost half of survey respondents said outages cost them from $1 million to $10 million every month.

Similarly, Splunk's recent report, The Hidden Costs of Downtime, calculates that lost revenue due to downtime averages $49 million, regulatory fines average $22 million, and missed SLA penalties average $16 million annually.

Downtime also negatively impacts customer experience, employee productivity, innovation, brand reputation and even share value. In fact, AP reported that shares of CrowdStrike stock fell nearly 10% on Friday, and Microsoft stock fell more than 3%. These numbers speak louder than words.

Pete Goldin is Editor and Publisher of APMdigest

