MELTDOWN: Single Software Update Causes Largest IT Outage in History

Pete Goldin
APMdigest

A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19, 2024. The impact reverberated through multiple industries around the world. Thousands of flights were canceled. TV stations went offline. Some 911 systems were down. Hospital operations were disrupted. Bank accounts were inaccessible. Many businesses and government services were unable to function.

The problem started with a bug in an automatic update for CrowdStrike's Falcon sensor — which is used to block online cyberattacks — and quickly escalated globally, causing Microsoft Windows systems to crash. CrowdStrike confirmed that the cause was a defect in a single content update for Windows hosts, not a security incident or cyberattack.


The Automation Challenge

"As companies transition to products with fully automated updates, they gain touchless update and patch remediation. However, automation is useless if it's supplied with bad content or configuration," said Kent Feid, Senior Director of Product Management at Quest.

"This event demonstrates that even the best companies can push out patches that cripple environments and, at times, entire essential service industries, and highlights the need for a balance between control and automation when it comes to software releases. While automation is necessary, it is the balanced approach that provides the best control and minimizes risk."

The issue also shines a spotlight on quality assurance. "A simple defect found in a single content update for Windows hosts was enough to cause havoc globally. The lesson to be learned is to integrate quality assurance into the software development lifecycle and to assure business outcomes not just technology," said Tom Reuner, Executive Research Leader, HFS Research.

Managing and Controlling Change

This massive outage shows how relying on outside services can cause major problems — something Catchpoint has been warning companies about for a long time.

"The scale of today's global IT outage is unparalleled in recent history. It serves as a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This is a reminder you need to manage and control change: Don't blindly update software or change configuration," Mehdi Daoudi, CEO of Catchpoint, said on Friday. "At any moment, even the smallest oversight or piece of unpreparedness can bring systems — and consequently businesses — down."

Daoudi continued, "Preparation and visibility are key, not just to prevent such outages but to mitigate the vast financial risks they pose. The fallout from today's event will likely be measured not just in the disruption of services but in exponential financial losses worldwide, potentially amounting to millions or even billions in lost revenue. It highlights a critical vulnerability: our increasing dependency on digital infrastructure can translate into staggering costs when that infrastructure fails."

Real-Time Observability

"The massive Microsoft outage, caused by a faulty CrowdStrike update, underscores the new reality companies face: globally distributed software platforms that drive business today are a complex web of interdependencies, not all of which are under any one actor's control," explained Antony Falco, VP at Hydrolix.

"A modest mistake can literally grind global business to a halt. The monitoring and observability solutions we rely on to spot these modest mistakes and critical issues have struggled to keep up, even with systems of smaller scale. Clearly we need a new approach to observability — one that is real-time and can simplify the management of tremendous volumes of data streaming in from myriad sources so events can be detected and mitigated before they spread."

Redundancy and Diversity

In addition, this type of event demonstrates that for critical services, redundancy and diversity are key, according to Olaf Kolkman, Principal - Internet Technology, Policy, and Advocacy, and Dan York, Director, Internet Technology, both from the Internet Society. "We need diversity across all aspects of tech, including the operating systems. For example, systems using Linux or Mac OS were not affected by this particular issue. We need to ensure that our systems and networks use a range of different products and services so that an issue with one system will not bring them all down."

They added, "The reality is that in our world of complex, interconnected systems, incidents like this happen. They have happened in the past and they will happen in the future. The important part is how we learn from them and how we improve the resilience of our systems, so that similar issues do not happen again."

The Cost of Downtime

As a final thought, several recent reports show that downtime is costly and affects companies in many ways. Catchpoint's Internet Resilience Report 2024 found that almost half of survey respondents said outages cost them from $1 million to $10 million every month.

Similarly, Splunk's recent report, The Hidden Costs of Downtime, calculates that lost revenue due to downtime averages $49 million, regulatory fines average $22 million, and missed SLA penalties average $16 million annually.

Downtime also negatively impacts customer experience, employee productivity, innovation, brand reputation and even share value. In fact, AP reported that shares of CrowdStrike stock fell nearly 10% on Friday, and Microsoft stock fell more than 3%. These numbers speak louder than words.

Pete Goldin is Editor and Publisher of APMdigest

