AI Scale Is Outpacing Infrastructure - and IT Leaders Are Running Out of Time

Rob Reid
Cockroach Labs

AI workloads require an enormous amount of computing power. So much so that discussions around putting additional data centers in space are heating up (it's actually very interesting and involves arranging them in helio-synchronous orbits, but I digress). What's also becoming abundantly clear is just how quickly AI's computing needs are leading to enterprise systems failure.

According to Cockroach Labs' State of AI Infrastructure 2026 report, enterprise systems are much closer to failure than their organizations realize. The report, which is based on a global survey of 1,125 senior cloud architects, engineers, and technology executives, suggests AI scale could cause widespread failures in as little as one year — making it a clear risk for business performance and reliability.

This is one "pulse of the industry" that IT leaders can't afford to miss, because its implications are both far-reaching and immediate. Several storylines jump off the page.

AI Workloads Are Growing Faster Than Infrastructure Can Handle

AI doesn't follow normal business hours, sleep, or take breaks to eat or watch the kids like humans do. It doesn't follow predictable usage patterns. And it doesn't show signs of slowing anytime soon.

A full 100% of the report's respondents expect AI workloads at their organization to grow in the next year. More than 60% expect workloads to increase by at least 20%. So, we know AI deployments will only grow larger, but what does this mean for the underlying infrastructure these systems rely upon?

For years, IT leaders have relied upon trusted historical patterns to determine how much computing power they'll require to support the organization. This strategy is no longer viable. Architects must assume that AI-driven load will far exceed previous forecasts, and design for volatility rather than averages.
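To make "volatility rather than averages" concrete, here is a minimal sketch of sizing against a high percentile of observed load instead of the mean. The function names, the 99th-percentile choice, the 1.5x headroom factor, and the sample data are all illustrative assumptions, not figures from the report.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def capacity_targets(requests_per_sec, headroom=1.5):
    """Compare average load with a tail-based provisioning target.

    headroom is a hypothetical safety multiplier applied to the p99 load.
    """
    mean_load = statistics.mean(requests_per_sec)
    p99_load = percentile(requests_per_sec, 99)
    return {
        "mean": mean_load,
        "p99": p99_load,
        "provision_for": p99_load * headroom,
    }


# Hypothetical samples: mostly quiet, with short bursts of agent-driven traffic.
samples = [100] * 95 + [900] * 5
targets = capacity_targets(samples)
print(targets)
```

With this made-up data the mean sits far below the burst level, so a system sized for the average would be overwhelmed whenever a burst arrives. Sizing from the tail is one simple way to plan for that.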

The One-Year Outage Countdown Is On

Perhaps the most troubling takeaway from the report is just how close most organizations are to experiencing systems failure related to AI scale. 83% of respondents expect AI-driven demand to push their data infrastructure to failure within just two years. One-third believe it'll occur within the next 11 months.

There are a number of factors at play. AI innovations in recent years have made it possible for agents to operate continuously, completing transactions in real time, personalizing responses for consumers, and just about anything else you can imagine. Much of the enterprise infrastructure deployed today was engineered for an entirely different era and is set to woefully underserve the organizations that depend on it.

For many organizations, systems that worked well in 2022 are in grave danger of being overwhelmed without significant upgrades. IT leaders need to treat AI-driven systems failure as an immediate operational risk, not something to worry about tomorrow.

The Enormous Cost of Outages

One of the most frequently discussed (and rarely agreed upon) aspects of an outage is the financial cost to the organization. Calculations must factor in the length of the outage, how many customers were impacted, how damaging it was to customer satisfaction, and more. In many ways, the cost is incalculable and depends on the individual case.

It's startling to discover that 98% of global tech leaders expect that one hour of AI-related downtime would cost their business at least $10,000, and nearly two-thirds believe losses would exceed $100,000 per hour. Few metrics are more urgent to understand and track.
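As a rough illustration of how those hourly figures add up, the sketch below converts an availability target into expected downtime hours per year and multiplies by an hourly cost. The availability targets are assumptions chosen for the example; only the $10,000 and $100,000 hourly thresholds come from the survey.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability):
    """Expected downtime per year for an availability fraction such as 0.999."""
    return HOURS_PER_YEAR * (1 - availability)


def annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly outage cost at a flat hourly cost (a simplification)."""
    return annual_downtime_hours(availability) * cost_per_hour


# Hypothetical availability targets paired with the survey's hourly thresholds.
for availability in (0.99, 0.999, 0.9999):
    for cost_per_hour in (10_000, 100_000):
        cost = annual_downtime_cost(availability, cost_per_hour)
        print(f"{availability:.2%} available at ${cost_per_hour:,}/hour: ${cost:,.0f}/year")
```

A flat hourly rate ignores harder-to-measure effects such as lost customer trust, so treat the result as a floor rather than an estimate.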

As AI workloads continue to grow and accelerate, the timeline before an outage occurs becomes shorter and significant financial risks present themselves. And with many outages caused by random spikes in demand, leaders need to build systems to withstand both scale and unpredictability.

Leadership Misalignment

These factors present a golden opportunity for technology leaders to build a strong business case for modernizing and updating their data architectures. There's just one problem: most leadership teams aren't aware of the risks yet.

According to the survey, 63% of tech leaders say their leadership teams underestimate how quickly AI demands will outpace existing data infrastructure. This gap in knowledge is occurring at a time when nearly every single respondent (99.6%) acknowledges that investment in AI scalability is a priority in the coming year.

The big takeaway is that while companies have been investing heavily in AI, their spending has skewed towards reactive product upgrades instead of essential infrastructure needs. If a significant portion of an organization's AI investment is not dedicated to modernizing architecture for continuous, agent-driven scale, it's in for a rude awakening.

The Opportunity Ahead

While stark, these findings are ultimately not too surprising. Database infrastructures have been approaching end-of-life for many years now, and the past few years' explosion in AI-driven demand only speeds up that timeline.

The findings also highlight several key priorities as enterprises approach the one-to-two-year failure countdown.

First, IT leaders must re-architect their systems for continuous, machine-driven load. Do not make assumptions about peaks and troughs; rather, assume every time of day could be a peak.

When designing this modernized architecture, another critical consideration is that resilience is just as important as performance. AI exacerbates failures that may have already been disastrous for organizations, so reliability must come first. Add to this the stampeding herd effect (often called the thundering herd problem), in which not only humans but also agents return to a recovered system all at once, and the risk of immediate and repeated failures cannot be ignored.
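One common client-side mitigation for the stampeding herd effect is exponential backoff with random jitter, so that retrying clients and agents spread out instead of returning in lockstep. The article does not prescribe this technique. The sketch below is an illustration with hypothetical names and defaults (`backoff_delay`, `call_with_retry`, a 0.5-second base, and a 30-second cap).

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retry(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Call operation(), retrying on ConnectionError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, rng=rng))


# Usage: a stand-in operation that fails twice, then succeeds.
state = {"calls": 0}


def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("service still recovering")
    return "ok"


recorded_delays = []
result = call_with_retry(flaky_query, sleep=recorded_delays.append, rng=random.Random(0))
print(result, len(recorded_delays))
```

Because each client draws its own random delay, a recovering system sees retries spread across the backoff window instead of one synchronized wave. Server-side protections such as rate limiting would complement this, but they are outside the scope of the sketch.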

Finally, given the significant gap between tech leaders and the C-suite, achieving executive buy-in from the outset is crucial. Future infrastructure will only be as resilient as executives allow it to be, so they must be on board from the get-go.

The countdown is on. How will your business respond?

Rob Reid is Technical Evangelist at Cockroach Labs

