
Reliability Is the New Bottleneck of Innovation

Ronak Desai
Ciroos

Modern systems are not what they once were. Organizations now rely on distributed systems, event-driven workflows, hybrid and multi-cloud environments, and continuous delivery pipelines. Each adds flexibility, but each also introduces new, often invisible failure modes.

Development speed is no longer the primary bottleneck of innovation. Reliability is.

A Seismic Shift: Complicated to Complex

For more than a decade, digital transformation focused on abstracting infrastructure, and it worked. Engineering teams quickly gained speed, scalability, and flexibility. However, this came with a hidden cost: a fundamental shift in the nature of system complexity.

There's an important distinction worth highlighting. A complicated system can be understood by analyzing its individual parts (think of a car engine or a mechanical watch). A complex system, however, shows emergent behavior that can't be predicted by examining its components in isolation. Modern software systems have crossed that threshold. They're not just complicated; they're complex.

AWS research shows that modern applications typically involve hundreds of microservices working together, communicating with one another, and depending on shared infrastructure. A small change in one service can trigger a chain reaction across the platform.

This shift from complicated to complex deeply affects how enterprises experience and respond to failure.

Failure Is the Norm

Failure used to be gradual and localized in traditional systems. An alert was triggered, an engineer investigated, and the problem was quickly contained. Failures are now sudden and invisible … until they aren't.

Why is this happening?

Factors include hidden service dependencies, retry loops that amplify failures rather than containing them, and external service degradations that lie outside an organization's control. Incident response now requires engineers to simultaneously reason across metrics, logs, traces, configuration changes, external dependencies and historical behavior — usually under immense time constraints and with incomplete information.
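To see why retry loops can amplify a failure rather than contain it, a rough worst-case model helps. The sketch below is illustrative only: it assumes a hypothetical call chain in which every layer makes the same number of attempts whenever the layer beneath it fails, and both function names are invented for this example.

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Worst-case calls reaching the deepest service for ONE user request.

    Assumption: each of `depth` layers makes up to `attempts_per_layer`
    attempts (first try plus retries) whenever the layer below it fails,
    so the attempts multiply at every hop.
    """
    if attempts_per_layer < 1 or depth < 0:
        raise ValueError("attempts_per_layer must be >= 1 and depth >= 0")
    return attempts_per_layer ** depth


def calls_when_only_edge_retries(attempts_per_layer: int, depth: int) -> int:
    """Same chain, but only the outermost layer retries (one common mitigation)."""
    if attempts_per_layer < 1 or depth < 0:
        raise ValueError("attempts_per_layer must be >= 1 and depth >= 0")
    return attempts_per_layer if depth > 0 else 1


if __name__ == "__main__":
    # Three attempts at each of four layers: 3 * 3 * 3 * 3 calls at the bottom.
    print(worst_case_calls(3, 4))              # 81
    print(calls_when_only_edge_retries(3, 4))  # 3
```

Under these assumptions, a struggling service at the bottom of a four-layer chain can see up to 81 times its normal load exactly when it is least able to handle it, which is how a local degradation turns into a platform-wide one.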

The financial stakes of any disruption could not be higher. The Uptime Institute's Annual Outage Analysis found that 54% of outages cost organizations more than $100,000, and 16% cost more than $1 million.

The October 2025 service disruption of Amazon DynamoDB in US-EAST-1 illustrates this. In a rare event, the system's own automation deleted the DNS record for the regional DynamoDB endpoint, leaving it with no valid DNS record. The failure rippled across AWS-provided services and affected consumer platforms such as Spotify, Uber, and Delta, as well as some of Amazon's own products, including Prime Video. While DNS functionality was restored relatively quickly, systems recovered gradually over more than 15 hours, at an estimated cost of $75 million per hour globally.

Observability alone is not enough. It's not about just knowing what's happening. It's about making sense of things quickly and solving them under pressure.

Reliability Is a Knowledge Problem

A small group of senior engineers typically holds all of the cards. As knowledge workers, they understand things most others do not: the system architecture, past incidents and their resolutions, and the small signs that typically precede an issue. When failures occur, organizations rely on these few people to quickly connect the dots.

The problem?

A model like this creates systemic risk. When those engineers are unavailable, time to resolution is significantly longer. Debugging quickly becomes trial and error, slowing recovery. Institutional knowledge doesn't scale across teams, and relying strictly on what a handful of SREs know hurts productivity. In fact, McKinsey's research on developer productivity shows developers spend up to 40% of their time on operational "toil" (maintenance, debugging, and firefighting) rather than building.

Reliability isn't constrained by access to data. It's constrained by access to understanding.

AI SRE: Scaling With Humans

Traditional reliability models were designed for a simpler era of system complexity. They can't keep up with the needs and environments that organizations have today.

AI Site Reliability Engineering (AI SRE) introduces a different model. Gone are the days of waiting for someone to interpret signals. AI SRE continuously analyzes, correlates, and interprets operational data across the entire system. Identifying patterns and root causes as they emerge turns incident response into a proactive process rather than a reactive one.
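As a rough illustration of what "correlating" operational data can mean at its simplest, the sketch below lists the changes that landed shortly before an alert fired. It is a toy example under stated assumptions, not a description of how any AI SRE product works: the `Event` type, the `changes_before_alert` function, and the 30-minute window are all hypothetical, and proximity in time is only a hint about cause, never proof.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical labels: "deploy", "config_change", "alert"
    service: str
    at: datetime


def changes_before_alert(
    alert: Event,
    events: list[Event],
    window: timedelta = timedelta(minutes=30),
) -> list[Event]:
    """Return non-alert events within `window` before the alert, newest first."""
    candidates = [
        e for e in events
        if e.kind != "alert" and timedelta(0) <= alert.at - e.at <= window
    ]
    return sorted(candidates, key=lambda e: e.at, reverse=True)


if __name__ == "__main__":
    noon = datetime(2025, 1, 1, 12, 0)
    history = [
        Event("deploy", "checkout", noon - timedelta(minutes=10)),
        Event("config_change", "dns", noon - timedelta(minutes=25)),
        Event("deploy", "search", noon - timedelta(hours=2)),
    ]
    alert = Event("alert", "checkout", noon)
    for suspect in changes_before_alert(alert, history):
        print(suspect.kind, suspect.service)
```

An experienced engineer performs this kind of lookup from memory during an incident; encoding even a crude version makes the same starting point available to everyone on the team.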

This is about giving human engineers superpowers. AI SRE helps close the gap between incident and resolution by scaling the deep system understanding that only a handful of engineers typically possess. Every team member gains access to the knowledge they need, so operational excellence is spread across the organization rather than held by a few key individuals.

Reliability at scale is a competitive advantage. Systems that fail less and recover faster let teams spend more time building than firefighting.

Innovation is no longer defined by how fast software is built. It's defined by whether it operates reliably. Systems are growing in complexity faster than human teams can track them, reason about them, and resolve their failures. It's not about removing humans from the equation. It's about scaling what makes them effective. We need human-like reasoning at AI scale.

Ronak Desai is CEO and Co-Founder of Ciroos
