Why AI Is the Differentiator for Operationally Resilient Organizations

Eric Johnson
PagerDuty

In the world of digital-first business, there is no tolerance for service outages. Businesses know that outages are the quickest way to lose money and customers. For smaller organizations, unplanned downtime could even force the business to close.

That's why having "good enough" operational resilience is no longer enough, and minimizing downtime is now a business imperative. In a bid to optimize resilience, many businesses have adopted AI to solve their operations headaches, and this mentality has propelled AI from a tool for early tech adopters to an indispensable part of the operations team's suite.

A new study from PagerDuty, The State of AI-First Operations, reveals that companies actively incorporating AI into operations now view operational resilience as a growth driver rather than a cost center. But how are they achieving it?

Downtime Costs Money and Reputation

The financial stakes couldn't be higher. More than two-thirds (68%) of organizations lose more than $300,000 per hour during IT incidents, and a third lose at least $500,000. For nearly a tenth of organizations, the figure can top $1 million per hour. These high costs place intense pressure on organizations to preserve customer trust, investor confidence and the bottom line.
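To put those hourly figures in perspective, here is a minimal sketch of the arithmetic. The 45-minute incident duration is an illustrative assumption, not a figure from the study, and the function name is hypothetical.

```python
def downtime_cost(hourly_loss_usd: float, minutes: float) -> float:
    """Estimate the cost of an incident, assuming losses accrue evenly over time."""
    return hourly_loss_usd * minutes / 60


# Hourly loss tiers taken from the survey figures quoted above.
HOURLY_LOSS_TIERS_USD = (300_000, 500_000, 1_000_000)

# Assumed incident length, for illustration only.
INCIDENT_MINUTES = 45

for hourly in HOURLY_LOSS_TIERS_USD:
    cost = downtime_cost(hourly, INCIDENT_MINUTES)
    print(f"${hourly:,}/hour for {INCIDENT_MINUTES} min: ${cost:,.0f}")
```

Under that assumption, even an incident resolved in under an hour costs hundreds of thousands of dollars at every tier.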

But it doesn't stop there. Incidents also create more pervasive problems: reduced staff productivity and increased developer burnout. The latter can be particularly insidious when organizations are already struggling to retain their top engineering talent. If staff are continuously dragged out of bed in the middle of the night or pulled away from their work to handle alert pings, they're more likely to leave for a competitor that can offer a better work-life balance. Those left holding the fort will be even more stretched and demotivated.

AI is part of the problem, as well as the solution. As more companies roll out customer service chatbots, coding assistants and business process agents, they also expose themselves to more outage risks. More than four in five businesses report experiencing at least one AI-related outage.

This all creates a clear mandate for the C-suite: reduce the number of incidents and accelerate recovery times, and you will turn resilience into a competitive advantage. Almost all (95%) of survey respondents say their leadership understands this.

The AI Difference

Business and technology leaders are not just understanding the need for operational resilience. They're also taking action.

The "AI pioneers" are more likely (75%) to say they are operationally mature than the organizations that are discussing, but not deploying, the technology (66%). The difference is that mature organizations can recognize the value of AI at every stage of the incident resolution pipeline.

AI-first operations management tools reduce noise and streamline triage by grouping alerts into a single incident, and auto-pausing notifications for transient issues that are often resolved on their own. AI agents can also run auto-diagnostics via one-click runbooks, establishing contributing factors before humans are brought in. Alerts are then directed to the most appropriate subject matter expert (SME) based on expertise, workload and past response times. Together, these features save time and reduce alert fatigue for responders.
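The grouping and routing steps described above can be illustrated with a short sketch. This is not how any particular product works: the class names, the five-minute grouping window and the ranking rule (lowest workload first, then fastest past acknowledgement) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since some reference point
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


@dataclass
class Responder:
    name: str
    expertise: set             # services this person knows
    open_incidents: int        # current workload
    median_ack_seconds: float  # past response time


def group_alerts(alerts, window_seconds=300.0):
    """Fold alerts for the same service into one incident when each
    arrives within window_seconds of the previous alert in that incident."""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest_by_service.get(alert.service)
        if (incident is not None
                and alert.timestamp - incident.alerts[-1].timestamp <= window_seconds):
            incident.alerts.append(alert)
        else:
            incident = Incident(service=alert.service, alerts=[alert])
            incidents.append(incident)
            latest_by_service[alert.service] = incident
    return incidents


def pick_responder(incident, responders):
    """Prefer responders who know the service; among them, choose the one
    with the lowest workload, breaking ties by faster past acknowledgement.
    Assumes responders is non-empty."""
    qualified = [r for r in responders if incident.service in r.expertise]
    candidates = qualified or responders
    return min(candidates, key=lambda r: (r.open_incidents, r.median_ack_seconds))


alerts = [
    Alert("checkout", 0, "latency high"),
    Alert("search", 30, "error rate high"),
    Alert("checkout", 60, "latency still high"),
    Alert("checkout", 1000, "disk usage high"),
]
responders = [
    Responder("ana", {"checkout"}, open_incidents=2, median_ack_seconds=120),
    Responder("ben", {"checkout", "search"}, open_incidents=0, median_ack_seconds=300),
    Responder("cy", {"search"}, open_incidents=0, median_ack_seconds=60),
]

incidents = group_alerts(alerts)
for incident in incidents:
    owner = pick_responder(incident, responders)
    print(incident.service, len(incident.alerts), "alert(s) ->", owner.name)
```

In this toy run, four alerts collapse into three incidents, so responders receive three notifications rather than four. A production system would likely weigh many more signals than this sketch does.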

For more common and recurring incident types, AI agents can take on remediation and recovery autonomously, reducing the need for manual intervention. Their value in digital operations lies in the ability to operate through a continuous cycle of perceiving, reasoning, acting and learning independent of human teams. That's not just useful for remediation, but also tasks like capturing information for post-incident reviews and coordinating on-call schedules for SMEs.

Generative AI (GenAI) also plays a complementary role. It can support SMEs as a chatbot-based assistant, helping them query and investigate incidents in real time, while also enabling proactive and automated customer-facing status updates.

The real differentiation comes from AI that operates across the entire technology stack to anticipate and prevent incidents before they ever impact customers. This shifts digital operations towards a proactive model, freeing SMEs to focus on innovation and step in only during the most challenging incidents.

Beyond Resilience

Organizations are keen to embrace this future, seeing benefits that go beyond operational resilience to broader improvements in how operations teams work. More than two-fifths of organizations surveyed expect AI-first digital operations to improve competitiveness by allowing them more time for innovation and experimentation.

The shift to AI-first operations can also help to mitigate current talent shortages by appealing to existing employees and prospective hires. A growing number of engineers recognize that AI could liberate them from repetitive and manual toil, rather than serve as a potential rival.

Trust in the Future

Not all operations leaders are fully sold on AI. Confidence is higher for tasks like incident analysis than for activities with direct customer impact, which is why many organizations stop short of granting full autonomy in some situations. Keeping a human in the loop remains a sensible way for organizations to strike the right balance between efficiency and control.
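One way to picture a human-in-the-loop policy is as a gate in front of each action an agent proposes. The sketch below is a hypothetical illustration: the `ProposedAction` fields, the 0.9 confidence threshold and the rule that every customer-facing action needs sign-off are assumptions, not findings from the study.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    name: str
    customer_facing: bool
    confidence: float  # agent's own estimate between 0 and 1 (hypothetical)


def requires_human_approval(action: ProposedAction, min_confidence: float = 0.9) -> bool:
    """Customer-facing or low-confidence actions need a human sign-off."""
    return action.customer_facing or action.confidence < min_confidence


def execute(action: ProposedAction, approve: Callable[[ProposedAction], bool]) -> str:
    """Run the action autonomously when policy allows; otherwise ask a human.
    Returns "executed" or "skipped" so the outcome can be audited."""
    if requires_human_approval(action) and not approve(action):
        return "skipped"
    return "executed"


# A routine internal fix runs without asking anyone.
restart = ProposedAction("restart stuck worker", customer_facing=False, confidence=0.95)
print(execute(restart, approve=lambda a: False))

# A customer-facing change waits for a human decision.
status_update = ProposedAction("post status page update", customer_facing=True, confidence=0.99)
print(execute(status_update, approve=lambda a: False))
print(execute(status_update, approve=lambda a: True))
```

The design choice here is that autonomy is the default only for low-impact, high-confidence actions, which mirrors the balance between efficiency and control described above.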

These concerns, however, should not slow the pace of adoption. Boards that commit to AI-driven operations are starting to pull away from their competitors, demonstrating how the function can evolve from reactive response to proactive prevention.

The direction is clear, and the gap will widen for those that delay.

Eric Johnson is Chief Information Officer at PagerDuty

