Automation and AI Are Critical to Incident Response

Jessica Abelson
Transposit

In an era defined by the continuous evolution of technology and the ever-expanding digital landscape, the complexity of modern operations has reached new heights. Businesses continue to embrace cutting-edge applications and technologies to stay competitive — but amidst this complexity, one thing remains unwavering: the need to maintain reliable services and uphold customer satisfaction. Yet there's a disconnect — new research shows that over four in 10 organizations believe their current incident management process is not effective or is only being used by some team members, causing tedious and time-consuming workflows and impacting their ability to maintain reliability at scale.

The reality is incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents. Organizations are experiencing an uptick in incidents with increased downtime, costing them hundreds of thousands — and in some cases millions — of dollars.

Luckily, a majority of respondents are optimistic that generative AI can help address the incident management paradox: 84.5% either believe AI can significantly streamline incident management processes and improve overall efficiency, or are excited about the opportunities AI presents for automating certain aspects of incident management.

Rise of Incidents, Knowledge Gaps and Confusing Processes Result in Increased Cost of Downtime

A majority (61.5%) of organizations cited an increase in the amount of time it takes to resolve incidents in the last year, with nearly 8 in 10 respondents saying it takes up to 6 hours on average to get from the first alert to resolution. 63% of respondents said these downtime-producing incidents (e.g., application outages, service degradation) are putting their organizations at risk of losing up to an average of $499,999 per hour, a nearly 5% increase from 2022. And almost half said downtime can cost anywhere from $100,000 to $2 million.

What's causing the disarray?

Nearly three-quarters (73.9%) of respondents responsible for reliability engineering experienced challenges while trying to resolve incidents due to brittle automation scripts, too many manual processes and lack of access to specialized knowledge. What's more, 42.5% said their current incident management process is not effective or is only being used by some team members because of confusing documentation, limited access to tools and reliance on institutional knowledge.

A significant portion of team members are finding it challenging to understand and apply their organization's defined incident management procedures. Only about one-third of organizations report that select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.

Top Barriers to Automation

Implementing automation is a rising challenge for IT and DevOps teams, according to the report findings. One-third of respondents said that only 11-25% of their incident management tasks or workflows are automated, and respondents expressed interest in automating pivotal aspects of the incident lifecycle, such as incident setup, communication protocols, investigative processes and remediation scripts.

Despite the interest in implementing automation, teams cited the following top four barriers:

■ Not enough buy-in from leadership or management (57.1%)

■ Not enough knowledge sharing (54.3%)

■ Inadequate documentation of institutional knowledge and existing processes (54%)

■ Lack of clarity about what to automate (52.4%)

SRE and platform engineering play a vital role in implementing automation, and the survey found that there's a growing emphasis on bolstering these areas in the next 12 months. With the intention to hire more site reliability and platform engineers, over 60% of respondents increased their focus on SRE practices while over half enhanced platform engineering efforts, which highlights the commitment to fortify incident management capabilities.

Human-In-The-Loop AI and Automation Present a Viable Solution to Increased Downtime and MTTR

The results of the report underscore the opportunity for more automation and AI across incident management processes. Over the next year, teams expect to expand their tech stack and plan to implement new AI and automation tools to strengthen incident management processes and decrease mean time to resolution/repair (MTTR).

Almost 90% of respondents indicated that integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations. Almost all (96.3%) believe it would be beneficial if the tools their organization used during an incident were integrated through one tool or platform.

For the 79.5% of organizations that have embraced AI in their tech stack, the impact has already been significant with more than half feeling that AI is making their job better, improving the accuracy and quality of data, making time to incident resolution faster, and streamlining IT operations effectively.

Moreover, an overwhelming majority (90.4%) of respondents believe that leveraging insights from human data — such as archived Slack communications, retrospective interviews, and group feedback — could improve incident management and operational efficiency. The vast majority also agree automation should let humans use judgment at critical decision points to be more reliable and effective — a nearly 10% increase from last year.
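To make the idea of human judgment at critical decision points concrete, here is a minimal sketch of an automated runbook that pauses for approval before a risky step. It is an illustration only, not something described in the survey: the `Step` and `run_runbook` names, the `needs_approval` flag and the example steps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step. All names here are hypothetical."""
    name: str
    action: Callable[[], str]      # performs the step and returns a short status
    needs_approval: bool = False   # True marks a critical decision point


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order, asking `approve` before any step flagged as critical."""
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            # The human declined, so the automation leaves this step alone.
            log.append(f"{step.name}: skipped (not approved)")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log


# Example: routine setup runs unattended, the restart waits for a person.
example_steps = [
    Step("open incident channel", lambda: "done"),
    Step("restart payment service", lambda: "restarted", needs_approval=True),
]

# Stand-in approver that always declines; a real one would prompt a responder.
for line in run_runbook(example_steps, approve=lambda name: False):
    print(line)
```

Passing the approver in as a function keeps the automation itself unchanged whether the decision comes from a chat prompt, a ticket or an on-call engineer at a terminal.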

The findings support the notion that human-in-the-loop automation and AI are critical to incident response and operational excellence. The results highlight the importance of a clear incident response lifecycle and emphasize the need for a single SaaS tool or platform that seamlessly integrates incident management tools, human data insights and generative AI to accelerate operational efficiency.

Jessica Abelson is Director of Product Marketing at Transposit
