
Automation and AI Are Critical to Incident Response

Jessica Abelson
Transposit

In an era defined by the continuous evolution of technology and the ever-expanding digital landscape, the complexity of modern operations has reached new heights. Businesses continue to embrace cutting-edge applications and technologies to stay competitive — but amidst this complexity, one thing remains unwavering: the need to maintain reliable services and uphold customer satisfaction. Yet there's a disconnect — new research shows that over four in 10 organizations believe their current incident management process is not effective or is only being used by some team members, causing tedious and time-consuming workflows and impacting their ability to maintain reliability at scale.


The reality is incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents. Organizations are experiencing an uptick in incidents with increased downtime, costing them hundreds of thousands — and in some cases millions — of dollars.

Luckily, a majority of respondents are optimistic that generative AI can help address the incident management paradox: 84.5% either believe AI can significantly streamline incident management processes and improve overall efficiency or are excited about the opportunities AI presents for automating certain aspects of incident management.

Rise of Incidents, Knowledge Gaps and Confusing Processes Result in Increased Cost of Downtime

A majority (61.5%) of organizations cited an increase in the amount of time it takes to resolve incidents in the last year, with nearly 8 in 10 respondents saying it takes up to 6 hours on average to resolve incidents from the first alert to resolution. Sixty-three percent of respondents said these downtime-producing incidents (e.g., application outages, service degradation) are putting their organizations at risk of losing up to an average of $499,999 per hour, a nearly 5% increase from 2022. And almost half said downtime can cost anywhere from $100,000 to $2 million.

What's causing the disarray?

Nearly three-quarters (73.9%) of respondents responsible for reliability engineering experienced challenges while trying to solve incidents due to brittle automation scripts, too many manual processes and lack of access to specialized knowledge. What's more, 42.5% said their current incident management process is not effective or is only being used by some team members because of confusing documentation, limited access to tools and reliance on institutional knowledge.

A significant portion of team members are finding it challenging to understand and apply their organization's defined incident management procedures. Only about one-third of organizations report that select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.

Top Barriers to Automation

Implementing automation is a rising challenge for IT and DevOps teams, according to the report findings. One-third of respondents said that only 11-25% of their incident management tasks or workflows are automated, and respondents expressed interest in automating pivotal aspects of the incident lifecycle, such as incident setup, communication protocols, investigative processes and remediation scripts.

Despite the interest in implementing automation, teams cited the following top four barriers:

■ Not enough buy-in from leadership or management (57.1%)

■ Not enough knowledge sharing (54.3%)

■ Inadequate documentation of institutional knowledge and existing processes (54%)

■ Lack of clarity about what to automate (52.4%)

SRE and platform engineering play a vital role in implementing automation, and the survey found that there's a growing emphasis on bolstering these areas in the next 12 months. With the intention to hire more site reliability and platform engineers, over 60% of respondents increased their focus on SRE practices while over half enhanced platform engineering efforts, which highlights the commitment to fortify incident management capabilities.

Human-In-The-Loop AI and Automation Present a Viable Solution to Increased Downtime and MTTR

The results of the report underscore the opportunity for more automation and AI across incident management processes. Over the next year, teams expect to expand their tech stack and plan to implement new AI and automation tools to strengthen incident management processes and decrease mean time to resolution/repair (MTTR).
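As a rough illustration of the metric itself, MTTR can be computed as the average time from first alert to resolution across resolved incidents. The sketch below is a minimal example under that assumption; the function name and the sample timestamps are hypothetical and do not come from the survey.

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average (resolved_at - first_alert_at) over resolved incidents.

    `incidents` is a list of (first_alert_at, resolved_at) pairs;
    resolved_at is None for incidents that are still open.
    Returns None when no incident has been resolved yet.
    """
    durations = [
        resolved - alerted
        for alerted, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)

# Hypothetical sample data: two resolved incidents and one still open.
incidents = [
    (datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 11, 0)),   # 2 hours
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 18, 0)),  # 4 hours
    (datetime(2023, 1, 3, 8, 0), None),                          # open, ignored
]
mttr = mean_time_to_resolution(incidents)
```

Open incidents are left out here so they do not skew the average; a team could reasonably choose to track them separately instead.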

Almost 90% of respondents indicated that integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations. Almost all (96.3%) believe it would be beneficial if the tools their organization used during an incident were integrated through one tool or platform.

For the 79.5% of organizations that have embraced AI in their tech stack, the impact has already been significant, with more than half feeling that AI is making their job better, improving the accuracy and quality of data, making time to incident resolution faster, and streamlining IT operations effectively.

Moreover, an overwhelming majority (90.4%) of respondents believe that leveraging insights from human data — such as archived Slack communications, retrospective interviews, and group feedback — could improve incident management and operational efficiency. The vast majority also agree automation should let humans use judgment at critical decision points to be more reliable and effective — a nearly 10% increase from last year.
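One way to picture automation that leaves judgment to humans at critical decision points is a runbook in which routine steps run automatically while risky steps wait for explicit approval. The sketch below is only an illustration of that idea; `Step`, `run_runbook`, and the `approve` callback are hypothetical names, not part of any tool mentioned in the survey.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], str]      # performs the step and returns a short result
    needs_approval: bool = False   # True for critical decision points

def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order, asking a human (via `approve`) before critical ones."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step):
            # The human declined, so the step is skipped rather than forced.
            log.append(f"{step.name}: skipped (not approved)")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log

# Hypothetical runbook: diagnostics run automatically, the restart needs sign-off.
steps = [
    Step("collect diagnostics", lambda: "done"),
    Step("restart service", lambda: "restarted", needs_approval=True),
]
denied_log = run_runbook(steps, approve=lambda step: False)
approved_log = run_runbook(steps, approve=lambda step: True)
```

In practice the `approve` callback might post to a chat channel or paging tool and wait for a response; here it is a plain function so the example stays self-contained.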

The findings support the notion that human-in-the-loop automation and AI are critical to incident response and operational excellence. The results highlight the importance of a clear incident response lifecycle and emphasize the need for a single SaaS tool or platform that seamlessly integrates incident management tools, human data insights and generative AI to accelerate operational efficiency.

Jessica Abelson is Director of Product Marketing at Transposit

