
Automation and AI Are Critical to Incident Response

Jessica Abelson
Transposit

In an era defined by the continuous evolution of technology and the ever-expanding digital landscape, the complexity of modern operations has reached new heights. Businesses continue to embrace cutting-edge applications and technologies to stay competitive — but amidst this complexity, one thing remains unwavering: the need to maintain reliable services and uphold customer satisfaction. Yet there's a disconnect — new research shows that over four in 10 organizations believe their current incident management process is not effective or is only being used by some team members, causing tedious and time-consuming workflows and impacting their ability to maintain reliability at scale.


The reality is incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents. Organizations are experiencing an uptick in incidents with increased downtime, costing them hundreds of thousands — and in some cases millions — of dollars.

Luckily, a majority of respondents are optimistic that generative AI can help address the incident management paradox: 84.5% either believe AI can significantly streamline incident management processes and improve overall efficiency or are excited about the opportunities AI presents for automating certain aspects of incident management.

Rise of Incidents, Knowledge Gaps and Confusing Processes Result in Increased Cost of Downtime

A majority (61.5%) of organizations cited an increase in the time it takes to resolve incidents over the last year, with nearly 8 in 10 respondents saying it takes up to 6 hours on average to resolve an incident, from first alert to resolution. Sixty-three percent of respondents said these downtime-producing incidents (e.g., application outages, service degradation) put their organizations at risk of losing up to an average of $499,999 per hour, a nearly 5% increase from 2022. And almost half said downtime can cost anywhere from $100,000 to $2 million.

What's causing the disarray?

Nearly three-quarters (73.9%) of respondents responsible for reliability engineering experienced challenges while resolving incidents due to brittle automation scripts, too many manual processes and lack of access to specialized knowledge. What's more, 42.5% said their current incident management process is not effective or is only being used by some team members because of confusing documentation, limited access to tools and reliance on institutional knowledge.

A significant portion of team members are finding it challenging to understand and apply their organization's defined incident management procedures. Only about one-third of organizations report that select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.

Top Barriers to Automation

Implementing automation is a rising challenge for IT and DevOps teams, according to the report findings. One-third of respondents said only 11-25% of their incident management tasks or workflows are automated, and respondents expressed interest in automating pivotal aspects of the incident lifecycle, such as incident setup, communication protocols, investigative processes and remediation scripts.

Despite the interest in implementing automation, teams cited the following top four barriers:

■ Not enough buy-in from leadership or management (57.1%)

■ Not enough knowledge sharing (54.3%)

■ Inadequate documentation of institutional knowledge and existing processes (54%)

■ Lack of clarity about what to automate (52.4%)

SRE and platform engineering play a vital role in implementing automation, and the survey found that there's a growing emphasis on bolstering these areas in the next 12 months. With the intention to hire more site reliability and platform engineers, over 60% of respondents increased their focus on SRE practices while over half enhanced platform engineering efforts, which highlights the commitment to fortify incident management capabilities.

Human-In-The-Loop AI and Automation Present a Viable Solution to Increased Downtime and MTTR

The results of the report underscore the opportunity for more automation and AI across incident management processes. Over the next year, teams expect to expand their tech stack and plan to implement new AI and automation tools to strengthen incident management processes and decrease mean time to resolution/repair (MTTR).

Almost 90% of respondents indicated that integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations. Almost all (96.3%) believe it would be beneficial if the tools their organization used during an incident were integrated through one tool or platform.

For the 79.5% of organizations that have embraced AI in their tech stack, the impact has already been significant with more than half feeling that AI is making their job better, improving the accuracy and quality of data, making time to incident resolution faster, and streamlining IT operations effectively.

Moreover, an overwhelming majority (90.4%) of respondents believe that leveraging insights from human data — such as archived Slack communications, retrospective interviews, and group feedback — could improve incident management and operational efficiency. The vast majority also agree automation should let humans use judgment at critical decision points to be more reliable and effective — a nearly 10% increase from last year.

The findings support the notion that human-in-the-loop automation and AI are critical to incident response and operational excellence. The results highlight the importance of a clear incident response lifecycle and emphasize the need for a single SaaS tool or platform that seamlessly integrates incident management tools, human data insights and generative AI to accelerate operational efficiency.

Jessica Abelson is Director of Product Marketing at Transposit

