Automation and AI Are Critical to Incident Response
November 28, 2023

Jessica Abelson

In an era defined by the continuous evolution of technology and the ever-expanding digital landscape, the complexity of modern operations has reached new heights. Businesses continue to embrace cutting-edge applications and technologies to stay competitive — but amidst this complexity, one thing remains unwavering: the need to maintain reliable services and uphold customer satisfaction. Yet there's a disconnect — new research shows that over four in 10 organizations believe their current incident management process is not effective or is only being used by some team members, causing tedious and time-consuming workflows and impacting their ability to maintain reliability at scale.

The reality is incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents. Organizations are experiencing an uptick in incidents with increased downtime, costing them hundreds of thousands — and in some cases millions — of dollars.

Fortunately, most respondents are optimistic that generative AI can help resolve the incident management paradox: 84.5% either believe AI can significantly streamline incident management processes and improve overall efficiency or are excited about the opportunities AI presents for automating certain aspects of incident management.

Rise of Incidents, Knowledge Gaps and Confusing Processes Result in Increased Cost of Downtime

A majority (61.5%) of organizations cited an increase in the amount of time it takes to resolve incidents in the last year, with nearly 8 in 10 respondents saying it takes up to 6 hours on average to resolve incidents from the first alert to resolution. 63% of respondents said these downtime-producing incidents (i.e., application outages, service degradation) are putting their organizations at risk of losing up to an average of $499,999 per hour — a nearly 5% increase from 2022. And almost half said downtime can cost anywhere from $100,000 to $2 million.

What's causing the disarray?

Nearly three-quarters (73.9%) of respondents responsible for reliability engineering experienced challenges while trying to resolve incidents due to brittle automation scripts, too many manual processes and lack of access to specialized knowledge. What's more, 42.5% said their current incident management process is not effective or is only being used by some team members because of confusing documentation, limited access to tools and reliance on institutional knowledge.

A significant portion of team members are finding it challenging to understand and apply their organization's defined incident management procedures. Only about one-third of organizations report that select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.

Top Barriers to Automation

Implementing automation is a growing challenge for IT and DevOps teams, according to the report's findings. One-third of respondents said only 11-25% of their incident management tasks or workflows are automated, and respondents expressed interest in automating pivotal aspects of the incident lifecycle, such as incident setup, communication protocols, investigative processes and remediation scripts.

Despite the interest in implementing automation, teams cited the following top four barriers:

■ Not enough buy-in from leadership or management (57.1%)

■ Not enough knowledge sharing (54.3%)

■ Inadequate documentation of institutional knowledge and existing processes (54%)

■ Lack of clarity about what to automate (52.4%)

SRE and platform engineering play a vital role in implementing automation, and the survey found that there's a growing emphasis on bolstering these areas in the next 12 months. With the intention to hire more site reliability and platform engineers, over 60% of respondents increased their focus on SRE practices while over half enhanced platform engineering efforts, which highlights the commitment to fortify incident management capabilities.

Human-In-The-Loop AI and Automation Present a Viable Solution to Decrease Downtime and MTTR

The results of the report underscore the opportunity for more automation and AI across incident management processes. Over the next year, teams expect to expand their tech stack and plan to implement new AI and automation tools to strengthen incident management processes and decrease mean time to resolution/repair (MTTR).
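For readers less familiar with the metric, MTTR can be computed as the average time from first alert to resolution across a set of incidents. The following is a minimal sketch in Python; the function name and the incident timestamps are hypothetical and are not drawn from the survey.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the alert-to-resolution duration across incidents.

    `incidents` is a list of (first_alert, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual durations, starting from a zero timedelta.
    total = sum((resolved - alerted for alerted, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (first alert, resolution) pairs.
incidents = [
    (datetime(2023, 11, 1, 9, 0), datetime(2023, 11, 1, 11, 0)),   # 2 hours
    (datetime(2023, 11, 3, 14, 0), datetime(2023, 11, 3, 18, 0)),  # 4 hours
    (datetime(2023, 11, 7, 22, 0), datetime(2023, 11, 8, 4, 0)),   # 6 hours
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 4:00:00
```

Tracking this value before and after introducing a new automation is one simple way to check whether it is actually shortening resolution time.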

Almost 90% of respondents indicated that integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations. Almost all (96.3%) believe it would be beneficial if the tools their organization used during an incident were integrated through one tool or platform.

For the 79.5% of organizations that have embraced AI in their tech stack, the impact has already been significant with more than half feeling that AI is making their job better, improving the accuracy and quality of data, making time to incident resolution faster, and streamlining IT operations effectively.

Moreover, an overwhelming majority (90.4%) of respondents believe that leveraging insights from human data — such as archived Slack communications, retrospective interviews, and group feedback — could improve incident management and operational efficiency. The vast majority also agree automation should let humans use judgment at critical decision points to be more reliable and effective — a nearly 10% increase from last year.
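To make the idea of human judgment at critical decision points concrete, here is a minimal Python sketch of a runbook runner that executes routine steps automatically but pauses for approval before a risky one. The `Step` class, the `run_runbook` function, and the step names are illustrative assumptions, not part of the survey or of any particular product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]     # the automated work; returns a status string
    needs_approval: bool = False  # True marks a critical decision point


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order, asking a human before any step that needs approval."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append(f"{step.name}: skipped (not approved)")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log


# Hypothetical runbook: setup and investigation run unattended,
# while remediation waits for a responder's decision.
steps = [
    Step("create incident channel", lambda: "done"),
    Step("collect recent deploys", lambda: "done"),
    Step("roll back latest deploy", lambda: "done", needs_approval=True),
]

# In practice `approve` would prompt a responder (for example, in chat);
# this stub simply declines every request.
log = run_runbook(steps, approve=lambda name: False)
for line in log:
    print(line)
```

The design choice here is that approval is a property of each step rather than of the whole runbook, so teams can automate low-risk work fully while keeping a person in control of actions such as rollbacks.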

The findings support the notion that human-in-the-loop automation and AI are critical to incident response and operational excellence. The results highlight the importance of a clear incident response lifecycle and emphasize the need for a single SaaS tool or platform that seamlessly integrates incident management tools, human data insights and generative AI to accelerate operational efficiency.

Jessica Abelson is Director of Product Marketing at Transposit