As technology evolves and the digital landscape expands, modern operations have grown steadily more complex. Businesses continue to adopt new applications and technologies to stay competitive, but one requirement remains constant: maintaining reliable services and customer satisfaction. Yet there's a disconnect. New research shows that more than four in 10 organizations believe their current incident management process is not effective or is only being used by some team members, causing tedious and time-consuming workflows and impacting their ability to maintain reliability at scale.
The reality is that incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Although nearly 60% of ITOps and DevOps professionals report having a defined incident management process that's fully documented in one place, and over 70% say they have a level of automation that meets their needs, teams are unable to resolve incidents quickly. Organizations are experiencing an uptick in incidents with increased downtime, costing them hundreds of thousands (and in some cases millions) of dollars.
Luckily, a majority of respondents are optimistic that generative AI can help address the incident management paradox: 84.5% either believe AI can significantly streamline incident management processes and improve overall efficiency, or are excited about the opportunities AI presents for automating certain aspects of incident management.
Rise of Incidents, Knowledge Gaps and Confusing Processes Result in Increased Cost of Downtime
A majority (61.5%) of organizations cited an increase over the last year in the time it takes to resolve incidents, with nearly 8 in 10 respondents saying it takes up to 6 hours on average to get from first alert to resolution. Sixty-three percent of respondents said these downtime-producing incidents (e.g., application outages, service degradation) put their organizations at risk of losing up to an average of $499,999 per hour, a nearly 5% increase from 2022. Almost half said downtime can cost anywhere from $100,000 to $2 million.
What's causing the disarray?
Nearly three-quarters (73.9%) of respondents responsible for reliability engineering experienced challenges while trying to resolve incidents due to brittle automation scripts, too many manual processes and lack of access to specialized knowledge. What's more, 42.5% said their current incident management process is not effective or is only being used by some team members because of confusing documentation, limited access to tools and reliance on institutional knowledge.
A significant portion of team members are finding it challenging to understand and apply their organization's defined incident management procedures. Only about one-third of organizations report that select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.
Top Barriers to Automation
Implementing automation is a rising challenge for IT and DevOps teams, according to the report's findings. One-third of respondents said that only 11-25% of their incident management tasks or workflows are automated. Respondents also expressed interest in automating pivotal aspects of the incident lifecycle, such as incident setup, communication protocols, investigative processes and remediation scripts.
Despite the interest in implementing automation, teams cited the following top four barriers:
■ Not enough buy-in from leadership or management (57.1%)
■ Not enough knowledge sharing (54.3%)
■ Inadequate documentation of institutional knowledge and existing processes (54%)
■ Lack of clarity about what to automate (52.4%)
SRE and platform engineering play a vital role in implementing automation, and the survey found that there's a growing emphasis on bolstering these areas in the next 12 months. With the intention to hire more site reliability and platform engineers, over 60% of respondents increased their focus on SRE practices while over half enhanced platform engineering efforts, which highlights the commitment to fortify incident management capabilities.
Human-In-The-Loop AI and Automation Present a Viable Solution to Increased Downtime and MTTR
The results of the report underscore the opportunity for more automation and AI across incident management processes. Over the next year, teams expect to expand their tech stack and plan to implement new AI and automation tools to strengthen incident management processes and decrease mean time to resolution/repair (MTTR).
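For readers who want to track the metric themselves, MTTR can be computed as the average time from first alert to resolution across resolved incidents. The Python sketch below is illustrative only: the function name and the sample timestamps are hypothetical and are not drawn from the survey data.

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average duration from first alert to resolution.

    `incidents` is a list of (first_alert, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - alerted for alerted, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical sample data, not survey results.
incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 11, 0)),   # 2 hours
    (datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 3, 20, 0)),  # 6 hours
    (datetime(2023, 5, 7, 1, 30), datetime(2023, 5, 7, 2, 30)),  # 1 hour
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 3:00:00
```

Measuring from the first alert, rather than from when a responder picks up the incident, matches the "first alert to resolution" framing used in the survey.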
Almost 90% of respondents indicated that integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations. Almost all (96.3%) believe it would be beneficial if the tools their organization used during an incident were integrated through one tool or platform.
For the 79.5% of organizations that have embraced AI in their tech stack, the impact has already been significant with more than half feeling that AI is making their job better, improving the accuracy and quality of data, making time to incident resolution faster, and streamlining IT operations effectively.
Moreover, an overwhelming majority (90.4%) of respondents believe that leveraging insights from human data — such as archived Slack communications, retrospective interviews, and group feedback — could improve incident management and operational efficiency. The vast majority also agree automation should let humans use judgment at critical decision points to be more reliable and effective — a nearly 10% increase from last year.
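One way to picture automation that leaves judgment to humans at critical decision points is a runbook in which routine steps run automatically while risky steps wait for approval. The Python sketch below is a minimal illustration under that assumption; the `Step` class, `run_runbook` function and step names are hypothetical and do not represent any particular product's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], str]
    needs_approval: bool = False  # True marks a critical decision point

def run_runbook(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run each step in order, asking a human before any gated step."""
    log = []
    for step in steps:
        # Gated steps run only if the human approver says yes.
        if step.needs_approval and not approve(step):
            log.append(f"skipped: {step.name}")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log

# Hypothetical runbook: diagnostics run automatically, the restart is gated.
steps = [
    Step("collect diagnostics", lambda: "logs gathered"),
    Step("restart service", lambda: "restarted", needs_approval=True),
]

# Here the approver declines every gated step; in practice `approve`
# might prompt an on-call engineer in chat or a terminal.
result = run_runbook(steps, approve=lambda step: False)
print(result)  # ['collect diagnostics: logs gathered', 'skipped: restart service']
```

Passing the approval check in as a function keeps the human decision separate from the automation itself, so the same runbook can be driven by a chat prompt, a command-line confirmation or a test stub.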
The findings support the notion that human-in-the-loop automation and AI are critical to incident response and operational excellence. The results highlight the importance of a clear incident response lifecycle and emphasize the need for a single SaaS tool or platform that seamlessly integrates incident management tools, human data insights and generative AI to accelerate operational efficiency.