As Digital Transformation Prevails, Automation Remains a Top Priority for DevOps, ITOps and SRE Teams
June 27, 2022

Jessica Abelson
Transposit


Hybrid work adoption and the accelerated pace of digital transformation are driving an increasing need for automation and site reliability engineering (SRE) practices, according to new research.

The survey polled 1,046 engineering, IT Operations, DevOps and site reliability engineering professionals in the United States, all in VP, Director, Manager or individual contributor roles at organizations with more than 300 employees. Almost half of respondents (48.2%) said automation is a way to decrease Mean Time to Resolution/Repair (MTTR) and improve service management.

The second annual State of DevOps Automation Report, commissioned by Transposit, also revealed that close to 60% of organizations are losing up to half a million dollars per hour to downtime, a critical issue that can be mitigated with better automation and collaboration.

Organizations Still Lack Full Integration of Incident Response Tools

With 90.2% of organizations reporting an increased focus on digital transformation over the past year, paired with the persistence of hybrid and remote work, almost three-quarters (73.4%) of operations teams have expanded their tech stack. However, when asked how well integrated the various tools used during incident response are, only one quarter (24.7%) said all of their tools are integrated through one tool or platform. This means the vast majority (75.3%) don’t have full integration, leaving teams at risk of slow issue detection and analysis and a decrease in overall quality of service reliability and customer experience.

Broader deployment of automation has led developers to recognize that it’s key to reducing downtime and speeding up incident resolution. Three in four organizations implemented a continuous workflow for incident response and service management after adopting a hybrid workforce model.

Manual Processes Are Outdated and Lead to Higher Cost of Downtime and Service Incident Volume

The survey also found that nearly two in five organizations (39.7%) saw their cost of downtime increase during the last year (March 2021 to now). In fact, 58.2% reported that downtime (i.e., application outages, service degradation) cost their organization up to $499,999 per hour on average. Of those who reported an increase in the amount of time it takes to resolve incidents, 45.2% said it was due to a lack of unified communication with teammates, with people collaborating across disparate tools.


"Organizations need to deliver innovation faster and more efficiently than ever before. However, too many SRE, ITOps and DevOps teams are wasting time on disconnected, manual processes and playing a reactive game of whack-a-mole as they try to keep applications running," said Divanny Lamas, CEO of Transposit.

Operations teams are experiencing challenges while trying to solve incidents, including difficulties reaching people with specialized knowledge, inadequate support from collaboration methods and tools and lack of automation. When asked if they have observed any change in the frequency of service incidents that have affected their customers over the course of the last year (March 2021 to now), 62.9% of respondents reported an increase. Of those who said there was an increase in service incidents, respondents said the top reasons why this happened are digital transformation (60.7%), rolling out of new products or product updates (55.1%), methods and tools for collaboration did not adequately support their remote team (49.3%) and organizational change including team member churn, influx of new team members, and M&A activity (45.4%).

The Key to Faster Resolution of Incidents and Less Downtime: SRE Practices Combined with Automation

The rising demand for site reliability engineering is clear, as 75.6% of respondents said there has been an increased focus on SRE practices in their organization in the past 12 months, and of those, 35.1% plan to expand SRE efforts in 2022. Additionally, 65.1% of respondents plan to hire site reliability engineers in the next 12 months.

The need for automation tools to complement organizations’ increased focus on site reliability practices is evident among SREs: 42.3% said the current level of automation is not meeting their organization’s needs and that they are actively pursuing a new solution to address the shortfall.

SREs are still dealing with cumbersome and tedious processes, despite the increased demand for SRE practices. Over half of SREs (56.5%) reported they still manually enter data into an ITSM system or other system of record to keep track of actions that were taken by humans during the resolution of an incident.

To scale, organizations need to implement automation technology to rid teams of these time-consuming manual processes. This is underlined by the fact that a full 100% of the respondents with a VP/Director/Manager SRE title who cited a decrease or no change in service incidents said it was because their organization implemented automation technology to help reduce the number of service incidents. Respondents also said better documentation, process and availability of data during incidents would have the most impact on MTTR, downtime and quality of service reliability.

As seen in the survey, organizations' approaches to automation differ. A majority (63%) responded that their approach to automation was incremental automation, in which they begin by codifying processes and work up to more advanced, fully automated scenarios. When asked whether automation should let humans use their judgment at critical decision points to be more reliable and effective, 80.4% of respondents said yes. Automation that keeps humans in the loop at key decision points increases flexibility and stability while automating repetitive tasks.
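To make the idea of human-in-the-loop automation concrete, here is a minimal sketch in Python of a codified runbook in which routine steps run automatically and a critical step waits for a human decision. It is an illustration only and does not come from the report or from any particular product: the names `Step`, `run_runbook` and `build_demo_runbook`, and the three example steps, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One codified runbook step (hypothetical structure for illustration)."""
    name: str
    action: Callable[[], str]     # performs the step and returns a short result
    needs_approval: bool = False  # True marks a critical decision point


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order; ask `approve` before any step marked critical."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            # The human declined, so the step is skipped and recorded.
            log.append(f"{step.name}: skipped (not approved)")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log


def build_demo_runbook() -> List[Step]:
    """Example steps are assumptions, chosen only to show the pattern."""
    return [
        Step("collect_diagnostics", lambda: "diagnostics gathered"),
        Step("restart_service", lambda: "service restarted", needs_approval=True),
        Step("post_status_update", lambda: "status posted"),
    ]


# Stand-in for a real operator prompt: approve every critical step.
for entry in run_runbook(build_demo_runbook(), approve=lambda name: True):
    print(entry)
```

In practice the `approve` callback would be wired to a chat prompt or a ticket approval rather than a hard-coded answer. Under this sketch, moving incrementally toward fuller automation amounts to clearing `needs_approval` on a step once the team trusts it.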

The top three tasks respondents would like automated are service requests (52.6%), change requests (42.9%) and user provisioning (39.8%). Organizations also see the need to double down on automation: the top three ways they plan to improve their incident management process are to implement new automation tools or applications (48.2%), new communications/collaboration tools or applications (41.5%) and new integration tools or applications (40.6%).

The survey makes it clear that ITOps, DevOps and SRE professionals should consider enhancing service reliability through human-in-the-loop automation, SRE practices and better collaboration methods. Teams enabled with these tools and process advancements are better empowered to spend their time and efforts on delivering innovation and competitive advantages, and ultimately creating more business value.

Jessica Abelson is Director of Product Marketing at Transposit