Maximizing Resilience: Insights from the 2025 SRE Report

Leo Vasiliou
Catchpoint

As the digital landscape expands, the stakes for delivering reliable and seamless online experiences have never been higher. In the past year, site reliability engineering (SRE) has continued to evolve into a critical driver of operational success, shaping how organizations approach resilience, collaboration, and customer satisfaction.

The 2025 Catchpoint SRE Report dives into the forces transforming the SRE landscape, exploring both the challenges and opportunities ahead. Let's break down the key findings and what they mean for SRE professionals and the businesses relying on them.

Slow Is the New Down

Performance is about more than just uptime; it's also about speed. This year's report reveals that 53% of organizations believe poor performance is as harmful as downtime, making user experience a critical reliability metric.

What This Means for You: Organizations should extend their performance monitoring strategies to include experience level objectives (XLOs) that keep digital interactions fast and seamless. Proactive performance tuning and real-time observability can mitigate the impact of "slow" on end users.
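As a rough illustration of what an experience level objective might look like in practice, the sketch below checks whether enough measured page loads were "fast enough." The report does not define a specific XLO format, so the class name, the 2000 ms threshold, the 95% target, and the sample timings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperienceObjective:
    """Hypothetical XLO: a share of interactions must finish within a time limit."""
    name: str
    threshold_ms: float   # an interaction counts as "fast" at or below this
    target_ratio: float   # required share of fast interactions, 0.0 to 1.0


def fast_ratio(samples_ms, threshold_ms):
    """Return the share of samples at or below the threshold."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    fast = sum(1 for sample in samples_ms if sample <= threshold_ms)
    return fast / len(samples_ms)


def evaluate(objective, samples_ms):
    """Compare the measured fast share against the objective's target."""
    ratio = fast_ratio(samples_ms, objective.threshold_ms)
    return {
        "objective": objective.name,
        "fast_ratio": ratio,
        "met": ratio >= objective.target_ratio,
    }


# Assumed values for illustration only.
page_load = ExperienceObjective("page_load", threshold_ms=2000.0, target_ratio=0.95)
samples = [850, 1200, 1900, 2400, 990, 1500, 3100, 1100, 1750, 1300]

result = evaluate(page_load, samples)
print(result)
```

In this made-up sample, every request succeeded, so a pure uptime check would pass, yet two of the ten page loads exceeded the threshold and the objective is missed. That is the "slow is the new down" situation expressed as a measurable target.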

Toil Levels Are Rising Despite AI

After years of decline, toil (the manual, repetitive tasks that consume engineering resources) has ticked upward. The median reported percentage of work spent on toil rose from 25% in 2024 to 30%, leading us to ask whether AI is filling our time with more operational workload instead of less.

Why It Matters: This hypothesis suggests that while AI is improving specific workflows, it hasn't eliminated the burden of toil. Teams should evaluate their AI implementations to ensure they target high-impact areas and actively reduce manual effort. As Laura de Vesine, one of this year's report contributors, put it: AI is at best "a co-worker you can't trust." Even as AI tools become more integrated into workflows, human oversight and intervention remain critical to ensure these tools don't inadvertently add to the complexity of tasks.
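Teams that want to compare themselves against the reported median could track toil as a share of working time. The sketch below is one minimal way to do that; the engineer names and hours are invented for the example, and the report does not prescribe any particular way to measure toil.

```python
from statistics import median


def toil_share(toil_hours, total_hours):
    """Return toil as a fraction of total working hours."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


# Hypothetical weekly data: (hours spent on toil, total hours worked).
team_week = {
    "engineer_a": (12, 40),
    "engineer_b": (10, 40),
    "engineer_c": (16, 40),
}

shares = [toil_share(toil, total) for toil, total in team_week.values()]
team_median = median(shares)

# Year-over-year change in the report's median figures (25% to 30%).
previous, current = 0.25, 0.30
point_change = (current - previous) * 100                 # percentage points
relative_change = (current - previous) / previous * 100   # percent

print(f"team median toil share: {team_median:.0%}")
print(f"change: {point_change:.0f} points ({relative_change:.0f}% relative)")
```

Tracking the same figure before and after an AI tool is introduced gives a team a direct way to check whether the tool is reducing manual effort or adding to it.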

Organizational Priorities Under Pressure

The tension between agility and stability persists. Over two-thirds of respondents reported feeling pressured to prioritize release schedules over reliability, highlighting the ongoing challenge of balancing speed with resilience.

Takeaway: Building a culture that values reliability alongside agility requires clear communication and alignment on priorities. Teams should integrate reliability metrics into performance evaluations and emphasize the long-term benefits of stable releases for both IT and the business.

Monitoring Tools: More Is More

The report found that most organizations use between 2 and 10 monitoring or observability tools, showing a "value over cost" mindset for effective oversight across complex technology stacks.

What This Means for You: While multiple tools can provide comprehensive coverage, they also introduce complexity. Organizations should focus on integrating these tools to provide unified visibility and actionable insights without overwhelming their teams.

AI Training Universally in High Demand, but Time-Constrained

As AI continues to shape the SRE landscape, 30% of respondents prioritized technical training on AI — a strong indicator of the desire to upskill. However, the top sentiment (37%) reflected caution, as teams balance enthusiasm for AI with practical implementation concerns.

Takeaway: Providing targeted, hands-on training programs can help bridge the knowledge gap and build confidence in AI's capabilities. Organizations should also set realistic expectations for AI adoption, ensuring a smooth transition into daily workflows.

Incidents Are a Certainty

Incident response remains a universal challenge, with 40% of respondents handling between 1 and 5 incidents in the last 30 days. Notably, incident management is a shared responsibility, with higher-level managers as involved as individual contributors.

Why This Matters: Teams should adopt a collaborative approach to incident response, leveraging diverse perspectives to address issues effectively. Implementing clear incident playbooks and blameless post-mortem practices can further enhance preparedness and learning.

Misalignment on Reliability Priorities

While the overall responses paint a positive picture of reliability practices, significant gaps emerge when the responses are analyzed by managerial responsibility. Misalignment on priorities and approaches remains a challenge.

Takeaway: Bridging this IT-to-business gap starts with acknowledging that it exists. Ongoing dialogue, alignment across all levels of the organization, and regularly revisiting and communicating reliability goals can help ensure everyone is pulling in the same direction.

Ownership and Action in SRE

The report shows just how important it is to connect technical work with the bigger picture. It all comes down to teams knowing how their efforts make a real difference and taking thoughtful steps to grab the opportunities in front of them. This year's report sheds light on the ongoing challenges that need attention, like making reliability a part of release planning, giving teams the tools and training they need to tackle incidents smoothly, and getting everyone on the same page, from leadership to contributors.

When it comes to AI, the focus should be on using it in practical ways that actually make work easier rather than more complicated. Building resilience and reliability isn't just about technical know-how. It's about clear goals, teamwork, and always looking for ways to improve. Companies that see SRE as a way to drive real outcomes, rather than just a set of technical tasks, will be in a great spot to succeed as the digital world keeps getting more complex and fast-paced.

Leo Vasiliou is Director of Product Marketing at Catchpoint

