Maximizing Resilience: Insights from the 2025 SRE Report

Leo Vasiliou
Catchpoint

As the digital landscape expands, the stakes for delivering reliable and seamless online experiences have never been higher. In the past year, site reliability engineering (SRE) has continued to evolve into a critical driver of operational success, shaping how organizations approach resilience, collaboration, and customer satisfaction.

The 2025 Catchpoint SRE Report dives into the forces transforming the SRE landscape, exploring both the challenges and opportunities ahead. Let's break down the key findings and what they mean for SRE professionals and the businesses relying on them.

Slow Is the New Down

Performance is about more than just uptime; it's also about speed. This year's report reveals that 53% of organizations believe poor performance is as harmful as downtime, making user experience a critical reliability metric.

What This Means for You: Organizations should extend their performance monitoring strategies to include experience level objectives (XLOs) that target fast and seamless digital interactions. Proactive performance tuning and real-time observability can mitigate the impact of "slow" on end users.
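One way to picture an experience level objective is as a required share of user interactions that finish within a time budget. The Python sketch below is illustrative only: the report does not prescribe a formula, and the `ExperienceObjective` class, the 2,500 ms threshold, the 95% target, and the sample timings are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperienceObjective:
    """Hypothetical XLO: a share of interactions must finish within a time budget."""
    name: str
    threshold_ms: float  # an interaction slower than this counts as "slow"
    target_ratio: float  # required share of fast interactions, e.g. 0.95


def fast_ratio(samples_ms, threshold_ms):
    """Return the share of samples at or under the threshold (0.0 if no samples)."""
    if not samples_ms:
        return 0.0
    fast = sum(1 for sample in samples_ms if sample <= threshold_ms)
    return fast / len(samples_ms)


def evaluate(objective, samples_ms):
    """Compare the observed share of fast interactions with the objective's target."""
    ratio = fast_ratio(samples_ms, objective.threshold_ms)
    return {
        "objective": objective.name,
        "fast_ratio": ratio,
        "met": ratio >= objective.target_ratio,
    }


# Illustrative numbers only: ten page loads, every one of which succeeded.
page_load = ExperienceObjective("page_load", threshold_ms=2500, target_ratio=0.95)
samples = [800, 1200, 950, 3100, 1500, 2700, 1100, 900, 1300, 1000]

result = evaluate(page_load, samples)
print(result)
```

In this made-up window every request completed, so an uptime check would report no problem, yet two of the ten loads exceeded the budget and the objective is missed. That is the gap between "up" and "fast enough" that the finding above describes.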

Toil Levels Are Rising Despite AI

After years of decline, toil (the manual, repetitive tasks that consume engineering resources) has ticked upward. The median reported percentage of work spent on toil rose to 30%, up from 25% in 2024, prompting us to ask whether AI is filling our time with more operational workload instead of less.

Why It Matters: If this hypothesis holds, AI is improving specific workflows but has not eliminated the burden of toil. Teams should evaluate their AI implementations to ensure they target high-impact areas and actively reduce manual effort. As Laura de Vesine, one of this year's report contributors, put it: AI is at best "a co-worker you can't trust." Even as AI tools become more integrated into workflows, human oversight and intervention remain critical to ensure these tools don't inadvertently add to the complexity of tasks.
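To put the two toil medians in perspective, the short Python calculation below separates the change in percentage points from the relative change and translates both medians into hours. The 25% and 30% figures come from the report as quoted above; the 40-hour work week and the `toil_shift` helper are assumptions added purely for illustration.

```python
def toil_shift(previous_pct, current_pct, weekly_hours):
    """Summarize a change in the share of work time spent on toil."""
    point_change = current_pct - previous_pct
    return {
        # Difference between the two shares, in percentage points.
        "point_change": point_change,
        # The same difference expressed relative to the earlier share.
        "relative_change_pct": point_change * 100 / previous_pct,
        # Hours per week at each share, for the assumed work week.
        "previous_hours": weekly_hours * previous_pct / 100,
        "current_hours": weekly_hours * current_pct / 100,
    }


# 25 and 30 are the medians quoted in the article; 40 hours is an assumption.
shift = toil_shift(previous_pct=25, current_pct=30, weekly_hours=40)
print(shift)
```

Under that assumed 40-hour week, a five-point rise is the difference between roughly 10 and 12 hours of toil per week, a 20% relative increase. These are medians across respondents, so the numbers describe a typical case rather than any individual team.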

Organizational Priorities Under Pressure

The tension between agility and stability persists. Over two-thirds of respondents reported feeling pressured to prioritize release schedules over reliability, highlighting the ongoing challenge of balancing speed with resilience.

Takeaway: Building a culture that values reliability alongside agility requires clear communication and alignment on priorities. Teams should integrate reliability metrics into performance evaluations and emphasize the long-term benefits of stable releases for both IT and the business.

Monitoring Tools: More Is More

The report found that most organizations use between 2 and 10 monitoring or observability tools, showing a "value over cost" mindset for effective oversight across complex technology stacks.

What This Means for You: While multiple tools can provide comprehensive coverage, they also introduce complexity. Organizations should focus on integrating these tools to provide unified visibility and actionable insights without overwhelming their teams.

AI Training Universally in High Demand, but Time-Constrained

As AI continues to shape the SRE landscape, 30% of respondents prioritized technical training on AI — a strong indicator of the desire to upskill. However, the top sentiment (37%) reflected caution, as teams balance enthusiasm for AI with practical implementation concerns.

Takeaway: Providing targeted, hands-on training programs can help bridge the knowledge gap and build confidence in AI's capabilities. Organizations should also set realistic expectations for AI adoption, ensuring a smooth transition into daily workflows.

Incidents Are a Certainty

Incident response remains a universal challenge, with 40% of respondents handling between 1 and 5 incidents in the last 30 days. Notably, incident management is a shared responsibility, with higher-level managers as involved as individual contributors.

Why This Matters: Teams should adopt a collaborative approach to incident response, leveraging diverse perspectives to address issues effectively. Implementing clear incident playbooks and blameless post-mortem practices can further enhance preparedness and learning.

Misalignment on Reliability Priorities

While the overall responses paint a positive picture of reliability practices, significant gaps emerge when the responses are broken down by managerial responsibility. Misalignment on priorities and approaches remains a challenge.

Takeaway: Bridging this IT-to-business gap starts with acknowledging that it exists. Ongoing dialogue, alignment across all levels of the organization, and regularly revisiting and communicating reliability goals can help ensure everyone is pulling in the same direction.

Ownership and Action in SRE

The report shows just how important it is to connect technical work with the bigger picture. It all comes down to teams knowing how their efforts make a real difference and taking thoughtful steps to grab the opportunities in front of them. This year's report sheds light on the ongoing challenges that need attention, like making reliability a part of release planning, giving teams the tools and training they need to tackle incidents smoothly, and getting everyone on the same page, from leadership to contributors.

When it comes to AI, the focus should be on using it in practical ways that actually make work easier rather than more complicated. Building resilience and reliability isn't just about technical know-how. It's about clear goals, teamwork, and always looking for ways to improve. Companies that see SRE as a way to drive real outcomes, rather than just a set of technical tasks, will be in a great spot to succeed as the digital world keeps getting more complex and fast-paced.

Leo Vasiliou is Director of Product Marketing at Catchpoint
