Maximizing Resilience: Insights from the 2025 SRE Report

Leo Vasiliou
Catchpoint

As the digital landscape expands, the stakes for delivering reliable and seamless online experiences have never been higher. In the past year, site reliability engineering (SRE) has continued to evolve into a critical driver of operational success, shaping how organizations approach resilience, collaboration, and customer satisfaction.

The 2025 Catchpoint SRE Report dives into the forces transforming the SRE landscape, exploring both the challenges and opportunities ahead. Let's break down the key findings and what they mean for SRE professionals and the businesses relying on them.

Slow Is the New Down

Performance is about more than just uptime; it's also about speed. This year's report reveals that 53% of organizations believe poor performance is as harmful as downtime, making user experience a critical reliability metric.

What This Means for You: Organizations must elevate their performance monitoring strategies to include experience level objectives (XLOs) for ensuring fast and seamless digital interactions. Proactive performance tuning and real-time observability can mitigate the impact of "slow" on end users.
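One way to picture an XLO is as a time budget plus a required share of interactions that must stay within it. The sketch below is a minimal illustration under that assumption, not a definition taken from the report: the `ExperienceObjective` class, the 2,500 ms threshold, the 95% target, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperienceObjective:
    """Hypothetical XLO: a share of interactions must finish within a time budget."""
    name: str
    threshold_ms: float  # a load slower than this counts as "slow"
    target_ratio: float  # required share of loads at or under the threshold


def attainment(samples_ms, objective):
    """Return the share of samples that finished within the threshold."""
    if not samples_ms:
        raise ValueError("no samples to evaluate")
    good = sum(1 for s in samples_ms if s <= objective.threshold_ms)
    return good / len(samples_ms)


def is_met(samples_ms, objective):
    """True when enough samples were fast enough to satisfy the objective."""
    return attainment(samples_ms, objective) >= objective.target_ratio


# Made-up numbers, for illustration only.
checkout_xlo = ExperienceObjective("checkout page load", threshold_ms=2500, target_ratio=0.95)
page_loads_ms = [900, 1200, 1800, 2400, 3100, 1500, 2000, 2600, 1100, 1300]

share = attainment(page_loads_ms, checkout_xlo)
print(f"{share:.0%} within budget; met: {is_met(page_loads_ms, checkout_xlo)}")
# prints "80% within budget; met: False"
```

Every request in this made-up sample succeeded, so an uptime check would report a healthy service, yet the objective is missed because two of the ten loads were slow.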

Toil Levels Are Rising Despite AI

After years of decline, toil (the manual, repetitive tasks that consume engineering resources) has ticked upward. The median reported share of work spent on toil rose to 30%, up from 25% in 2024, which leads us to ask whether AI is filling our time with more operational workload rather than less.

Why It Matters: This hypothesis suggests that while AI is improving specific workflows, it hasn't eliminated the burden of toil. Teams should evaluate their AI implementations to ensure they target high-impact areas and actively reduce manual effort. As Laura de Vesine, one of this year's report contributors, put it, AI is at best "a co-worker you can't trust." Even as AI tools become more integrated into workflows, human oversight and intervention remain critical to ensure these tools don't inadvertently add to the complexity of tasks.
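To put the reported rise in median toil from 25% to 30% in concrete terms, the short calculation below separates the percentage-point change from the relative change. The 40-hour work week is an assumption made here for illustration, and `toil_change` is a hypothetical helper; neither comes from the report.

```python
def toil_change(previous_pct, current_pct, hours_per_week=40.0):
    """Compare two toil percentages.

    Returns (percentage-point change, relative change, extra hours per week).
    hours_per_week defaults to an assumed 40-hour week.
    """
    point_change = current_pct - previous_pct
    relative_change = point_change / previous_pct
    extra_hours = hours_per_week * point_change / 100
    return point_change, relative_change, extra_hours


points, relative, hours = toil_change(25, 30)
print(f"+{points} points, {relative:.0%} relative increase, {hours:.1f} extra hours per week")
# prints "+5 points, 20% relative increase, 2.0 extra hours per week"
```

Under that assumption, a five-point rise works out to about two more hours of toil per engineer each week.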

Organizational Priorities Under Pressure

The tension between agility and stability persists. Over two-thirds of respondents reported feeling pressured to prioritize release schedules over reliability, highlighting the ongoing challenge of balancing speed with resilience.

Takeaway: Building a culture that values reliability alongside agility requires clear communication and alignment on priorities. Teams should integrate reliability metrics into performance evaluations and emphasize the long-term benefits of stable releases for both IT and the business.

Monitoring Tools: More Is More

The report found that most organizations use between 2 and 10 monitoring or observability tools, showing a "value over cost" mindset for effective oversight across complex technology stacks.

What This Means for You: While multiple tools can provide comprehensive coverage, they also introduce complexity. Organizations should focus on integrating these tools to provide unified visibility and actionable insights without overwhelming their teams.

AI Training Universally in High Demand, but Time-Constrained

As AI continues to shape the SRE landscape, 30% of respondents prioritized technical training on AI — a strong indicator of the desire to upskill. However, the top sentiment (37%) reflected caution, as teams balance enthusiasm for AI with practical implementation concerns.

Takeaway: Providing targeted, hands-on training programs can help bridge the knowledge gap and build confidence in AI's capabilities. Organizations should also set realistic expectations for AI adoption, ensuring a smooth transition into daily workflows.

Incidents Are a Certainty

Incident response remains a universal challenge, with 40% of respondents handling between 1 and 5 incidents in the last 30 days. Notably, incident management is a shared responsibility, with higher-level managers as involved as individual contributors.

Why This Matters: Teams should adopt a collaborative approach to incident response, leveraging diverse perspectives to address issues effectively. Implementing clear incident playbooks and blameless post-mortem practices can further enhance preparedness and learning.

Misalignment on Reliability Priorities

While the overall responses paint a positive picture of reliability practices, significant gaps emerge when analyzed by managerial responsibility. Misalignment on priorities and approaches remains a challenge.

Takeaway: Bridging this IT-to-business gap starts with acknowledging that it exists. Ongoing dialogue, alignment across all levels of the organization, and regularly revisiting and communicating reliability goals can help ensure everyone is pulling in the same direction.

Ownership and Action in SRE

The report shows just how important it is to connect technical work with the bigger picture. It all comes down to teams knowing how their efforts make a real difference and taking thoughtful steps to grab the opportunities in front of them. This year's report sheds light on the ongoing challenges that need attention, like making reliability a part of release planning, giving teams the tools and training they need to tackle incidents smoothly, and getting everyone on the same page, from leadership to contributors.

When it comes to AI, the focus should be on using it in practical ways that actually make work easier rather than more complicated. Building resilience and reliability isn't just about technical know-how. It's about clear goals, teamwork, and always looking for ways to improve. Companies that see SRE as a way to drive real outcomes, rather than just a set of technical tasks, will be in a great spot to succeed as the digital world keeps getting more complex and fast-paced.

Leo Vasiliou is Director of Product Marketing at Catchpoint

