
Maximizing Resilience: Insights from the 2025 SRE Report

Leo Vasiliou
Catchpoint

As the digital landscape expands, the stakes for delivering reliable and seamless online experiences have never been higher. In the past year, site reliability engineering (SRE) has continued to evolve into a critical driver of operational success, shaping how organizations approach resilience, collaboration, and customer satisfaction.

The 2025 Catchpoint SRE Report dives into the forces transforming the SRE landscape, exploring both the challenges and opportunities ahead. Let's break down the key findings and what they mean for SRE professionals and the businesses relying on them.

Slow Is the New Down

Performance is about more than just uptime; it's also about speed. This year's report reveals that 53% of organizations believe poor performance is as harmful as downtime, making user experience a critical reliability metric.

What This Means for You: Organizations must elevate their performance monitoring strategies to include experience level objectives (XLOs) for ensuring fast and seamless digital interactions. Proactive performance tuning and real-time observability can mitigate the impact of "slow" on end users.
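As a rough illustration of treating "slow" like "down", the sketch below counts a request as good only if it both succeeded and finished within a latency threshold, then compares the share of good requests against a target. The report names XLOs but does not prescribe how to compute them, so the function name `evaluate_xlo`, the 2000 ms threshold, and the 95% target are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class XloResult:
    total: int         # number of requests evaluated
    good: int          # requests that succeeded AND were fast enough
    attainment: float  # good / total
    met: bool          # True when attainment reaches the target


def evaluate_xlo(samples, threshold_ms=2000.0, target=0.95):
    """Evaluate a hypothetical experience level objective.

    samples: list of (succeeded, latency_ms) tuples.
    threshold_ms and target are illustrative defaults, not values from the report.
    """
    total = len(samples)
    if total == 0:
        raise ValueError("no samples to evaluate")
    # A slow success is counted the same way as an outright failure.
    good = sum(1 for ok, latency_ms in samples if ok and latency_ms <= threshold_ms)
    attainment = good / total
    return XloResult(total, good, attainment, attainment >= target)


samples = [
    (True, 420.0),
    (True, 1850.0),
    (True, 3100.0),  # succeeded, but too slow: counted as bad
    (False, 150.0),  # failed outright: counted as bad
    (True, 900.0),
]
result = evaluate_xlo(samples)
print(f"{result.good}/{result.total} good, attainment {result.attainment:.0%}, met={result.met}")
```

An uptime-only check would score the 3100 ms request as healthy. Here it counts against the objective, in line with the finding that poor performance can be as harmful as downtime.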

Toil Levels Are Rising Despite AI

After years of decline, toil (the manual, repetitive tasks that consume engineering resources) has ticked upward. The median reported percentage of work spent on toil rose to 30%, up from 25% in 2024, leading us to ask whether AI is filling our time with more operational workload instead of less.

Why It Matters: This hypothesis suggests that while AI is improving specific workflows, it hasn't eliminated the burden of toil. Teams should evaluate their AI implementations to ensure they target high-impact areas and actively reduce manual effort. As Laura de Vesine, one of this year's report contributors, put it: AI is at best "a co-worker you can't trust." Even as AI tools become more integrated into workflows, human oversight and intervention remain critical to ensure these tools don't inadvertently add to the complexity of tasks.

Organizational Priorities Under Pressure

The tension between agility and stability persists. Over two-thirds of respondents reported feeling pressured to prioritize release schedules over reliability, highlighting the ongoing challenge of balancing speed with resilience.

Takeaway: Building a culture that values reliability alongside agility requires clear communication and alignment on priorities. Teams should integrate reliability metrics into performance evaluations and emphasize the long-term benefits of stable releases for both IT and the business.

Monitoring Tools: More Is More

The report found that most organizations use between 2 and 10 monitoring or observability tools, showing a "value over cost" mindset for effective oversight across complex technology stacks.

What This Means for You: While multiple tools can provide comprehensive coverage, they also introduce complexity. Organizations should focus on integrating these tools to provide unified visibility and actionable insights without overwhelming their teams.

AI Training Universally in High Demand, but Time-Constrained

As AI continues to shape the SRE landscape, 30% of respondents prioritized technical training on AI — a strong indicator of the desire to upskill. However, the top sentiment (37%) reflected caution, as teams balance enthusiasm for AI with practical implementation concerns.

Takeaway: Providing targeted, hands-on training programs can help bridge the knowledge gap and build confidence in AI's capabilities. Organizations should also set realistic expectations for AI adoption, ensuring a smooth transition into daily workflows.

Incidents Are a Certainty

Incident response remains a universal challenge, with 40% of respondents handling between 1 and 5 incidents in the last 30 days. Notably, incident management is a shared responsibility, with higher-level managers as involved as individual contributors.

Why This Matters: Teams should adopt a collaborative approach to incident response, leveraging diverse perspectives to address issues effectively. Implementing clear incident playbooks and blameless post-mortem practices can further enhance preparedness and learning.

Misalignment on Reliability Priorities

While the overall responses paint a positive picture of reliability practices, significant gaps emerge when analyzed by managerial responsibility. Misalignment on priorities and approaches remains a challenge.

Takeaway: Bridging this IT-to-business gap starts with acknowledging that it exists. Ongoing dialogue, alignment across all levels of the organization, and regularly revisiting and communicating reliability goals can help ensure everyone is pulling in the same direction.

Ownership and Action in SRE

The report shows just how important it is to connect technical work with the bigger picture. It all comes down to teams knowing how their efforts make a real difference and taking thoughtful steps to grab the opportunities in front of them. This year's report sheds light on the ongoing challenges that need attention, like making reliability a part of release planning, giving teams the tools and training they need to tackle incidents smoothly, and getting everyone on the same page, from leadership to contributors.

When it comes to AI, the focus should be on using it in practical ways that actually make work easier rather than more complicated. Building resilience and reliability isn't just about technical know-how. It's about clear goals, teamwork, and always looking for ways to improve. Companies that see SRE as a way to drive real outcomes, rather than just a set of technical tasks, will be in a great spot to succeed as the digital world keeps getting more complex and fast-paced.

Leo Vasiliou is Director of Product Marketing at Catchpoint

