As Influence Grows, SREs Still Face Many Challenges

Andreas Grabner

Years from now, the development community could look back and view this period as the beginning of a golden era, thanks in part to business managers' embrace of site reliability engineering (SRE).

SRE has quickly become a mainstay of modern software development at some of the most scalable and reliable services on the planet, including Airbnb, Netflix and Google, the last of which kicked off the global discussion and adoption of SRE. The leading-edge development practices and principles core to SRE grew out of the need to produce more reliable and resilient sites that perform better and are more secure, while reducing the costs and the public opprobrium associated with outages and security breaches.


Despite that success, a recent survey of 450 site reliability engineers reveals that adoption of SRE at many companies is often stalled by a range of challenges that include the following:

■ The plodding pace at some stages of SRE evolution.

■ Confusion about which service-level objectives (SLOs) should be used to measure success.

■ Automation that still requires large chunks of time from site reliability engineers (SREs) to create and maintain code.

■ Lack of vital resources, such as observability tools.

SRE Has Won Over Numerous Companies

Overall, the climate surrounding SRE is extremely positive. Many companies have embraced SRE practices, the survey indicates. Nearly 90% of respondents said that an SRE's role in achieving business success is more recognized today than three years ago. And only 6% of the SREs polled described their companies as immature in terms of SRE adoption.

Additionally, SREs have become more influential within their companies. Half of those surveyed reported that they now dedicate a significant amount of time to influencing architectural design as part of efforts to improve reliability.

More than two-thirds (68%) of SREs said they expect their role in security to become even more central in the future as organizations continue using third-party libraries for cloud-native application development.

Effort to Usher in SRE Often Stymied by Lack of Resources

But not all companies have made this kind of progress. An overwhelming majority of the SREs surveyed (97%) said efforts to implement a dedicated SRE practice at their organizations continue to face obstacles, such as being unable to hire new staff or to upskill existing teams.

Some of the reasons given for the lack of resources include the following:

■ 59% of respondents said companies perceive it as difficult to train or retrain existing ITOps team members or sysadmins to become SREs.

■ 51% said SREs are often believed to be expensive and difficult to hire.

■ 43% said finding SRE skills in the market isn't easy.

Even companies committed to SRE run into trouble, especially when it comes time to measure SRE impact.

Confusion Surrounds SLOs

A majority of respondents (81%) said their companies created objectives and key results (OKRs) and key performance indicators (KPIs) to evaluate service levels for applications and infrastructure, while 75% said they rely on service-level objectives (SLOs). Surprisingly, 99% of those polled said establishing SLOs is challenging.

Some of the reasons given for the difficulty include the following:

■ Too many data sources are involved, which can hobble attempts to synthesize disparate data (64%).

■ An overabundance of metrics makes it difficult to find the most relevant measures to use (54%).

■ A lack of monitoring tools prevents the accurate tracking and defining of SLO performance history (36%).

■ An inability to determine what constitutes a quality SLO (22%).
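
For readers less familiar with how an SLO is evaluated in practice, the following is a minimal Python sketch rather than anything drawn from the survey. It assumes a simple request-based availability SLO; the function name evaluate_slo, the SloReport fields, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # observed fraction of good requests in the window
    error_budget: float      # number of bad requests the target allows
    budget_remaining: float  # fraction of the error budget still unspent


def evaluate_slo(good: int, total: int, target: float) -> SloReport:
    """Compare an observed availability SLI against an SLO target.

    good   -- requests that met the success criterion (hypothetical input)
    total  -- all requests in the measurement window
    target -- SLO target as a fraction, e.g. 0.999 for "three nines"
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    sli = good / total
    allowed_bad = (1.0 - target) * total   # the error budget, in requests
    actual_bad = total - good
    # Negative values mean the budget has been overspent.
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, error_budget=allowed_bad, budget_remaining=remaining)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 500 of them failed.
    report = evaluate_slo(good=999_500, total=1_000_000, target=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"Error budget (requests): {report.error_budget:.0f}")
    print(f"Budget remaining: {report.budget_remaining:.1%}")
```

Even in this simplified form, the sketch shows why the obstacles listed above matter: the calculation is only as trustworthy as the good and total counts behind it, and those depend on consistent data sources and a tracked performance history.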

In the same way that many of the challenges confronting SREs were revealed in the survey, some of the solutions were unearthed there as well.

Automation Is Beneficial ... Sometimes

Because SREs oversee a growing number of tasks, their time is increasingly stretched thin, according to the report. Automation is key to freeing up more of their time and allowing them to focus on value-added activities, such as designing experiments and running tests to reduce the risk of production failure, or ensuring that security vulnerabilities are detected.

According to 85% of the SREs surveyed, the ability to scale SRE practices further across the organization is extremely dependent on the availability of automation and AI capabilities.

In addition, 71% said their companies have increased the use of automation across every part of the development lifecycle, and 61% said automation has a major impact on reducing security vulnerabilities. At a time when threat actors are increasingly sophisticated and ransomware attacks are a plague on business, SREs have begun to drive increased adoption of DevSecOps practices to ensure security is top of mind at every stage of the development lifecycle.

SREs Need Standardized Observability

SREs and company managers also want to build unity across their tool stacks, according to the survey: 85% of SREs want to standardize on the same observability platform from development to operations and security by 2025. The goal is to create a larger number of streamlined solutions that enable SRE and DevOps teams to work together more effectively and eliminate the need to switch between various dashboards.

Among the major takeaways of The State of SRE report is that SREs need the time and resources required to foster greater reliability. Business managers should note that the success of SREs depends largely on making the most of available resources, and that means limiting the amount of time SREs spend on low-priority chores. Automation is vital to achieving this, but it can also add to the distractions if SREs spend too much time writing automation scripts.

To supply the kind of deliverables that benefit their companies the most, such as maximizing reliability, resiliency, security, performance, and eventually business outcomes, SREs require platforms that enable them to drive reliability and automation by default, through self-serve and everything-as-code capabilities.
