How Site Reliability Has Progressed in the Last 2 Years

Emily Arnott
Blameless

In the last two years, site reliability engineering, more popularly known as SRE, has progressed and matured as both an engineering practice and a function. There have been significant changes, not only in tool usage but also in people and processes, beginning with a shift in culture and mindset. Cloud-native, microservices-driven architecture has complicated the discipline, yet it has also enabled us to live an all-digital existence with continuous updates and new capabilities.

From Monitoring to Observing

Ten years ago, there was a big emphasis on infrastructure monitoring across the industry. In some cases, dev and ops worked in silos and hardly communicated at all. Processes weren't codified, and ops lacked visibility into the code base. Operations moved much more slowly, and the experience was more frustrating for both engineers and end users.

The dawn of DevOps inspired teams to break down silos, automate workflows, and communicate better. Simultaneously, APM (Application Performance Management) led the way in democratizing metrics monitoring. With that, performance expectations increased and we saw the beginnings of an emphasis on the application and front-end user experience. That translated to a sharper focus on response times, error rates, and availability.
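
To make those signals concrete, here is a minimal sketch, not from the original article, of how a service might expose response time and error rate metrics using the open-source prometheus_client library for Python. The metric names, port, and simulated request handler are illustrative assumptions, not a prescribed setup.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Basic signals the article mentions: response time and error rate.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Time spent handling a request"
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "Requests that ended in an error"
)
REQUESTS_TOTAL = Counter("http_requests_total", "All requests handled")


def handle_request() -> None:
    """Pretend request handler: records latency and error outcomes."""
    REQUESTS_TOTAL.inc()
    with REQUEST_LATENCY.time():          # observes duration on exit
        time.sleep(random.uniform(0.01, 0.2))
        if random.random() < 0.02:        # ~2% simulated failures
            REQUEST_ERRORS.inc()


if __name__ == "__main__":
    start_http_server(8000)               # metrics scraped at :8000/metrics
    while True:
        handle_request()
```

A dashboard or alerting rule can then derive availability as the ratio of non-error requests to total requests over a window, which is the same arithmetic an SLO formalizes later in this piece.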

APM adoption really took off during the big cloud migration. As teams shifted from monolithic to microservices architectures, they started to lean on third-party cloud providers and outsourced their monitoring as well.

Duh! It's All About the End-User Experience

Meanwhile, this set the stage for site reliability engineering (SRE), Google's best-kept secret until the SRE handbook was published in 2016. Google had practiced SRE internally for over a decade, so it was a natural time to introduce SRE principles into the ITIL framework. SRE takes a prescriptive approach to DevOps, and its main mantra is that your key objectives should aim toward user happiness as the end goal.

This new practice has inspired a cultural shift for engineering infrastructure teams that I've never quite seen before. What gets me most excited is the data-driven approach to measuring the business value of reliability. I see a lot more non-engineering stakeholders invested in and paying more attention to reliability insights.

Reliability directly impacts both the top line and the bottom line. Customer loyalty, brand, and growth suffer when a service doesn't deliver predictable reliability. At the same time, when engineers become mired in complex, unstable environments where they have to spend toilsome cycles on fixes and band-aids, it translates to attrition and a significant cost hit to the business.

Reliability: Think Outside the Engineering "Box"

Reliability is categorically a business metric. It encompasses more than product measures, and I believe companies should view it in a more holistic manner. To understand reliability, we need to look at what's happening "on the front lines."

How often do we find out about a performance issue from a customer support ticket?

What type of feedback do we receive from customers?

How's our reputation in the market?

We pose these questions to the Support Team, Customer Success, Marketing, and Sales. In doing so, we actually start to treat reliability as a measure of the health of the business. It's both a leading and lagging indicator.

To take this a step further, SRE helps DevOps teams take a proactive approach, surfacing the right data insights so they can take preventative steps and keep an issue from worsening. If teams have time to respond and react, they are happier, and the customer either never notices or carries on with their loyal usage.

Five to seven years ago, we thought metrics monitoring and visibility across DevOps was enough. It wasn't. We built telemetry-based dashboards and collected all the relevant data points, but we stopped short of prioritizing different parts of the service and proactively setting a target metric to work toward. Without a planned number or indicator, we didn't know how to progress or improve over time.

Using the principles of SRE, we're more diligent about continuously learning and improving. We do better at documenting what happened with retrospective reports that show how the system is behaving, and by aggregating that data over time, we learn how it's advancing. We're also better positioned to identify gaps and single points of failure. Often, fixing one issue or one part of the service doesn't restore the health of the entire system.
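
As a hypothetical illustration of aggregating retrospective data over time, the sketch below takes simple incident records and computes incident count and mean time to resolve (MTTR) per quarter. The Incident fields and sample data are invented for the example, not drawn from any real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    opened: datetime
    resolved: datetime
    severity: str


def quarter(dt: datetime) -> str:
    """Label a timestamp with its calendar quarter, e.g. '2022-Q1'."""
    return f"{dt.year}-Q{(dt.month - 1) // 3 + 1}"


def mttr_by_quarter(incidents: list[Incident]) -> dict[str, dict]:
    """Group incidents by quarter and report count plus MTTR in minutes."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for inc in incidents:
        buckets[quarter(inc.opened)].append(
            (inc.resolved - inc.opened).total_seconds() / 60
        )
    return {
        q: {"count": len(durs), "mttr_minutes": round(mean(durs), 1)}
        for q, durs in sorted(buckets.items())
    }


incidents = [
    Incident(datetime(2022, 1, 10, 9), datetime(2022, 1, 10, 11), "sev1"),
    Incident(datetime(2022, 2, 3, 14), datetime(2022, 2, 3, 14, 45), "sev2"),
    Incident(datetime(2022, 4, 20, 8), datetime(2022, 4, 20, 8, 30), "sev2"),
]
print(mttr_by_quarter(incidents))
# {'2022-Q1': {'count': 2, 'mttr_minutes': 82.5}, '2022-Q2': {'count': 1, 'mttr_minutes': 30.0}}
```

Even a simple roll-up like this makes trends visible quarter over quarter, which is what turns individual retrospectives into evidence of whether the system is actually improving.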

When teams come together and agree on which specific parts of the service are absolutely critical and how to escalate when something goes awry, they get a clear sense of what's important and where to focus. This practice has been popularized by SLOs, which are essentially KPIs for engineers that can be communicated up, down, and across the functions of any business.
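
To show what an SLO looks like as a concrete, trackable number, here is a small illustrative calculation assuming a hypothetical 99.9% availability target over a 30-day window. The target and request counts are made up for the example.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    allowed_bad = (1 - slo_target) * total   # failures the SLO tolerates
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad if allowed_bad else float("-inf")


# Hypothetical 30-day window: 10,000,000 requests and a 99.9% availability SLO
# allow 10,000 failed requests; 4,000 actual failures leave 60% of the budget.
print(error_budget_remaining(0.999, good=9_996_000, total=10_000_000))  # 0.6
```

In practice, many teams pair a number like this with an error-budget policy: when the remaining budget nears zero, effort shifts from feature work to reliability work, which is what makes the SLO a shared KPI rather than an engineering-only metric.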

When Should a Team Build Incident Management Rigor?

Smaller teams in growing organizations tend to be more proactive. A modern approach to tooling and the lack of legacy environments definitely make this easier. However, the tendency is to do just the basics and then move on to the busy, innovative work. Important steps are often missed, such as updating runbooks or creating useful dashboards and reports. Retrospectives (post-mortems) are a huge learning opportunity, yet this step is sometimes skipped or only conducted for Sev0 or Sev1 incidents.

Because reliability is a constantly evolving practice, we recommend teams dedicate a percentage of their time to codifying and improving process and tooling. In fact, part of the SRE function is dedicated to doing just that. All too often, SRE teams are pulled into coding work or managing infrastructure ahead of, say, a big release, which takes time away from the operational rigor that's critical to reliability.

Ultimately, reliability is a journey, and knocking down key milestones is a long-haul investment. It's easy for young, fast-paced teams to bite off too much, especially with a high-velocity release schedule. The plan must factor in all organizational aspects and team workload, plus the team's existing maturity and skill set.

Hiring dedicated, seasoned SREs is difficult, so establishing in-house enablement and training is a better path to success. Learning on the job, in an environment that is unique to yours, is irreplaceable and can't easily be taught without daily practice.

Take a baby-steps-and-milestones approach each quarter, and you will quickly start to see results.

If your team is bigger and part of an established organization, you need to consider migrating away from older practices and tools. Taking a staged approach requires program and project management, and therefore dedicated time from existing or new team members. Without executive investment and sponsorship, it will be an uphill climb. The reality is that customer expectations are no different for a smaller service provider than for a large enterprise with deeper resources. Modernizing DevOps teams is the only path forward.

Culture Is the Great Enabler

SRE is markedly different because of its emphasis on culture. There's a very specific type of culture that SRE teaches us to adopt. I like to think of it as twofold.

First, you want everyone in the organization to put the customer first. In other words, you're successful in your job when the customer is happy.

Second, trust and believe that incidents and failures stem from systemic problems that require systemic solutions.

The first part about focusing on the customer is hugely important to understanding why we should care about reliability in the first place. We're all here to create cutting-edge tech that simplifies and streamlines, absolutely. But ultimately we're also here to provide a service and be useful. If we can get the business and engineering functions to align on the same goals, speak the same language, and humanize our processes more, we're on the right track.

We're making great progress. This year, I'm looking forward to a few trends manifesting in our day-to-day experiences. I want to have engineers exposed to the front lines. Connect them with customers much more often. Let them hear what customers are saying and asking for.

Also, I hope to see go-to-market teams get close to engineers, understand the complexities of their products, and observe them as they manage tradeoffs. Whichever side of the company, we're all dealing with prioritization in our day-to-day. Everyone's doing it. Adopting a holistic view and witnessing how our functions play their part in a greater ecosystem is a game-changer.

Third-party research from Gartner states:

"IT organizations struggle to demonstrate the business value of I&O [infrastructure and operations]. As a result, business leaders often see I&O as a cost center rather than an enabler of business value. There's plenty of work to be done. Recommendations are wide-ranging and include:

■ Use business and IT operations outcomes and metrics that stakeholders will understand and appreciate

■ Facilitate regular communications that highlight progress toward the identified outcomes and metrics."

For every engineering team that we work with, we always ask about the cultural mindset and how goals map to the business needs. Improving the entire incident management process is unquestionably important, but even more critical is how we translate the outcomes of that learning to other business stakeholders in order to achieve a resilient culture.

Takeaways

■ APM (Application Performance Management) led the way in democratizing metrics monitoring, which translated to a sharper focus on response times, error rates, and availability.

■ Reliability directly impacts both the top line and the bottom line. Customer loyalty, brand, and growth suffer when a service doesn't deliver predictable reliability.

■ Reliability is categorically a business metric that encompasses more than product measures, and should be viewed in a more holistic manner by companies.

■ Because reliability is a constantly evolving practice, teams should dedicate a percentage of their time to codifying and improving process and tooling.

■ Use business and IT operations outcomes and metrics that stakeholders will understand and facilitate regular communications that highlight progress toward the identified outcomes and metrics.

Emily Arnott is Community Relations Manager at Blameless
