How Site Reliability Has Progressed in the Last 2 Years

July 20, 2022

Emily Arnott

Blameless

In the last two years, site reliability engineering, more popularly known as SRE, has progressed and matured as both an engineering practice and a function. There have been significant changes, not only in tool usage but also in people and process, and these begin with a shift in culture or mindset. Cloud-native, microservices-driven architecture has complicated the discipline, yet it has also enabled us to live an all-digital existence with continuous updates and new capabilities.

From Monitoring to Observing

Ten years ago, there was a big emphasis on infrastructure monitoring across the industry. In some cases, dev and ops worked in silos and didn't communicate very much at all. Processes weren't codified, and ops lacked visibility into the code base. Operations moved much more slowly, and it was definitely more frustrating for both engineers and end users.

The dawn of DevOps inspired teams to break down silos, automate workflows, and better communicate. Simultaneously, APM (Application Performance Management) led the evolution of democratizing metrics monitoring. With that, performance expectations increased and we saw the beginnings of emphasizing the application or front-end user experience. That translated to a more avid focus on response times, error rates and availability.

APM adoption really took off during the big cloud migration. Shifting from monolithic to microservices-based architectures, teams started to lean on third-party cloud providers and to outsource their monitoring as well.

Duh! It's All About the End-User Experience

Meanwhile, this set the stage for site reliability engineering (SRE), Google's best-kept secret until 2016, when the SRE handbook was published. The practice had been developed internally for a decade, and it was a natural time to introduce SRE principles into the ITIL framework. SRE takes a prescriptive approach to DevOps, with its main mantra being: your key objectives should aim toward user happiness as the end goal.

This new practice has inspired a cultural shift for engineering infrastructure teams that I've never quite seen before. What gets me most excited is the data-driven approach to measuring the business value of reliability. I see a lot more non-engineering stakeholders invested in and paying more attention to reliability insights.

Reliability directly impacts both the top line and the bottom line. Customer loyalty, brand, and growth are impacted when a service doesn't deliver predictable reliability. At the same time, when engineers become mired in complex, unstable environments where they have to spend toilsome cycles fixing and band-aiding, it translates to attrition and a significant cost hit to the business.

Reliability: Think Outside the Engineering "Box"

Reliability is categorically a business metric. It encompasses more than product measures, and I believe companies should view it in a more holistic manner. To understand reliability, we need to look at what's happening "on the front lines."

How often do we find out about a performance issue from a customer support ticket?

What type of feedback do we receive from customers?

How's our reputation in the market?

We pose these questions to the Support Team, Customer Success, Marketing, and Sales. In doing so, we actually start to treat reliability as a measure of the health of the business. It's both a leading and lagging indicator.

To take this a step further, SRE helps DevOps teams take a proactive approach by surfacing the right data insights, so they can take preventative steps and keep an issue from worsening. If teams have time to respond and react, they are happier, and the customer either never knows about the issue or carries on with their loyal usage.

Before, maybe 5-7 years back, we thought metrics monitoring and visibility across DevOps was enough. It wasn't. We built telemetry-based dashboards and collected all relevant data points, but we didn't go far enough: we weren't prioritizing different parts of the service or proactively setting a metric to work toward. Without a planned number or indicator, we didn't know how to progress or improve over time.

Using the principles of SRE, we're more diligent about continuously learning and improving. We do better at documenting what happened, with retrospective reports that tell us how the system is behaving. By aggregating that data over time, we learn how it's advancing. We are better challenged to identify gaps and any single points of failure. Often, fixing one issue or one part of the service doesn't fix the entire system.

When teams come together and agree on which specific parts of the service are absolutely critical and how to escalate when something goes awry, they get a clear sense of what's important and where to focus. This is now popularized by service level objectives (SLOs), which are essentially KPIs for engineers that can be communicated up, down, and across the functions of any business.
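As a rough illustration of what "a planned number to work toward" can look like, here is a minimal Python sketch of an availability SLO and its error budget. The `SLO` class, the function names, the `checkout-availability` example, and the request counts are all hypothetical, chosen only for illustration; they are not taken from any particular tool, and real SLO tooling typically adds time windows and alerting on top of this arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability objective for one critical part of a service."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget at all, so reject it.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability(good: int, total: int) -> float:
    """The indicator itself: the fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # no traffic in the window: treat as fully available
    return good / total


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    slo = SLO(name="checkout-availability", target=0.999)
    # 500 failed requests out of 1,000,000 against a budget of about 1,000.
    print(f"{slo.name}: availability={availability(999_500, 1_000_000):.4%}")
    print(f"budget remaining={error_budget_remaining(slo, 999_500, 1_000_000):.0%}")
```

The remaining budget is a single number that can be shared with non-engineering stakeholders: when it trends toward zero, the team has an agreed signal to slow releases and focus on reliability work.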

When Should a Team Build Incident Management Rigor?

Smaller teams in growing organizations tend to be more proactive. A modern approach to tech tools and a lack of legacy environments definitely make this easier. However, the tendency is to just do the basics and then move on with the busy, innovative work. Important steps are often missed, such as updating runbooks or creating useful dashboards and reports. Retrospectives (post-mortems) are a huge learning opportunity, yet this step is sometimes skipped or only conducted for Sev0 or Sev1 incidents.

Because reliability is constantly evolving, we recommend teams dedicate a percentage of their time to codifying and improving process and tooling. In fact, part of the SRE function is dedicated to doing just that. All too often, SRE teams are pulled into coding work or managing infrastructure in advance of, say, a big release, which takes time away from the operational rigor that's critical to reliability.

Ultimately, reliability is a journey and knocking down key milestones is a long-haul investment. It's daunting for young, fast-paced teams to bite off too much, especially when there's a high velocity release schedule. All organizational aspects and team workload must be factored into the plan plus the existing maturity and team skill set.

Hiring dedicated, seasoned SREs is difficult, so establishing in-house enablement and training is a better path to success. Learning on the job, which is unique to your environment, is irreplaceable and cannot be easily taught without daily practice.

Take a baby-steps, milestone-based approach each quarter, and you will quickly start to see results.

If your team is bigger and in an established organization, you need to consider migrating from older practices and tools. Taking a staged approach requires program and project management and therefore dedicated time from existing or new team members. Without executive investment and sponsorship, it will be an uphill climb. The reality is that customer expectations are no different for a smaller service provider compared to a large enterprise company with deeper resources. Modernizing DevOps teams is the only path forward.

Culture Is the Great Enabler

SRE is markedly different because of its emphasis on culture. There's a very specific type of culture that SRE teaches us to adopt. I like to think of it as twofold.

First, you want everyone in the organization to put the customer first. In other words, you're successful in your job when the customer is happy.

Second, trust and believe that incidents and failures stem from systemic problems that require systemic solutions. The first part about focusing on the customer is hugely important to understanding why we should care about reliability in the first place. We're all here to create cutting-edge tech that simplifies and streamlines, absolutely. But ultimately we're also here to provide a service and be useful. If we can get the business and engineering functions to align on the same goals, speak the same language, and humanize our processes more, we're on the right track.

We're making great progress. This year, I'm looking forward to a few trends manifesting in our day-to-day experiences. I want to have engineers exposed to the front lines. Connect them with customers much more often. Let them hear what customers are saying and asking for.

Also, I hope to see go-to-market teams get close to engineers, understand the complexities of their products, and observe them as they manage tradeoffs. Whichever side of the company we're on, we're all dealing with prioritization in our day-to-day. Everyone's doing it. Adopting a holistic view and witnessing how our functions play their part in a greater ecosystem is a game-changer.

Third-party research from leading provider Gartner states:

"IT organizations struggle to demonstrate the business value of I&O [infrastructure and operations]. As a result, business leaders often see I&O as a cost center rather than an enabler of business value. There's plenty of work to be done. Recommendations are wide-ranging and include:

■ Use business and IT operations outcomes and metrics that stakeholders will understand and appreciate

■ Facilitate regular communications that highlight progress toward the identified outcomes and metrics."

For every engineering team that we work with, we always ask about the cultural mindset and how goals map to the business needs. Improving the entire incident management process is unquestionably important, but even more critical is how we translate the outcomes of that learning to other business stakeholders in order to achieve a resilient culture.

Takeaways

■ APM (Application Performance Management) led the evolution of democratizing metrics monitoring, which translated to a more avid focus on response times, error rates and availability.

■ Reliability directly impacts both the top line and the bottom line. Customer loyalty, brand, and growth are impacted when a service doesn't deliver predictable reliability.

■ Reliability is categorically a business metric that encompasses more than product measures, and should be viewed in a more holistic manner by companies.

■ Because reliability is constantly evolving, teams should dedicate a percentage of their time to codifying and improving process and tooling.

■ Use business and IT operations outcomes and metrics that stakeholders will understand and facilitate regular communications that highlight progress toward the identified outcomes and metrics.

Emily Arnott is Community Relations Manager at Blameless


