In the last two years, site reliability engineering, more popularly known as SRE, has progressed and matured as both an engineering practice and a function. There have been significant changes, not only in tool usage but also in people and process, beginning with a culture or mindset shift. Cloud-native, microservices-driven architecture has complicated the discipline, yet it has also enabled us to live an all-digital existence with continuous updates and new capabilities.
From Monitoring to Observing
Ten years ago, there was a big emphasis across the industry on infrastructure monitoring. In some cases, dev and ops worked in silos and barely communicated at all. Processes weren't codified, and ops lacked visibility into the code base. Operations moved much slower, and it was more frustrating for engineers and end-users alike.
The dawn of DevOps inspired teams to break down silos, automate workflows, and communicate better. Simultaneously, APM (Application Performance Management) led the evolution of democratizing metrics monitoring. With that, performance expectations increased, and we saw the beginnings of an emphasis on the application, or front-end, user experience. That translated to a sharper focus on response times, error rates, and availability.
APM adoption really took off during the big cloud migration. As they shifted from monolithic to microservices architectures, teams started to lean on third-party cloud providers and to outsource their monitoring as well.
Duh! It's All About the End-User Experience
Meanwhile, this set the stage for site reliability engineering (SRE), Google's best-kept secret until 2016, when the SRE handbook was published. After a decade of internal practice, the time was right to bring SRE principles into the ITIL framework. SRE takes a prescriptive approach to DevOps, and its main mantra is that your key objectives should aim toward user happiness.
This new practice has inspired a cultural shift for engineering infrastructure teams that I've never quite seen before. What gets me most excited is the data-driven approach to measuring the business value of reliability. I see a lot more non-engineering stakeholders invested in and paying more attention to reliability insights.
Reliability directly impacts both the top line and the bottom line. Customer loyalty, brand, and growth all suffer when a service doesn't deliver predictable reliability. At the same time, when engineers become mired in complex, unstable environments, spending toilsome cycles on fixes and band-aids, the result is attrition and a significant cost hit to the business.
Reliability: Think Outside the Engineering "Box"
Reliability is categorically a business metric. It encompasses more than product measures, and I believe companies should view it in a more holistic manner. To understand reliability, we need to look at what's happening "on the front lines."
How often do we find out about a performance issue from a customer support ticket?
What type of feedback do we receive from customers?
How's our reputation in the market?
We pose these questions to the Support Team, Customer Success, Marketing, and Sales. In doing so, we actually start to treat reliability as a measure of the health of the business. It's both a leading and lagging indicator.
To take this a step further, SRE helps DevOps teams take a proactive approach, surfacing the right data insights so they can act preventatively and stop an issue from worsening. When teams have time to respond and react, they are happier, and the customer either never notices or carries on as a loyal user.
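One common way teams operationalize this proactive stance is burn-rate alerting on an error budget: page before the budget is exhausted, not after users complain. The sketch below is a minimal illustration; the function names, the 99.9% target, and the 14.4 fast-burn threshold are illustrative assumptions, not something prescribed in this article.

```python
# Hedged sketch of a burn-rate check. All names and thresholds here are
# illustrative assumptions, not taken from the article.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'budgeted' we are burning the error budget.

    error_ratio: observed fraction of failed requests in the lookback window.
    slo_target:  e.g. 0.999 for a 99.9% success objective.
    """
    budget = 1.0 - slo_target  # allowed failure fraction, e.g. 0.001
    return error_ratio / budget

def should_page(error_ratio: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page when burning a 30-day budget in roughly two days
    (14.4x is a commonly cited fast-burn threshold)."""
    return burn_rate(error_ratio, slo_target) >= threshold

# 2% of requests failing against a 99.9% SLO burns the budget 20x too fast:
print(round(burn_rate(0.02, 0.999), 3))  # → 20.0
```

The point of the threshold is exactly what the paragraph above describes: it gives the team time to respond before the customer ever notices.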
Five to seven years ago, we thought metrics monitoring and visibility across DevOps were enough. They weren't. We built telemetry-based dashboards and collected all the relevant data points, but we didn't go far enough: we never prioritized the different parts of the service or proactively set a target metric to work toward. Without a planned number or indicator, we had no way to know whether we were improving over time.
Using the principles of SRE, we're more diligent about continuously learning and improving. We do a better job of documenting what happened, with retrospective reports that tell us how the system is behaving. By aggregating that data over time, we learn how the system is advancing, and we're better equipped to identify gaps and single points of failure. Often, fixing one issue or one part of the service doesn't heal the entire system.
When teams come together and agree on which specific parts of the service are absolutely critical and how to escalate when something goes awry, they get a clear sense of what's important and where to focus. This practice has been popularized by SLOs (service level objectives), which are essentially KPIs for engineers that can be communicated up, down, and across the functions of any business.
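To make the "SLOs as KPIs" idea concrete, here is a minimal sketch of the arithmetic behind an SLO report that non-engineering stakeholders can read: a measured indicator (SLI) and the share of error budget remaining. The function names and the 99.9% example target are illustrative assumptions, not from the article.

```python
# Hedged sketch: reading an SLO like a KPI. Names and numbers are illustrative.

def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the measured fraction of good events."""
    return good_events / total_events if total_events else 1.0

def error_budget_remaining(good_events: int, total_events: int,
                           target: float) -> float:
    """Fraction of the error budget left for the period.

    1.0 means untouched; 0.0 means exactly spent; negative means the SLO is blown.
    """
    allowed_bad = (1.0 - target) * total_events  # failures the SLO permits
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad

# A 99.9% availability SLO over 1M requests permits ~1,000 failures.
# With 500 failures observed, about half the budget remains:
print(round(error_budget_remaining(999_500, 1_000_000, 0.999), 6))  # → 0.5
```

A single number like "50% of the error budget remains this quarter" is what lets the metric travel up, down, and across the business in the way the paragraph above describes.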
When Should a Team Build Incident Management Rigor?
Smaller teams in growing organizations tend to be more proactive. A modern tooling stack and the absence of legacy environments definitely make this easier. However, the tendency is to do just the basics and then move on to the busy, innovative work. Important steps are often missed, such as updating runbooks or creating useful dashboards and reports. Retrospectives (post-mortems) are a huge learning opportunity, yet this step is sometimes skipped or conducted only for Sev0 or Sev1 incidents.
Because reliability is a constantly evolving practice, we recommend teams dedicate a percentage of their time to codifying and improving process and tooling. In fact, part of the SRE function is dedicated to doing just that. All too often, SRE teams are pulled into coding work or managing infrastructure ahead of, say, a big release, which takes time away from the operational rigor that's critical to reliability.
Ultimately, reliability is a journey, and knocking down key milestones is a long-haul investment. Young, fast-paced teams risk biting off more than they can chew, especially under a high-velocity release schedule. All organizational aspects and team workload must be factored into the plan, along with the team's existing maturity and skill set.
Hiring dedicated, seasoned SREs is difficult, so establishing in-house enablement and training is a better path to success. On-the-job learning, which is unique to your environment, is irreplaceable and cannot easily be taught without daily practice.
Take a baby-steps, milestone-based approach each quarter, and you will quickly start to see results.
If your team is bigger and in an established organization, you need to consider migrating from older practices and tools. Taking a staged approach requires program and project management and therefore dedicated time from existing or new team members. Without executive investment and sponsorship, it will be an uphill climb. The reality is that customer expectations are no different for a smaller service provider compared to a large enterprise company with deeper resources. Modernizing DevOps teams is the only path forward.
Culture Is the Great Enabler
SRE is markedly different because of its emphasis on culture. There's a very specific type of culture that SRE teaches us to adopt. I like to think of it as twofold.
First, you want everyone in the organization to put the customer first. In other words, you're successful in your job when the customer is happy.
Second, trust and believe that incidents and failures stem from systemic problems that require systemic solutions. The first part about focusing on the customer is hugely important to understanding why we should care about reliability in the first place. We're all here to create cutting-edge tech that simplifies and streamlines, absolutely. But ultimately we're also here to provide a service and be useful. If we can get the business and engineering functions to align on the same goals, speak the same language, and humanize our processes more, we're on the right track.
We're making great progress. This year, I'm looking forward to a few trends manifesting in our day-to-day experiences. I want to have engineers exposed to the front lines. Connect them with customers much more often. Let them hear what customers are saying and asking for.
Also, I hope to see go-to-market teams get close to engineers, understand the complexities of their products, and observe them as they manage tradeoffs. Whichever side of the company, we're all dealing with prioritization in our day-to-day. Everyone's doing it. Adopting a holistic view and witnessing how our functions play their part in a greater ecosystem is a game-changer.
Third-party research from leading provider Gartner states:
"IT organizations struggle to demonstrate the business value of I&O [infrastructure and operations]. As a result, business leaders often see I&O as a cost center rather than an enabler of business value. There's plenty of work to be done. Recommendations are wide-ranging and include:
■ Use business and IT operations outcomes and metrics that stakeholders will understand and appreciate
■ Facilitate regular communications that highlight progress toward the identified outcomes and metrics."
For every engineering team that we work with, we always ask about the cultural mindset and how goals map to the business needs. Improving the entire incident management process is unquestionably important, but even more critical is how we translate the outcomes of that learning to other business stakeholders in order to achieve a resilient culture.