APM Site Reliability Engineering Shortcuts: Just Add Water
July 26, 2018

Kieran Taylor
CA Technologies

App design and development have changed dramatically in the last five years, and with them the monitoring tools, technology and IT operations roles that support those applications. The advent of Site Reliability Engineering (SRE) and the rise of open source monitoring tools are two examples of these shifts. Open source solutions for app performance and reliability have even found their way into new commercial monitoring tools, offering monitoring teams an alternative to “do it yourself” (DIY) approaches.


 
Traditionally, IT Ops teams have taken an inside-out view of the world. Individual specialists myopically monitor their respective app, infrastructure or network components in the hope that their discrete efforts will combine to improve user experience. This “find and fix faults” approach means waiting for an alert to fire and then dispatching experts to troubleshoot the issue.
 
Site Reliability Engineering (SRE) stands that approach on its head by viewing reliability through an outside-in lens. Companies like LinkedIn, Target and Netflix gauge success by first measuring the quality of the experience delivered to end users. When digital experience becomes the bellwether measurement, these dev- and ops-savvy SRE teams spend less time chasing misleading alerts and instead focus their efforts on delivering the best possible experience across every touchpoint of customer engagement.
 
Marc Alvidrez, SRE for Google explains that "Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users' overall happiness — with features, service, and performance — is optimized." While not every company has a role that is formally titled SRE, most are adapting to manage this balance. Indeed, IT teams are now fusing software engineering into their operations practices. This blending signals a shift from staffing specialized domain experts to hiring generalists that understand the “full stack” — not just front and back end development code but also the underlying infrastructure that executes and delivers the app experience to requesting users.
 
That shift is driving SREs to develop new ways to collect and correlate monitoring insights across the digital delivery chain while also identifying ways to automate scaling and problem remediation. Specialized monitoring tools stand in the way because each collects and stores data in a different format (structured, unstructured, time series, topological, etc.), which makes correlation difficult. Complicating the challenge is the sheer increase in the volume, variety and velocity of data that must be collected. Coders at heart, many SREs turn to open source solutions to address these challenges. Open source technologies such as Elasticsearch, Logstash and Kibana (the ELK stack), Kafka, Apache Spark and Mineral are some of the building blocks that SREs use to code their own solutions to collect, store and analyze app performance data.
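
To make the correlation problem concrete, here is a minimal Python sketch of the kind of glue code an SRE might write. It is an illustration only, not any particular tool's API: the `Event` shape, the log line layout, the metric sample fields and the 60-second window are all assumptions chosen for the example. It converts an unstructured log line and a time series sample into one common record, then pulls together everything recorded for a service around a moment of interest.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common shape that every source is normalized into (hypothetical schema)."""
    timestamp: datetime
    source: str
    service: str
    message: str


def from_log_line(line: str) -> Event:
    # Assumed log layout: "<ISO-8601 timestamp> <service> <free text>"
    ts, service, message = line.split(" ", 2)
    return Event(datetime.fromisoformat(ts), "log", service, message)


def from_metric(sample: dict) -> Event:
    # Assumed sample layout: {"ts": epoch seconds, "service", "name", "value"}
    ts = datetime.fromtimestamp(sample["ts"], tz=timezone.utc)
    text = f'{sample["name"]}={sample["value"]}'
    return Event(ts, "metric", sample["service"], text)


def correlate(events, service, center, window_seconds=60):
    """Return events for one service within the window around `center`, oldest first."""
    nearby = (
        e for e in events
        if e.service == service
        and abs((e.timestamp - center).total_seconds()) <= window_seconds
    )
    return sorted(nearby, key=lambda e: e.timestamp)
```

Once every source is reduced to the same record, a latency spike and the log line that explains it can be read side by side. A real pipeline would typically put a message bus and a search index behind this step, but the normalization idea is the same.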
 
Designing and developing homegrown solutions for app and user monitoring requires a non-trivial amount of time and effort. What's more, the challenge only grows as machine learning and artificial intelligence for IT Ops (AIOps) become core components of automated problem remediation. Common performance problems are recognizable, and with machine learning, pattern recognition can be employed to detect and remediate issues automatically. For that to work, however, tools must have a library of these performance problems and of the optimal remedy for each.
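
As a rough illustration of what such a library might look like, the Python sketch below pairs each known problem with a detection rule and a remedy. Everything in it is hypothetical: the metric names, the thresholds and the remedies are invented for the example, and simple hand-written rules stand in for the learned pattern recognition an AIOps tool would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KnownProblem:
    name: str
    matches: Callable[[dict], bool]  # rule standing in for a learned pattern
    remedy: str


# Hypothetical library entries; metric names and thresholds are assumptions.
PROBLEM_LIBRARY = [
    KnownProblem(
        name="memory_leak",
        matches=lambda m: m.get("heap_used_pct", 0) > 90
        and m.get("heap_growth_pct_per_hour", 0) > 5,
        remedy="restart the instance and capture a heap dump",
    ),
    KnownProblem(
        name="connection_pool_exhaustion",
        matches=lambda m: m.get("db_pool_in_use_pct", 0) >= 100,
        remedy="increase the pool size or recycle idle connections",
    ),
]


def recommend_remedy(metrics: dict) -> Optional[str]:
    """Return the remedy for the first known problem the metrics match, else None."""
    for problem in PROBLEM_LIBRARY:
        if problem.matches(metrics):
            return problem.remedy
    return None
```

Even in this toy form, the hard part is visible: the value lies in the library itself, which has to be built up and tuned from many observed incidents.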
 
Commercial monitoring solutions benefit from decades of learning and evolution but historically lacked the ability to correlate across silos. That has changed through the adoption of the latest big data and open source technologies that can normalize and correlate analytics to eliminate the traditional silos of monitoring that limited insight and control of modern apps.

IT operations management (ITOM) teams are rapidly evolving to manage modern applications and can learn many lessons from how SREs pursue performance and availability. When it comes to selecting a unified approach to monitoring, these teams have a choice. They can adopt and implement open source software in a homegrown approach, but that free software is not without cost. Alternatively, they can choose commercial monitoring solutions, which now incorporate many of these same open source technologies and save teams the trouble of building their own.

For more on this topic, check out CA's on-demand AIOps Virtual Summit, featuring an SRE session with Todd Palino, SRE at LinkedIn, and David Blank-Edelman, co-founder of SRECon.
