APM Monitoring Microservices? Consider Noise Cancellation
September 12, 2016

Kieran Taylor
CA Technologies

Invited or not, microservices are now a building block of both old and new enterprise applications. Some companies find themselves adopting microservices organically, bit by bit, as might happen with a mobile app initiative. For others, it is a premeditated decision and investment, such as refactoring an extensive legacy mainframe process into containerized applications. Planned or not, microservices adoption is on the rise and it is leading operations teams to define new approaches to microservices monitoring.

The interest in microservices isn't surprising. Breaking an application into independent components can deliver benefits not possible with monolithic centralized applications. For example, development teams gain autonomy that allows them to code and deploy more quickly as they don't have to coordinate with a fixed, static system.

Unfortunately, autonomy isn't necessarily a great thing for IT operations teams tasked with maintaining uptime and reliability. When development teams arbitrarily pick cloud compute and storage providers, operations now has to define and set up a different monitoring approach for each permutation. Defining and maintaining multiple monitoring approaches is a pain, though a surmountable one. In fact, the harder challenge may be one of scale. When app components are highly distributed, there is a corresponding (and often unanticipated) increase in messaging across a variety of systems.

Earlier this year, Sarah Wells, Principal Engineer at the Financial Times, delivered an excellent presentation titled Alert Overload – How to Adopt a Microservices Architecture without Being Overwhelmed by Noise. In it she describes the use of 45 different microservices across 3 environments (integration, test and production), each of which runs on two virtual machines (VMs). The team ran 20 checks per service every 5 minutes. The math works out to be about 1.5 million system checks per day. This of course created a steep increase in alerting and email messages. Sarah Wells' team counted 19,000 system alerts in a 50-day period. That's an average of 380 alerts each day. Of course, not every alert requires a reaction, but at that volume sorting the wheat from the chaff becomes impossible.
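The arithmetic behind those figures can be checked in a few lines. This is only a sketch: the variable names are illustrative, and it assumes the 20 checks run against every VM in every environment, which is the reading that reproduces the "about 1.5 million" figure.

```python
# Figures quoted from the presentation as summarized above.
services = 45
environments = 3            # integration, test, production
vms_per_service = 2
checks_per_service = 20
runs_per_day = 24 * 60 // 5  # one run every 5 minutes

# Assumption: every check runs on every VM in every environment.
checks_per_day = (
    services * environments * vms_per_service
    * checks_per_service * runs_per_day
)

alerts_counted = 19_000
days_observed = 50
alerts_per_day = alerts_counted / days_observed

print(f"checks per day: {checks_per_day:,}")
print(f"alerts per day: {alerts_per_day:.0f}")
```

Under that assumption the total is a little over 1.5 million checks a day, and the alert count averages out to 380 a day.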

The increase in the sheer number of microservice system checks is one driver of these "alert storms." However, antiquated approaches to baselining, built for monolithic legacy environments, also contribute to the pain. Most monitoring systems have teams predict acceptable performance measurements, set those manually and then throw an alert when those thresholds are crossed. Even if the predictions were accurate, setting and maintaining so many thresholds isn't sustainable.

The main problem with this approach to baselining is that it's a binary, pass/fail test and doesn't convey any sense of severity. This is why companies that are monitoring microservices in full production need new approaches that don't drown them in alert noise and instead deliver actionable insights. It's a simple analogy, but what's needed is a noise cancellation system much like the headphones used on airplanes or public transit. IT operations teams want to hear the music and not the noise, but how do they accomplish that?

It turns out a clever answer to monitoring microservices can be found in the work of Walter Shewhart, an early 20th century statistician who worked for the Western Electric Company. By calculating the standard deviation of copper line quality, Shewhart showed that simple comparisons against bands of standard deviation could effectively identify points at which the signal is exhibiting uncontrolled variance, something like how an earthquake registers on a seismometer. The decision rules built on this kind of control charting have come to be referred to as the Western Electric Rules. Informally these are sometimes called "how wrong for how long" algorithms because they can distinguish between small nuisance alerts and anomalous trends worthy of action. When applied to APM and monitoring microservices, they give IT operations teams confidence that the alerts they receive reflect actionable problems in their myriad distributed systems.
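To make "how wrong for how long" concrete, here is a minimal Python sketch of the four classic Western Electric zone rules. It is an illustration only, not CA APM's implementation: the function name, the sample data, and the idea of passing in a center line and standard deviation taken from an earlier baseline window are all assumptions made for the example.

```python
def western_electric_violations(samples, center, sigma):
    """Return (index, rule) pairs for points that trip a rule.

    center and sigma are assumed to come from a baseline window
    measured beforehand; sigma must be greater than zero.
    """
    # Express every sample as a distance from center in sigmas.
    z = [(x - center) / sigma for x in samples]

    # (rule number, window size, points needed, sigma limit)
    zone_rules = (
        (2, 3, 2, 2.0),  # 2 of 3 points beyond 2 sigma, same side
        (3, 5, 4, 1.0),  # 4 of 5 points beyond 1 sigma, same side
        (4, 8, 8, 0.0),  # 8 points in a row on one side of center
    )

    violations = []
    for i in range(len(z)):
        # Rule 1: a single point beyond 3 sigma (very wrong, briefly).
        if abs(z[i]) > 3:
            violations.append((i, 1))
            continue
        # Rules 2-4: smaller deviations that persist over a window.
        for rule, window, needed, limit in zone_rules:
            if i + 1 < window:
                continue
            recent = z[i + 1 - window : i + 1]
            above = sum(1 for v in recent if v > limit)
            below = sum(1 for v in recent if v < -limit)
            if max(above, below) >= needed:
                violations.append((i, rule))
                break  # report only the most severe rule per point
    return violations


# Hypothetical response times (ms) against a baseline of 100 +/- 10.
latencies = [101, 99, 135, 102, 98, 123, 97, 124, 100]
print(western_electric_violations(latencies, center=100, sigma=10))
```

In this sample, the single 135 ms spike trips rule 1 on its own, while the 123 ms and 124 ms readings only raise an alert together, under rule 2. The small wobbles around 100 ms never fire, which is the noise cancellation effect described above.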

CA APM recently adopted this approach, which employs standard deviations to establish variance intensities, calling it Differential Analysis. Unlike traditional baselining that relies on static predictive models, this new technique is purpose-built for the highly dynamic environments typical of microservices. History repeats itself. Shewhart's math is something old that's new again.

Kieran Taylor is Sr. Director, Product Marketing, CA Technologies.
