Getting alerts about application or infrastructure problems is easy. Making sense of the alerts, and turning the alerts into actionable information, is much harder.
That is because the proliferation of monitoring tools, combined with the complexity of modern application performance problems, means that DevOps teams can easily find themselves inundated with alerts that are difficult to interpret. In many cases, simply cutting through all of the alert “noise” to identify specific problems can seem next to impossible.
Faced with so much alert noise, successful DevOps teams must take new approaches to managing alerts. Let’s take a look at what those approaches entail.
The Problem with Modern Software Monitoring
The issue of unmanageable alert noise stems from two main sources.
High Volumes of Alerts
The first is the simple fact that modern DevOps teams have so many monitoring tools to contend with. Keeping tabs on modern applications typically involves using infrastructure monitoring tools, application performance monitoring (APM) tools, network monitoring software, security monitoring software, and sometimes other types of monitoring tools to boot. All of these monitoring frameworks produce alerts that DevOps engineers must manage.
Complicating the challenge is the fact that some alerts may be redundant; for example, network and security monitoring tools may both pick up on the same issue and send multiple alerts about it, but it may not be immediately obvious from the alerts that they pertain to the same problem.
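One way to tame redundant alerts is to normalize each alert to a fingerprint so that the same underlying issue reported by different tools collapses into a single entry. Here is a minimal sketch in Python; the alert schema (`host`, `check`, `source` fields) is a hypothetical one chosen for illustration:

```python
import hashlib

def fingerprint(alert):
    """Derive a stable key from the fields that identify the underlying
    issue, ignoring which monitoring tool reported it."""
    key = f"{alert['host']}|{alert['check']}"
    return hashlib.sha256(key.encode()).hexdigest()[:12]

def dedupe(alerts):
    """Keep the first alert per fingerprint and count suppressed duplicates."""
    seen = {}
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in seen:
            seen[fp]["duplicates"] += 1
        else:
            seen[fp] = {**alert, "duplicates": 0}
    return list(seen.values())

alerts = [
    {"source": "network-monitor", "host": "db-1", "check": "port_5432_down"},
    {"source": "security-monitor", "host": "db-1", "check": "port_5432_down"},
    {"source": "apm", "host": "web-1", "check": "latency_high"},
]
# The two db-1 alerts share a fingerprint and collapse into one entry.
unique = dedupe(alerts)
```

In practice the fingerprint would be built from whatever fields your tools agree on (host, check name, resource ID); the point is that engineers see one alert per issue, annotated with how many tools reported it.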
Complex Root Causes
The second source is the difficulty of tracing an alert to its root cause within complex software environments.
Modern applications are often composed of complex webs of interdependent microservices. A problem with one application service may be rooted in a different service. For example, users may be unable to log in because an authentication service is not working as expected, but the root cause of the problem might lie with a failed database service that is preventing access to a credentials database.
The adoption of software-defined infrastructure creates similar uncertainty. The root cause of a storage failure within an application could be a problem with underlying hardware, a software-defined storage system or an application storage service.
Finally, modern applications tend to be highly dynamic and scalable. Loads change quickly, and what constitutes normal behavior or user traffic at one moment may look very different the next.
All of this complexity makes it tremendously difficult to interpret alerts, especially when they arrive in high volumes or at a dizzying pace.
A New Approach to Alerting
How can DevOps teams handle these modern alerting challenges? The answer lies in adopting a fundamentally new approach to alert management, based on the following strategies.
Dynamic Baselining
One way to cut through alert noise and determine which alerts merit immediate attention is to establish a baseline of normal behavior, then use that baseline to judge whether an alert signals an out-of-the-ordinary issue.
To be effective within rapidly changing applications, a baseline needs to be dynamic. Instead of establishing a single static baseline, DevOps teams should create multiple baselines based on time of day, the number of connected users or other factors that shape a software environment.
With dynamic baselines in place, it becomes much easier to differentiate serious alerts from mere noise.
One commercial example of this approach is Differential Analysis in CA’s APM solution, which uses proven statistical methods to establish variance intensities across metrics like latency and response times. Unlike static baselining with binary pass/fail conditions, dynamic methods like this are well suited to containerized microservice environments and API-centric applications.
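The dynamic-baseline idea can be sketched in a few lines: compute a separate mean and standard deviation per time-of-day bucket, then flag a reading only if it deviates from that bucket's norm. The hour buckets and the three-sigma threshold below are illustrative assumptions, not a prescription:

```python
from statistics import mean, stdev

def build_baselines(samples):
    """samples: list of (hour_of_day, response_ms) pairs. Build one
    baseline per hour rather than a single static threshold."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_anomalous(baselines, hour, value, n_sigma=3.0):
    """Flag a reading only if it deviates from *that hour's* norm."""
    mu, sigma = baselines[hour]
    return abs(value - mu) > n_sigma * max(sigma, 1e-9)

# Peak-hour latency (hour 12) is normally ~200 ms; the same reading
# in the quiet early morning (hour 3) would be an anomaly.
history = [(3, v) for v in (40, 42, 38, 41)] + [(12, v) for v in (200, 210, 190, 205)]
baselines = build_baselines(history)
```

A 200 ms response at 3 a.m. trips the baseline while the identical reading at noon does not, which is exactly the distinction a static threshold cannot make.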
Alert Playbooks
Troubleshooting a serious alert on the fly is never ideal. Instead of waiting until a problem occurs to decide how to respond, it is worth investing in alert playbooks that spell out how a DevOps team will respond to a given type of alert.
Playbooks can include information about which dependencies or other hard-to-notice issues may be associated with an alert in order to help engineers identify the root cause more quickly.
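A playbook registry can be as simple as a lookup table keyed by alert type, bundling the response steps with the dependencies worth checking first. The alert type and service names below are hypothetical, chosen to mirror the login/credentials example earlier in the article:

```python
# Hypothetical playbook registry: each alert type maps to response
# steps and the upstream dependencies to check before anything else.
PLAYBOOKS = {
    "auth_failures_spike": {
        "check_first": ["credentials-db", "token-cache"],
        "steps": [
            "Check credentials database availability",
            "Verify token-cache hit rate",
            "If both are healthy, inspect auth-service logs",
        ],
    },
}

def playbook_for(alert_type):
    """Return the documented response, or a fallback that tells the
    engineer this alert type still needs a playbook written for it."""
    return PLAYBOOKS.get(alert_type, {
        "check_first": [],
        "steps": ["No playbook yet: triage manually and write one afterward"],
    })
```

The fallback matters: every alert that arrives without a playbook is a prompt to add one, so coverage grows with each incident.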
Dependency Mapping
Being able to map dependencies between microservices and within application infrastructure is an important resource for getting to the root of a problem quickly. Dependencies are not something you want to be tracing manually when you’re in the midst of troubleshooting alerts. Instead, adopt monitoring software that can map dependencies in real time.
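Once a dependency map exists, finding root-cause candidates is a graph walk: from the alerting service, follow failing dependencies downward until you reach services whose own dependencies are healthy. A minimal sketch, using the hypothetical login/auth/database chain described earlier as the dependency graph:

```python
# Hypothetical dependency graph: service -> services it depends on.
DEPS = {
    "login-ui": ["auth-service"],
    "auth-service": ["credentials-db"],
    "credentials-db": [],
}

def root_cause_candidates(failing, service, deps=DEPS):
    """Walk the dependency graph from the alerting service and return
    the deepest failing dependencies: the likely root causes."""
    failing_deps = [d for d in deps.get(service, []) if d in failing]
    if not failing_deps:
        return [service]  # nothing this service depends on is failing
    causes = []
    for dep in failing_deps:
        causes.extend(root_cause_candidates(failing, dep, deps))
    return causes

# Alerts fire on all three services, but the walk points past the
# symptoms (login-ui, auth-service) to the credentials database.
failing = {"login-ui", "auth-service", "credentials-db"}
```

Real topologies are larger and cyclic in places, which is why this is a job for monitoring software rather than a whiteboard, but the principle is the same.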
Automatic Escalation
When a serious alert occurs and the engineer who receives it does not respond, the alert should be escalated automatically to other team members until someone handles the issue. There are many reasons why the first person to receive an alert might not respond, and you don’t want such an oversight to lead to a breakdown in your alert-and-response process.
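The escalation loop itself is straightforward: page each engineer on the rotation in turn, wait for an acknowledgement up to a timeout, and move on if none arrives. In this sketch the paging and acknowledgement hooks are injected callables (a real system would wire them to a paging service), which is an assumption made to keep the example self-contained:

```python
import time

def escalate(alert, on_call, is_acked, page, ack_timeout_s=300, poll_s=1.0):
    """Page each on-call engineer in turn until one acknowledges the alert.
    Returns the acknowledging engineer, or None if the rotation is exhausted."""
    for engineer in on_call:
        page(engineer, alert)
        waited = 0.0
        while True:
            if is_acked(engineer):
                return engineer          # acknowledged: stop escalating
            if waited >= ack_timeout_s:
                break                    # timed out: try the next engineer
            time.sleep(poll_s)
            waited += poll_s
    return None  # nobody acknowledged: surface to the whole team

# Demo: the first engineer never acks, the second does.
paged = []
handler = escalate(
    "credentials-db unreachable",
    on_call=["alice", "bob"],
    is_acked=lambda engineer: engineer == "bob",
    page=lambda engineer, alert: paged.append(engineer),
    ack_timeout_s=0,  # zero timeout so the demo runs instantly
)
```

The `None` return is the important design point: exhausting the rotation should trigger a louder fallback, never silence.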
Targeted Alert Routing
Most alerts don’t need to go to all of your team members. If they do, your engineers will quickly become overwhelmed with more alerts than they can handle.
Instead, configure alerting policies so that each type of alert goes to the engineers most qualified to respond. This is one key step in preventing what’s known as alert fatigue.
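A routing policy can start as a simple table from alert category to on-call group, with a catch-all triage queue for anything unrecognized so no alert fans out to everyone by default. The categories and group names here are hypothetical:

```python
# Hypothetical routing policy: alert category -> on-call group.
ROUTES = {
    "database": ["dba-oncall"],
    "network": ["netops-oncall"],
    "security": ["secops-oncall"],
}

def route(alert):
    """Send an alert only to the group qualified to act on it; anything
    unrecognized goes to a catch-all triage queue, not to everyone."""
    return ROUTES.get(alert["category"], ["triage-queue"])
```

Dedicated alerting tools express the same idea with richer matchers (labels, severities, schedules), but the default-to-triage-not-to-everyone principle carries over.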
Metrics-Based Alert Management
When your DevOps team is juggling multiple alerting tools and complex streams of alerts, you can’t take an ad hoc approach to determining whether your alert management strategy is effective. Instead, collect metrics, such as Mean Time to Respond (MTTR), which can help to reveal whether your current alerting strategy is meeting your goals and how changes to alert management affect your outcomes over time.
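MTTR is easy to compute once you record when each incident was opened and resolved. A minimal sketch, assuming incidents are tracked as ISO-8601 timestamp pairs:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """incidents: list of (opened, resolved) ISO-8601 timestamps.
    Returns the mean time to respond, in minutes, across incidents."""
    durations = [
        (datetime.fromisoformat(resolved) - datetime.fromisoformat(opened)).total_seconds() / 60
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)

incidents = [
    ("2024-05-01T10:00:00", "2024-05-01T10:30:00"),  # 30 minutes
    ("2024-05-02T14:00:00", "2024-05-02T14:10:00"),  # 10 minutes
]
# Mean across the two incidents: 20.0 minutes.
```

Tracked over time, this single number shows whether changes to routing, playbooks and escalation are actually shortening responses or merely reshuffling alerts.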
Modern software monitoring and alerting will only grow more challenging as applications and infrastructure become ever more complex. Effective alert management requires strategies that can mitigate the amount of noise that your engineers have to work through when responding to alerts, as well as help your DevOps teams to determine quickly where the root cause of an issue lies.