Any good SysAdmin or DevOps engineer knows that there is no shortage of available monitoring tools (both in the open source and closed source arenas). Some of these are hosted on-premises, while others are hosted in the cloud, and both can come with quick agent deployments to have you monitoring and/or graphing key metrics within minutes.
However your infrastructure is configured, the real challenge comes with identifying exactly what should be monitored and alerted on, and defining sensible thresholds around the alerting. In this blog, we’re going to discuss the common monitoring mistakes admins make, as well as best practices to help build a good monitoring platform with alerts that are clear, understandable, and above all else, actionable.
Most monitoring systems leave you master of your own destiny by offering a wide range of configuration choices. To help administrators get started, the majority ship with some default checks and default thresholds for those checks. Remember, though, that environments differ significantly depending on their purpose, where they're located, and their importance. As such, the monitoring around them should be tuned to be relevant for the environment in question.
Here are some common mistakes that I have both witnessed and made in my career:
Utilizing Only the Default Checks
In many cases, the default monitoring checks will look at things like CPU usage, RAM usage, available disk space, system load, and swap usage. The default thresholds will often be inappropriate for your use case, and as a result, you may receive a steady stream of alerts for conditions that don't actually warrant one.
A classic example is a load check alerting on a Linux server's load average because it's hitting a threshold that was written for a 1-CPU system, while you're running an 8-CPU system. You could (and should) adjust the threshold accordingly. If that erroneous alert fires several times a day, someone will eventually disable the check, and a real load incident that could cause downtime will then go unnoticed.
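One way to make that adjustment is to normalize the load threshold by CPU count, so the same value works on a 1-CPU and an 8-CPU box alike. This is a minimal sketch; the 0.9 per-CPU threshold is an assumed example value, not a recommendation:

```python
# Per-CPU load threshold; 0.9 is an assumed example value, tune per environment.
LOAD_THRESHOLD_PER_CPU = 0.9

def load_alert(load1: float, cpus: int,
               threshold: float = LOAD_THRESHOLD_PER_CPU) -> bool:
    """True if the 1-minute load average exceeds the threshold, scaled by CPU count."""
    return (load1 / cpus) > threshold

# On Linux, live values come from os.getloadavg()[0] and os.cpu_count().
```

With this scaling, a load of 6.0 on an 8-CPU system stays quiet, while the same load on a 1-CPU system alerts.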
Another issue with relying solely on the default checks is that the system's core functions go unmonitored (e.g. those of a web server, database server, or email server).
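As a minimal sketch of such a service check, the following assumes that a plain TCP connect is enough to verify a daemon is listening; real checks for web, database, or mail servers should also exercise the protocol itself (an HTTP request, a test query, and so on):

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check like this would run for port 80/443 on web servers, 3306 on MySQL servers, 25 on mail servers, and so on, alongside the default resource checks.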
Not Understanding What the Checks Do
It’s easy to make mistakes when implementing a new monitoring system or strategy. One of these mistakes is implementing checks that you may not fully understand — for instance, a check that looks at inode usage on a system. If you don’t understand what inodes are or why they’re worth monitoring, you may not be able to tell whether an alert is real (and if it is, how to go about fixing it).
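For reference, inode usage is what `df -i` reports on Linux: a filesystem can run out of inodes (and refuse to create new files) long before it runs out of disk space. A minimal sketch of reading it programmatically, using Python's `os.statvfs`:

```python
import os

def inode_usage_pct(path: str = "/") -> float:
    """Percentage of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # filesystem doesn't report inode counts
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files
```

Understanding the check at this level also tells you what the fix looks like: an inode alert means too many files, not too much data, so the remedy is deleting or archiving large numbers of small files.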
Ignoring or Disabling Alerts
We touched on this in one of the previous sections, but I have seen it frequently. Consider a check that looks at a site's SSL certificate validity and expiration date. The thresholds may begin alerting you 30 days before expiration to allow you adequate time to renew the SSL certificate. This is a good strategy. However, I often see the default alert repetition left in place for these sorts of checks, which means they begin to alert regularly, every hour on the hour. This could go on for the entire month, but in most cases I’ve seen people simply disable the alert with the idea of, “I’ll get to this later.” Inevitably the check gets forgotten and the SSL certificate expires, rendering the check and alert entirely pointless!
The best way I’ve found to deal with this is to amend the alert so it fires at 30 days, then 20 days, then 10 days, then 5 days, and then every day until expiration. This way the alert is much more likely to be acted upon and much less likely to be disabled, preventing the SSL expiration in the first place.
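That escalating schedule can be sketched as a simple predicate. The day values mirror the ones above; how you obtain `days_left` from the certificate's expiry date is left out here:

```python
def should_alert(days_left: int) -> bool:
    """Escalating reminder schedule: 30, 20, 10 days out, then daily from 5 days.

    Also keeps alerting after expiration (days_left <= 0) until renewed.
    """
    return days_left in (30, 20, 10) or days_left <= 5
```

Each reminder then arrives as a fresh, deliberate event rather than hourly noise, which is what makes it likely to be acted on.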
Creating an Alerting Policy
Taking action on your alerting first requires understanding what you’re trying to achieve, and planning accordingly. Here are the steps I take to achieve this:
■ Identify each server role and which checks apply to it.
■ Work out what the thresholds should be, relative to each server and the applications running on it.
■ Identify at which times these alerts should be sent.
■ Define who should receive the alerts.
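The steps above can be captured as a policy document your tooling reads. This is a hypothetical sketch; the role names, checks, thresholds, and addresses are all made up for illustration:

```python
# Hypothetical alert policy keyed by server role. Every name, threshold,
# and address below is illustrative, not a recommendation.
ALERT_POLICY = {
    "web": {
        "checks": ["http_response", "cpu", "disk"],
        "disk_threshold_pct": 85,
        "alert_hours": (0, 24),  # 24x7
        "recipients": ["oncall-web@example.com"],
    },
    "db": {
        "checks": ["replication_lag", "cpu", "disk"],
        "disk_threshold_pct": 80,
        "alert_hours": (0, 24),
        "recipients": ["oncall-db@example.com"],
    },
}

def recipients_for(role: str) -> list:
    """Who should receive alerts for a given server role (empty if unknown)."""
    return ALERT_POLICY.get(role, {}).get("recipients", [])
```

Keeping the policy in one reviewable place makes it easy for the whole team to discuss and adjust the roles, thresholds, and recipients over time.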
The objectives of such a policy are to ensure that every alert received is necessary and not noisy (i.e. no duplicate alerts for the same issue), to prevent staff from becoming complacent, and to keep alerts relevant, thereby minimizing the total number sent.
Furthermore, the alerts should be clear to allow the person taking action to either fix or identify the issue quickly and effectively.
Deploying a new monitoring system can be a complex task and requires a great deal of thought. I would go so far as to say it also requires constant rethinking and adjustment over time to ensure you’re getting the most out of the system. If you sit down with all your engineers and properly discuss the monitoring that needs to take place and the thresholds that should apply, you’re much more likely to deploy an actionable monitoring solution and reduce the likelihood of problems going unnoticed.