Universal Monitoring Crimes and What to Do About Them - Part 1
May 22, 2018

Leon Adato
SolarWinds

Monitoring is a critical aspect of any data center operation, yet it often remains the black sheep of an organization's IT strategy: an afterthought rather than a core competency. Because of this, many enterprises have a monitoring solution that appears to have been built by a flock of "IT seagulls" — technicians who swoop in, drop a smelly and offensive payload, and swoop out. Over time, the result is layer upon layer of offensive payloads that are all in the same general place (your monitoring solution) but have no coherent strategy or integration.

Believe it or not, this is a salvageable scenario. By applying a few basic techniques and monitoring discipline, you can turn a disorganized pile of noise into a monitoring solution that provides actionable insight. For the purposes of this piece, let's assume you've at least implemented some type of monitoring solution within your environment.

At its core, monitoring as a foundational IT discipline is meant to help IT professionals escape the short-term, reactive nature of administration, which is often caused by insufficient monitoring, and become more proactive and strategic. All too often, however, organizations are instead bogged down by monitoring systems that are improperly tuned (or not tuned at all) for their environment and business needs. The result is unnecessary or incorrect alerts that introduce more chaos and noise than order and insight, which in turn leads your staff to value monitoring even less.

So, to help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the top five universal monitoring crimes and what you can do about them. This part covers the first three:

1. Fixed thresholds

Monitoring systems that trigger any type of alert at a fixed value for a group of devices are the "weak tea" of solutions. While general thresholds can be established, it is statistically impossible that every single device is going to adhere to the same one, and extremely improbable that even a majority will.

Even a single server has utilization that varies from day to day. A server that usually runs at 50 percent CPU, for example, but spikes to 95 percent at the end of the month is behaving perfectly normally, yet a fixed threshold will trigger an alert on that spike. The result is that many organizations create multiple versions of the same alert (CPU Alert for Windows IIS-DMZ; CPU Alert for Windows IIS-core; CPU Alert for Windows Exchange CAS, and so on). And even then, fixed thresholds usually throw more false positives than anyone wants.

What to do about it:

■ GOOD: Enable per-device (and per-service) thresholds. Whether you do this within the tool or via customizations, you should ultimately be able to set a specific threshold for any device, so that devices with their own threshold trigger at the correct time and all others fall back to the default.

■ BETTER: Use existing monitoring data to establish baselines for "normal" and then trigger when usage deviates from that baseline. Note that some edge cases may require a second condition to help define when the alert is triggered.
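
Both options can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's feature: the device names, the 90 percent default, the three-standard-deviation rule, and the `min_spread` guard (one possible "second condition" for very flat baselines) are all assumptions made for the example.

```python
from statistics import mean, stdev

# Assumed default for any device without its own threshold (percent CPU).
DEFAULT_CPU_THRESHOLD = 90.0

# Hypothetical per-device overrides; the names are illustrative only.
CPU_THRESHOLDS = {
    "iis-dmz-01": 75.0,
    "exchange-cas-01": 97.0,
}


def threshold_for(device):
    """GOOD: use the device's own threshold, else fall back to the default."""
    return CPU_THRESHOLDS.get(device, DEFAULT_CPU_THRESHOLD)


def exceeds_threshold(device, cpu_percent):
    return cpu_percent > threshold_for(device)


def deviates_from_baseline(history, current, num_stdevs=3.0, min_spread=1.0):
    """BETTER: compare a reading with the baseline built from past readings.

    `history` should hold readings for the same device from comparable
    periods (for example, previous month-end runs). `min_spread` is an
    assumed guard so that a very flat history does not make tiny wobbles
    look like anomalies.
    """
    if len(history) < 2:
        return False  # not enough data to call anything abnormal
    baseline = mean(history)
    spread = max(stdev(history), min_spread)
    return abs(current - baseline) > num_stdevs * spread
```

With this shape, a month-end reading of 95 percent is compared against previous month-end readings rather than against the server's quiet-day average, so the expected spike no longer counts as a deviation.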

2. Lack of monitoring system oversight

While it's certainly important to have a tool or set of tools that monitor and alert on mission-critical systems, it's also important to have some sort of system in place to identify problems within the monitoring solution itself.

What to do about it: Set up a separate instance of a monitoring solution that keeps track of the primary, or production, monitoring system. It can be another copy of the same tool or tools you are using in production, or a separate solution, whether open source or vendor-provided.
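
One simple form of such oversight is a heartbeat check run by the secondary instance. The sketch below assumes the primary system's components each record a "last seen" timestamp somewhere the watcher can read; the component names and the five-minute silence limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed limit on how long a component may stay silent before it is suspect.
MAX_SILENCE = timedelta(minutes=5)


def stale_components(heartbeats, now, max_silence=MAX_SILENCE):
    """Return the names of components whose last heartbeat is too old.

    `heartbeats` maps a component name to the time it last reported in.
    """
    return sorted(
        name for name, last_seen in heartbeats.items()
        if now - last_seen > max_silence
    )


# Example: the alert engine of the primary system has gone quiet.
now = datetime(2018, 5, 22, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "poller": now - timedelta(minutes=2),
    "alert-engine": now - timedelta(minutes=12),
    "database": now - timedelta(minutes=5),
}
quiet = stale_components(heartbeats, now)
```

The point of running this from a separate instance is that a failure of the primary system cannot also silence the check that is supposed to report it.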

For another option to address this, see the discussion on lab and test environments in Part 2 of this blog.

3. Instant alerts

There are endless reasons why instant alerts (alerts that your monitoring system triggers as soon as a condition is detected) can cause chaos in your data center. For one thing, monitoring systems are not infallible and may raise "false positive" alerts that don't truly require a remediation response. For another, it's not uncommon for problems to appear for a moment and then disappear. Still other problems aren't actionable until they've persisted for a certain amount of time. You get the idea.

What to do about it: Build a time delay into your monitoring system's trigger logic so that a CPU alert, for example, fires only after all of the specified conditions have persisted for something like 10 minutes. Spikes lasting longer than 10 minutes would require more direct intervention, while anything shorter represents a temporary spike in activity that doesn't necessarily indicate a true problem.
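
The delay logic can be sketched as a small state tracker. This is an illustrative outline rather than any product's implementation: the class name, the `update` method, and the 10-minute default are assumptions, and the caller is expected to evaluate the alert conditions on each polling cycle.

```python
from datetime import datetime, timedelta


class SustainedCondition:
    """Report True only once a condition has held continuously for `hold`."""

    def __init__(self, hold=timedelta(minutes=10)):
        self.hold = hold
        self.breach_started = None  # when the current unbroken breach began

    def update(self, breached, now):
        """Record one polling result and say whether the alert should fire."""
        if not breached:
            # Condition cleared: forget the earlier spike entirely.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold


# Example: a short spike clears, then a second breach persists.
t0 = datetime(2018, 5, 22, 9, 0)
cpu_alert = SustainedCondition()
results = [
    cpu_alert.update(True, t0),                          # breach begins
    cpu_alert.update(True, t0 + timedelta(minutes=5)),   # still too short
    cpu_alert.update(False, t0 + timedelta(minutes=6)),  # spike cleared
    cpu_alert.update(True, t0 + timedelta(minutes=7)),   # new breach begins
    cpu_alert.update(True, t0 + timedelta(minutes=16)),  # 9 minutes so far
    cpu_alert.update(True, t0 + timedelta(minutes=17)),  # 10 minutes: fire
]
```

Resetting the timer whenever the condition clears is a deliberate choice here: two brief spikes a few minutes apart are treated as temporary activity, not as one long problem.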

Read Universal Monitoring Crimes and What to Do About Them - Part 2 for more monitoring tips.

Leon Adato is a Head Geek at SolarWinds