Universal Monitoring Crimes and What to Do About Them - Part 1

Leon Adato

Monitoring is a critical aspect of any data center operation, yet it often remains the black sheep of an organization's IT strategy: an afterthought rather than a core competency. Because of this, many enterprises have a monitoring solution that appears to have been built by a flock of "IT seagulls" — technicians who swoop in, drop a smelly and offensive payload, and swoop out. Over time, the result is layer upon layer of offensive payloads that are all in the same general place (your monitoring solution) but have no coherent strategy or integration.

Believe it or not, this is a salvageable scenario. By applying a few basic techniques and monitoring discipline, you can turn a disorganized pile of noise into a monitoring solution that provides actionable insight. For the purposes of this piece, let's assume you've at least implemented some type of monitoring solution within your environment.

At its core, the principle of monitoring as a foundational IT discipline is designed to help IT professionals escape the short-term, reactive nature of administration, often caused by insufficient monitoring, and become more proactive and strategic. All too often, however, organizations are instead bogged down by monitoring systems that are improperly tuned — or not tuned at all — for their environment and business needs. This results in unnecessary or incorrect alerts that introduce more chaos and noise than order and insight, and as a result, cause your staff to value monitoring even less.

So, to help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the top five universal monitoring crimes and what you can do about them. This part covers the first three:

1. Fixed thresholds

Monitoring systems that trigger any type of alert at a fixed value for a group of devices are the "weak tea" of solutions. While general thresholds can be established, it is highly unlikely that every single device will fit the same one, and improbable that even a majority will.

Even a single server has utilization that varies from day to day. A server that usually runs at 50 percent CPU but spikes to 95 percent at the end of the month is behaving perfectly normally, yet a fixed threshold will raise an alert on that spike. As a result, many organizations create multiple versions of the same alert (CPU Alert for Windows IIS-DMZ; CPU Alert for Windows IIS-core; CPU Alert for Windows Exchange CAS, and so on). And even then, fixed thresholds usually throw more false positives than anyone wants.

What to do about it:

■ GOOD: Enable per-device (and per-service) thresholds. Whether you do this within the tool or via customizations, each device should ultimately be able to carry its own threshold, so that machines with a specific threshold trigger at the correct value and those without one fall back to the default.

■ BETTER: Use existing monitoring data to establish baselines for "normal" and then trigger when usage deviates from that baseline. Note that some edge cases may require a second condition to help define when the alert is triggered.
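As a rough illustration of both options, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than a feature of any particular monitoring tool: the device names, the threshold values, the three-standard-deviation rule, and the `floor` parameter (one possible "second condition" for edge cases).

```python
from statistics import mean, stdev

# Hypothetical values for illustration; real ones come from your environment.
DEFAULT_CPU_THRESHOLD = 90.0
PER_DEVICE_CPU_THRESHOLD = {
    "iis-dmz-01": 80.0,
    "exchange-cas-01": 95.0,
}


def threshold_for(device: str) -> float:
    """GOOD: use the device's own threshold if it has one, else the default."""
    return PER_DEVICE_CPU_THRESHOLD.get(device, DEFAULT_CPU_THRESHOLD)


def deviates_from_baseline(history: list[float], current: float,
                           sigmas: float = 3.0, floor: float = 0.0) -> bool:
    """BETTER: flag a reading well above what is normal for this device.

    "Normal" is assumed here to be the mean of past readings, and "well
    above" to be more than `sigmas` standard deviations over that mean.
    `floor` is an optional second condition: readings below it never alert,
    which keeps a very quiet, very steady device from alerting on a tiny rise.
    """
    if len(history) < 2:
        return False  # not enough data to call anything "normal" yet
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + sigmas * spread and current >= floor
```

For the baseline to treat something like an end-of-month spike as normal, `history` would need to be drawn from comparable periods (previous month-ends, for example). How you slice the history is a design choice the sketch leaves open.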

2. Lack of monitoring system oversight

While it's certainly important to have a tool or set of tools that monitor and alert on mission-critical systems, it's also important to have some sort of system in place to identify problems within the monitoring solution itself.

What to do about it: Set up a separate instance of a monitoring solution that keeps track of the primary, or production, monitoring system. It can be another copy of the same tool or tools you are using in production, or a separate solution, whether open source or vendor-provided.
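One simple check such a secondary instance might run is a heartbeat test: has the primary monitor reported in recently? The sketch below assumes the primary system records a timestamp somewhere the watcher can read it. The function name and the five-minute limit are hypothetical.

```python
from datetime import datetime, timedelta


def monitor_is_stale(last_heartbeat: datetime, now: datetime,
                     max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the primary monitor has not checked in within max_age."""
    return now - last_heartbeat > max_age
```

The watcher would call this on a schedule and raise its own alert, through a channel that does not depend on the primary system, whenever it returns `True`.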

For another option to address this, see the discussion on lab and test environments in Part 2 of this blog.

3. Instant alerts

There are endless reasons why instant alerts (when your monitoring system triggers alerts as soon as a condition is detected) can cause chaos in your data center. For one thing, monitoring systems are not infallible and may raise "false positive" alerts that don't truly require a remediation response. For another, it's not uncommon for problems to appear for a moment and then disappear. Still other problems aren't actionable until they've persisted for a certain amount of time. You get the idea.

What to do about it: Build a time delay into your monitoring system's trigger logic. A CPU alert, for example, would need all of its specified conditions to persist for something like 10 minutes before any action is needed. Conditions lasting longer than 10 minutes would require more direct intervention, while anything shorter represents a temporary spike in activity that doesn't necessarily indicate a true problem.
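Many monitoring tools offer this kind of delay as a built-in setting. If yours does not, the logic is small enough to sketch. The class below is a hypothetical illustration, not any product's API: it reports an alert only once the condition has held continuously for the configured time, and resets the clock as soon as the condition clears.

```python
from datetime import datetime, timedelta
from typing import Optional


class SustainedCondition:
    """Fire only after a condition has held continuously for `hold`."""

    def __init__(self, hold: timedelta = timedelta(minutes=10)) -> None:
        self.hold = hold
        self.breach_started: Optional[datetime] = None

    def update(self, breached: bool, now: datetime) -> bool:
        """Record one polling result; return True if an alert should fire."""
        if not breached:
            self.breach_started = None  # condition cleared, so reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now  # first sample of a new breach
        return now - self.breach_started >= self.hold
```

Each polling cycle calls `update()` with whether the threshold was crossed and the sample time. A brief spike that clears before the hold time elapses never produces an alert.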

Read Universal Monitoring Crimes and What to Do About Them - Part 2 for more monitoring tips.
