Universal Monitoring Crimes and What to Do About Them - Part 1
May 22, 2018

Leon Adato
SolarWinds

Share this

Monitoring is a critical aspect of any data center operation, yet it often remains the black sheep of an organization's IT strategy: an afterthought rather than a core competency. Because of this, many enterprises have a monitoring solution that appears to have been built by a flock of "IT seagulls" — technicians who swoop in, drop a smelly and offensive payload, and swoop out. Over time, the result is layer upon layer of offensive payloads that are all in the same general place (your monitoring solution) but have no coherent strategy or integration.

Believe it or not, this is a salvageable scenario. By applying a few basic techniques and monitoring discipline, you can turn a disorganized pile of noise into a monitoring solution that provides actionable insight. For the purposes of this piece, let's assume you've at least implemented some type of monitoring solution within your environment.

At its core, the principle of monitoring as a foundational IT discipline is designed to help IT professionals escape the short-term, reactive nature of administration, often caused by insufficient monitoring, and become more proactive and strategic. All too often, however, organizations are instead bogged down by monitoring systems that are improperly tuned — or not tuned at all — for their environment and business needs. This results in unnecessary or incorrect alerts that introduce more chaos and noise than order and insight, and as a result, cause your staff to value monitoring even less.

So, to help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the top five universal monitoring crimes and what you can do about them:

1. Fixed thresholds

Monitoring systems that trigger any type of alert at a fixed value for a group of devices are the "weak tea" of solutions. While general thresholds can be established, it is statistically impossible that every single device is going to adhere to the same one, and extremely improbable that even a majority will.

Even a single server has utilization that varies from day to day. A server that usually runs at 50 percent CPU, for example, but spikes to 95 percent at the end of the month is perfectly normal — but fixed thresholds can cause this spike to trigger. The result is that many organizations create multiple versions of the same alert (CPU Alert for Windows IIS-DMZ; CPU Alert for Windows IIS-core; CPU Alert for Windows Exchange CAS, and so on). And even then, fixed thresholds usually throw more false positives than anyone wants.

What to do about it:

■ GOOD: Enable per-device (and per-service) thresholds. Whether you do this within the tool or via customizations, you should ultimately be able to have a specific threshold for each device so that machines that have a specific threshold trigger at the correct time, and those that do not get the default.

■ BETTER: Use existing monitoring data to establish baselines for "normal" and then trigger when usage deviates from that baseline. Note that you may need to consider how to address edge cases that may require a second condition to help define when a threshold is triggered.

2. Lack of monitoring system oversight

While it's certainly important to have a tool or set of tools that monitor and alert on mission-critical systems, it's also important to have some sort of system in place to identify problems within the monitoring solution itself.

What to do about it: Set up a separate instance of a monitoring solution that keeps track of the primary, or production, monitoring system. It can be another copy of the same tool or tools you are using in production, or a separate solution, such as open source, vendor-provided, etc.

For another option to address this, see the discussion on lab and test environments in Part 2 of this blog.

3. Instant alerts

There are endless reasons why instant alerts — when your monitoring system triggers alerts as soon as a condition is detected — can cause chaos in your data center. For one thing, monitoring systems are not infallible and may detect "false positive" alerts that don't truly require a remediation response. For another, it's not uncommon for problems to appear for a moment and then disappear. Still some other problems aren't actionable until they've persisted for a certain amount of time. You get the idea.

What to do about it: Build a time delay into your monitoring system's trigger logic where a CPU alert, for example, would need to have all of the specified conditions persist for something like 10 minutes before any action would be needed. Spikes lasting longer than 10 minutes would require more direct intervention while anything less represents a temporary spike in activity that doesn't necessarily indicate a true problem.

Read Universal Monitoring Crimes and What to Do About Them - Part 2, for more monitoring tips.

Leon Adato is a Head Geek at SolarWinds
Share this

The Latest

March 21, 2019

Achieving audit compliance within your IT ecosystem can be an iterative process, and it doesn't have to be compressed into the five days before the audit is due. Following is a four-step process I use to guide clients through the process of preparing for and successfully completing IT audits ...

March 20, 2019

Network performance issues come in all shapes and sizes, and can require vast amounts of time and resources to solve. Here are three examples of painful network performance issues you're likely to encounter this year, and how NPMD solutions can help you overcome them ...

March 19, 2019

"Scale up" versus "scale out" doesn't just apply to hardware investments, it also has an impact on product features. "Scale up" promotes buying the feature set you think you need now, then adding "feature modules" and licenses as you discover additional feature requirements are needed. Often as networks grow in size they also grow in complexity ...

March 18, 2019

Network Packet Brokers play a critical role in gaining visibility into new complex networks. They deliver the packet data and information IT and security teams need to identify problems, recognize security issues, and ensure overall network performance. However, not all Packet Brokers are created equal when it comes to scalability. Simply "scaling up" your network infrastructure at every growth point is a more complex and more expensive endeavor over time. Let's explore three ways the "scale up" approach to infrastructure growth impedes NetOps and security professionals (and the business as a whole) ...

March 15, 2019

Loyal users are the key to your service desk's success. Happy users want to use your services and they recommend your services in the organization. It takes time and effort to exceed user expectations, but doing so means keeping the promises we make to our users and being careful not to do too much without careful consideration for what's best for the organization and users ...

March 14, 2019

What's the difference between user satisfaction and user loyalty? How can you measure whether your users are satisfied and will keep buying from you? How much effort should you make to offer your users the ultimate experience? If you're a service provider, what matters in the end is whether users will keep coming back to you and will stay loyal ...

March 13, 2019

What if I said that a 95% reduction in the amount of IT noise, 99% reduction in ticket volume and 99% L1 resolution rate are not only possible, but that some of the largest, most complex enterprises in the world see these metrics in their environments every day, thanks to Artificial Intelligence (AI) and Machine Learning (ML)? Would you dismiss that as belonging to the realm of science fiction? ...

March 12, 2019
As a consumer, when you order products online, how do you expect them to get delivered? Some key requirements are: the product must arrive on time, well-packed, and ultimately must give you an easy gateway to return it if it is not as per your expectations. All this has been made possible via a single application. But what if this application doesn't function the way you want or cracks down mid-way, or probably leaks off information about you to some potential hackers? Technical uncertainty and digital chaos are the two double-edged swords dangling over this billion-dollar ecommerce market. Can Quality Assurance and Software Testing save application developers from this endless juggle? ...
March 11, 2019

Of those surveyed, 96% of organizations have a digital transformation strategy, with 57% approaching it as an enterprise-wide priority, with a clear emphasis on speed of business, costs, risk, and customer satisfaction, according to IDC’s Aligning IT Strategies and Business Expectations for Digital Transformation Success, sponsored by EasyVista ...

March 08, 2019

One of my ongoing areas of focus is analytics, AIOps, and the intersection with AI and machine learning more broadly. Within this space, sad to say, semantic confusion surrounding just what these terms mean echoes the confusions surrounding ITSM ...