Universal Monitoring Crimes and What to Do About Them - Part 2
May 23, 2018

Leon Adato
SolarWinds

Share this

To help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the remaining universal monitoring crimes and what you can do about them:

Start with Universal Monitoring Crimes and What to Do About Them - Part 1

4. Flapping or sawtoothing alerts

When an alert repeatedly triggers (a device that keeps rebooting itself or processes keep deleting/creating temporary page files so that one moment it's over threshold, the next it's below, for example), that condition is known as flapping or sawtoothing.

What to do about it: These types of alerts have several possible resolutions based on what is supported by your monitoring solution and which best fits the specific situation:

■ GOOD: Suppress events within a window. Ignoring duplicated events within a certain period of time is often all you need to avoid meaningless duplicates.

■ ALSO GOOD: As mentioned previously, add a time delay to allow for self-resolution, avoid false-positives, and eliminate other potential issues that don't necessarily require a remediation response.

■ BETTER: Leverage "Reset" logic. Wait for a reset event before triggering a new alert of the same kind. Avoid making the reset logic merely the reverse of the trigger (if the alert is > 90%, the reset might be 90% for 15 minutes, but it won't reset until it's

■ BEST: Two-way communication with a ticket or alert management system. This is where the monitoring system communicates with the ticket and/or alert tracking system, so you can never cut the same alert for the same device until a human has actively corrected the original problem and closed the ticket.

5. No lab, test, or QA environments for your monitoring system

If your monitoring system is watching and alerting on mission-critical systems within the enterprise, then it is mission critical itself. But despite the fact that many organizations set up a proof-of-concept environment when evaluating monitoring solutions, once the production system is selected and rolled out, they fail to have any type of lab, test, or QA system that is maintained on an ongoing basis to help ensure the system is maintained.

What to do about it: Duh. Implement test, dev, and/or QA installations that serve to ensure your monitoring system has the oversight necessary for a mission-critical application.

■ TEST: An (often temporary) environment where patches and upgrades can be tested before attempting them in production.

■ DEV: An environment that mirrors production in terms of software, but where monitors for new equipment, applications, reports, or alerts can be set up and tested before rolling those solutions to production. And as mentioned earlier, this is the perfect place to also monitor your production monitoring environment.

■ QA: An environment that mirrors the previous version of production, so that if issues are found in production, they can be double-checked to confirm whether the problem was introduced in the last revision.

Note that I'm not implying you necessarily must have all three, but it's worth considering the value of at least one. Because "none" is a really bad choice.

Final thoughts

The rate of technical change in the data center today is rapidly accelerating and traditional data center systems have undergone considerable evolution in a very short period of time. As complexity continues to grow alongside the expectation that an organization's IT department should become ever-more "agile" and continue to deliver a quality end-user experience 24/7 (meaning no glitches, outages, application performance problems, etc.), it's important that IT professionals give monitoring the priority it deserves as a foundational IT discipline.

By understanding and addressing these top universal monitoring crimes, you can ensure your organization receives the benefit of sophisticated, tuned monitoring systems while also enabling a more proactive data center strategy now and in the future.

Leon Adato is a Head Geek at SolarWinds
Share this

The Latest

August 21, 2018

High availability's (HA) primary objective has historically been focused on ensuring continuous operations and performance. HA was built on a foundation of redundancy and failover technologies and methodologies to ensure business continuity in the event of workload spikes, planned maintenance, and unplanned downtime. Today, HA methodologies have been superseded by intelligent workload routing automation (i.e., intelligent availability), in that data and their processing are consistently directed to the proper place at the right time ...

August 20, 2018

You need insight to maximize performance — not inefficient troubleshooting, longer time to resolution, and an overall lack of application intelligence. Steps 5 through 10 will help you maximize the performance of your applications and underlying network infrastructure ...

August 17, 2018

As a Network Operations professional, you know how hard it is to ensure optimal network performance when you’re unsure of how end-user devices, application code, and infrastructure affect performance. Identifying your important applications and prioritizing their performance is more difficult than ever, especially when much of an organization’s web-based traffic appears the same to the network. You need insight to maximize performance — not inefficient troubleshooting, longer time to resolution, and an overall lack of application intelligence. But you can stay ahead. Follow these 10 steps to maximize the performance of your applications and underlying network infrastructure ...

August 16, 2018

IT organizations are constantly trying to optimize operations and troubleshooting activities and for good reason. Let's look at one example for the medical industry. Networked applications, such as electronic medical records (EMR), are vital for hospitals to provide outstanding service to their patients and physicians. However, a networking team can often not be aware of slow response times on the remotely hosted EMR application until a physician or someone else calls in to complain ...

August 15, 2018

In 2014, AWS Lambda introduced serverless architecture. Since then, many other cloud providers have developed serverless options. What’s behind this rapid growth? ...

August 14, 2018

This question is really two questions. The first would be: What's really going on in terms of a confusion of terms? — as we wrestle with AIOps, IT Operational Analytics, big data, AI bots, machine learning, and more generically stated "AI platforms" (… and the list is far from complete). The second might be phrased as: What's really going on in terms of real-world advanced IT analytics deployments — where are they succeeding, and where are they not? This blog will look at both questions as a way of introducing EMA's newest research with data ...

August 13, 2018

Consumers will now trade app convenience for security, according to a study commissioned by F5 Networks, The Curve of Convenience – The Trade-Off between Security and Convenience ...

August 10, 2018

Gartner unveiled the CX Pyramid, a new methodology to test organizations’ customer journeys and forge more powerful experiences that deliver greater customer loyalty and brand advocacy ...

August 09, 2018

Nearly half (48 percent) of consumers report that they currently use, or have used in the past, services of organizations that were involved in a publicly disclosed data breach and, of those, 48 percent have stopped using the services of an organization because of a breach, according to Global State of Digital Trust Survey and Index 2018, a new report from CA Technologies ...

August 08, 2018

Here's the problem: IT teams are in the dark. The only information they have available to them is based on what users decide to tell them about through calls to the help desk ...