Universal Monitoring Crimes and What to Do About Them - Part 2
May 23, 2018

Leon Adato
SolarWinds

Share this

To help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the remaining universal monitoring crimes and what you can do about them:

Start with Universal Monitoring Crimes and What to Do About Them - Part 1

4. Flapping or sawtoothing alerts

When an alert repeatedly triggers (a device that keeps rebooting itself or processes keep deleting/creating temporary page files so that one moment it's over threshold, the next it's below, for example), that condition is known as flapping or sawtoothing.

What to do about it: These types of alerts have several possible resolutions based on what is supported by your monitoring solution and which best fits the specific situation:

■ GOOD: Suppress events within a window. Ignoring duplicated events within a certain period of time is often all you need to avoid meaningless duplicates.

■ ALSO GOOD: As mentioned previously, add a time delay to allow for self-resolution, avoid false-positives, and eliminate other potential issues that don't necessarily require a remediation response.

■ BETTER: Leverage "Reset" logic. Wait for a reset event before triggering a new alert of the same kind. Avoid making the reset logic merely the reverse of the trigger (if the alert is > 90%, the reset might be 90% for 15 minutes, but it won't reset until it's

■ BEST: Two-way communication with a ticket or alert management system. This is where the monitoring system communicates with the ticket and/or alert tracking system, so you can never cut the same alert for the same device until a human has actively corrected the original problem and closed the ticket.

5. No lab, test, or QA environments for your monitoring system

If your monitoring system is watching and alerting on mission-critical systems within the enterprise, then it is mission critical itself. But despite the fact that many organizations set up a proof-of-concept environment when evaluating monitoring solutions, once the production system is selected and rolled out, they fail to have any type of lab, test, or QA system that is maintained on an ongoing basis to help ensure the system is maintained.

What to do about it: Duh. Implement test, dev, and/or QA installations that serve to ensure your monitoring system has the oversight necessary for a mission-critical application.

■ TEST: An (often temporary) environment where patches and upgrades can be tested before attempting them in production.

■ DEV: An environment that mirrors production in terms of software, but where monitors for new equipment, applications, reports, or alerts can be set up and tested before rolling those solutions to production. And as mentioned earlier, this is the perfect place to also monitor your production monitoring environment.

■ QA: An environment that mirrors the previous version of production, so that if issues are found in production, they can be double-checked to confirm whether the problem was introduced in the last revision.

Note that I'm not implying you necessarily must have all three, but it's worth considering the value of at least one. Because "none" is a really bad choice.

Final thoughts

The rate of technical change in the data center today is rapidly accelerating and traditional data center systems have undergone considerable evolution in a very short period of time. As complexity continues to grow alongside the expectation that an organization's IT department should become ever-more "agile" and continue to deliver a quality end-user experience 24/7 (meaning no glitches, outages, application performance problems, etc.), it's important that IT professionals give monitoring the priority it deserves as a foundational IT discipline.

By understanding and addressing these top universal monitoring crimes, you can ensure your organization receives the benefit of sophisticated, tuned monitoring systems while also enabling a more proactive data center strategy now and in the future.

Leon Adato is a Head Geek at SolarWinds
Share this

The Latest

June 22, 2021

Your employees aren't coming back to the office, at least not in the traditional sense. The pandemic shifted almost all industries into remote work. And according to the results of Ivanti's Everywhere Workplace survey, they're not interested in going back to the way things once were ...

June 21, 2021

Respondents to an OpsRamp survey are moving forward with digital transformation, but many are re-evaluating the number and type of tools they're using. There are three main takeaways from the survey ...

June 17, 2021

More and more mainframe decision makers are becoming aware that the traditional way of handling mainframe operations will soon fall by the wayside. The ever-growing demand for newer, faster digital services has placed increased pressure on data centers to keep up as new applications come online, the volume of data handled continually increases, and workloads become increasingly unpredictable. In a recent Forrester Consulting AIOps survey, commissioned by BMC, the majority of respondents cited that they spend too much time reacting to incidents and not enough time finding ways to prevent them ...

June 16, 2021

In the age of digital transformation, enterprises are migrating to open source software (OSS) in droves to streamline operations and improve customer and employee experiences. However, to unlock the deluge of OSS benefits, it's not enough for organizations to simply implement the software. They must take the necessary steps to build an intentional OSS strategy rooted in ongoing third-party support and training ...

June 15, 2021

In Part 1 of this series, we explored the top pain points associated with managing Internet-based WANs today. This second installment will focus on today's most prevalent SD-WAN deployment challenges specifically and what you can do to better manage modern WANs overall ...

June 14, 2021

Enterprise wide-area networks (WANs) have undergone an incredible transformation over the past several years. More often than not, they're hybrid, offering multiple connection paths between WANs. This provides many benefits but also makes them more challenging to manage than ever before. In Part 1 of this series, we'll explore the top pain points associated with Internet-based WANs ...

June 10, 2021

As we have seen during this digital transformation boom during the pandemic, technologists are managing more applications and data than ever before, which has led three quarters of technologists to be concerned with increased IT complexity. Even more significant, 89% admitted to feeling under immense pressure to keep up with the churn, according to the recent AppDynamics Agents of Transformation report. It's clear that the pandemic has pushed many technologists to their breaking point. To help tackle IT burnout, tech professionals need a "canary" to help them streamline and catch the anomalies before they cause any major performance issues ...

June 09, 2021

An hour-long outage this Tuesday ground the Internet to a halt after popular Content Delivery Network (CDN) provider, Fastly, experienced a glitch that downed Reddit, Spotify, HBO Max, Shopify, Stripe and the BBC, to name just a few of properties affected ...

June 08, 2021

Digital experience has existed for a while now. We have now begun to scratch the surface to measure it. So that calls for Digital Experience Monitoring (DEM). DEM extends Application Performance Monitoring (APM) and Network Performance Management (NPM) to view and optimize application performance issues from the end-user perspective ...

June 07, 2021

The rising adoption of cloud-native architectures, DevOps, and agile methodologies has broken traditional approaches to application security, according to Precise, automatic risk and impact assessment is key for DevSecOps, a new report from Dynatrace, based on an independent global survey of 700 CISOs ...