Universal Monitoring Crimes and What to Do About Them - Part 2
May 23, 2018

Leon Adato
SolarWinds


To help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the remaining universal monitoring crimes and what you can do about them:

Start with Universal Monitoring Crimes and What to Do About Them - Part 1

4. Flapping or sawtoothing alerts

When an alert triggers repeatedly (for example, a device that keeps rebooting itself, or a process that keeps deleting and recreating temporary page files so that the monitored value is over the threshold one moment and below it the next), that condition is known as flapping or sawtoothing.

What to do about it: These types of alerts have several possible resolutions based on what is supported by your monitoring solution and which best fits the specific situation:

■ GOOD: Suppress events within a window. Ignoring duplicated events within a certain period of time is often all you need to avoid meaningless duplicates.

■ ALSO GOOD: As mentioned previously, add a time delay to allow for self-resolution, avoid false-positives, and eliminate other potential issues that don't necessarily require a remediation response.

■ BETTER: Leverage "Reset" logic. Wait for a reset event before triggering a new alert of the same kind. Avoid making the reset logic merely the reverse of the trigger (if the alert is > 90%, the reset might be 90% for 15 minutes, but it won't reset until it's

■ BEST: Two-way communication with a ticket or alert management system. This is where the monitoring system communicates with the ticket and/or alert tracking system, so the same alert can't be cut again for the same device until a human has actively corrected the original problem and closed the ticket.
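
As a rough illustration of the time-delay and reset-logic options above, here is a minimal Python sketch. It is not taken from any particular monitoring product: the `FlapGuard` class, the `replay` helper, the 5-minute trigger delay, and the 80% reset level are all assumptions chosen for the example. Only the 90% trigger and the 15-minute reset hold echo the numbers used in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlapGuard:
    """Alert state for one metric on one device (hypothetical helper)."""
    trigger_above: float = 90.0     # alert when the value exceeds this...
    trigger_hold_s: float = 300.0   # ...continuously for this long (assumed delay)
    reset_below: float = 80.0       # assumed reset level, not just the reverse of the trigger
    reset_hold_s: float = 900.0     # value must stay below reset level for 15 minutes
    active: bool = False            # True while an alert is open
    _since: Optional[float] = None  # when the pending condition started holding

    def observe(self, value: float, now: float) -> Optional[str]:
        """Feed one sample; return "ALERT", "RESET", or None."""
        if self.active:
            pending = value < self.reset_below
            hold, event = self.reset_hold_s, "RESET"
        else:
            pending = value > self.trigger_above
            hold, event = self.trigger_hold_s, "ALERT"

        if not pending:
            self._since = None      # condition broken, restart the clock
            return None
        if self._since is None:
            self._since = now
        if now - self._since < hold:
            return None             # condition holds, but not for long enough yet
        self.active = not self.active
        self._since = None
        return event


def replay(guard: FlapGuard, samples: List[float], step_s: float = 60.0) -> List[str]:
    """Feed evenly spaced samples and collect the events that fire."""
    events = []
    for i, value in enumerate(samples):
        event = guard.observe(value, now=i * step_s)
        if event:
            events.append(event)
    return events
```

Replaying a value that bounces between 95 and 85 once a minute produces no events at all, because the trigger condition never holds long enough. Once an alert is open, the same bouncing does not reopen or close it; only a sustained drop below the lower reset level does.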

5. No lab, test, or QA environments for your monitoring system

If your monitoring system is watching and alerting on mission-critical systems within the enterprise, then it is mission critical itself. Yet although many organizations set up a proof-of-concept environment when evaluating monitoring solutions, once the production system is selected and rolled out, they fail to keep any type of lab, test, or QA system running on an ongoing basis to help ensure the production system is properly maintained.

What to do about it: Duh. Implement test, dev, and/or QA installations that serve to ensure your monitoring system has the oversight necessary for a mission-critical application.

■ TEST: An (often temporary) environment where patches and upgrades can be tested before attempting them in production.

■ DEV: An environment that mirrors production in terms of software, but where monitors for new equipment, applications, reports, or alerts can be set up and tested before rolling those solutions to production. And as mentioned earlier, this is the perfect place to also monitor your production monitoring environment.

■ QA: An environment that mirrors the previous version of production, so that if issues are found in production, they can be double-checked to confirm whether the problem was introduced in the last revision.

Note that I'm not implying you necessarily must have all three, but it's worth considering the value of at least one. Because "none" is a really bad choice.

Final thoughts

The rate of technical change in the data center today is rapidly accelerating and traditional data center systems have undergone considerable evolution in a very short period of time. As complexity continues to grow alongside the expectation that an organization's IT department should become ever-more "agile" and continue to deliver a quality end-user experience 24/7 (meaning no glitches, outages, application performance problems, etc.), it's important that IT professionals give monitoring the priority it deserves as a foundational IT discipline.

By understanding and addressing these top universal monitoring crimes, you can ensure your organization receives the benefit of sophisticated, tuned monitoring systems while also enabling a more proactive data center strategy now and in the future.

Leon Adato is a Head Geek at SolarWinds