Universal Monitoring Crimes and What to Do About Them - Part 2
May 23, 2018

Leon Adato
SolarWinds

To help your organization increase data center efficiency and get the most benefit out of your monitoring solutions, here are the remaining universal monitoring crimes and what you can do about them:

Start with Universal Monitoring Crimes and What to Do About Them - Part 1

4. Flapping or sawtoothing alerts

When an alert triggers repeatedly (for example, a device that keeps rebooting itself, or a process that keeps deleting and creating temporary page files so that a metric is over the threshold one moment and below it the next), the condition is known as flapping or sawtoothing.

What to do about it: These types of alerts have several possible resolutions based on what is supported by your monitoring solution and which best fits the specific situation:

■ GOOD: Suppress events within a window. Ignoring duplicated events within a certain period of time is often all you need to avoid meaningless duplicates.

■ ALSO GOOD: As mentioned previously, add a time delay to allow for self-resolution, avoid false positives, and eliminate other potential issues that don't necessarily require a remediation response.

■ BETTER: Leverage "Reset" logic. Wait for a reset event before triggering a new alert of the same kind. Avoid making the reset logic merely the reverse of the trigger (if the alert is > 90%, the reset might be 90% for 15 minutes, but it won't reset until it's

■ BEST: Two-way communication with a ticket or alert management system. This is where the monitoring system communicates with the ticket and/or alert tracking system, so you can never cut the same alert for the same device until a human has actively corrected the original problem and closed the ticket.
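To make the "ALSO GOOD" and "BETTER" options concrete, here is a minimal Python sketch of a trigger delay combined with reset logic that is not simply the reverse of the trigger. It is not taken from any particular monitoring product. The class name `FlapGuard`, the helper `count_alerts`, and every threshold and duration (90% trigger, 80% reset, 5-minute delay, 15-minute hold) are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FlapGuard:
    """Decides when a metric sample should raise a new alert.

    All thresholds and durations are illustrative assumptions.
    """
    trigger_above: float = 90.0   # alert when the value exceeds this
    reset_below: float = 80.0     # hypothetical reset level, lower than the trigger
    trigger_delay: float = 300.0  # seconds the value must stay high before alerting
    reset_hold: float = 900.0     # seconds the value must stay low before resetting
    active: bool = False          # True while an alert is open
    _high_since: Optional[float] = None
    _low_since: Optional[float] = None

    def observe(self, now: float, value: float) -> bool:
        """Feed one sample; return True only when a new alert should fire."""
        if not self.active:
            if value > self.trigger_above:
                if self._high_since is None:
                    self._high_since = now
                # Time delay: fire only after the condition has persisted.
                if now - self._high_since >= self.trigger_delay:
                    self.active = True
                    self._low_since = None
                    return True
            else:
                # A brief dip restarts the delay, so short spikes self-resolve.
                self._high_since = None
            return False

        # An alert is already open: wait for a sustained reset, never re-fire.
        if value < self.reset_below:
            if self._low_since is None:
                self._low_since = now
            if now - self._low_since >= self.reset_hold:
                self.active = False
                self._high_since = None
        else:
            self._low_since = None
        return False


def count_alerts(guard: FlapGuard, samples: Iterable[float], step: float = 60.0) -> int:
    """Replay evenly spaced samples (one every `step` seconds) and count alerts."""
    fired = 0
    now = 0.0
    for value in samples:
        if guard.observe(now, value):
            fired += 1
        now += step
    return fired


if __name__ == "__main__":
    # Sustained high, then a sawtooth around the trigger, then a real recovery,
    # then a second sustained high period.
    samples = [95.0] * 6 + [85.0, 95.0] * 10 + [70.0] * 16 + [95.0] * 6
    print(count_alerts(FlapGuard(), samples))  # prints 2
```

In this replay the sawtooth between 85% and 95% produces no additional alerts, because the open alert only resets after the value has stayed below the lower reset level for the full hold period. A naive "alert whenever the value is above 90%" rule would have fired on every upswing.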

5. No lab, test, or QA environments for your monitoring system

If your monitoring system is watching and alerting on mission-critical systems within the enterprise, then it is mission critical itself. Yet although many organizations set up a proof-of-concept environment when evaluating monitoring solutions, once the production system is selected and rolled out, they fail to keep any type of lab, test, or QA system running on an ongoing basis to help ensure the production system stays healthy.

What to do about it: Duh. Implement test, dev, and/or QA installations that serve to ensure your monitoring system has the oversight necessary for a mission-critical application.

■ TEST: An (often temporary) environment where patches and upgrades can be tested before attempting them in production.

■ DEV: An environment that mirrors production in terms of software, but where monitors for new equipment, applications, reports, or alerts can be set up and tested before rolling those solutions to production. And as mentioned earlier, this is the perfect place to also monitor your production monitoring environment.

■ QA: An environment that mirrors the previous version of production, so that if issues are found in production, they can be double-checked to confirm whether the problem was introduced in the last revision.

Note that I'm not implying you necessarily must have all three, but it's worth considering the value of at least one. Because "none" is a really bad choice.

Final thoughts

The rate of technical change in the data center today is rapidly accelerating and traditional data center systems have undergone considerable evolution in a very short period of time. As complexity continues to grow alongside the expectation that an organization's IT department should become ever-more "agile" and continue to deliver a quality end-user experience 24/7 (meaning no glitches, outages, application performance problems, etc.), it's important that IT professionals give monitoring the priority it deserves as a foundational IT discipline.

By understanding and addressing these top universal monitoring crimes, you can ensure your organization receives the benefit of sophisticated, tuned monitoring systems while also enabling a more proactive data center strategy now and in the future.

Leon Adato is a Head Geek at SolarWinds