Manage the Performance of Virtual Environments Using Dynamic Alerts
June 16, 2014

Karthik Ramachandran
SolarWinds

As we know, virtual environments consist of many moving pieces and are generally complex to set up. Depending on the size of the organization, an IT environment can have anywhere from a handful of VMs to several hundred. For virtual infrastructure deployments of any size, it helps to monitor VM performance and usage. It's equally important to ensure that the health of your virtual appliances is always in check and to know immediately when something goes wrong.

What you really don't want is to have alerts paging you 24/7, especially for situations that aren't critical. Alert management can be a subtle but dangerous activity. Manually setting alert thresholds can be an extremely time-consuming task. Alternatively, using static thresholds that don't reflect real performance problems often results in alert storms, after which administrators stop watching alerts carefully. This is where the "dangerous" part comes in: truly critical alerts can be lost in the noise and missed. As a result, intelligent, dynamic alerting can be critical for both staff efficiency and system reliability.

False Alerts: Reasons Why You Get Them and How to Avoid Them

Here are a few reasons why your virtual environment may trigger alerts more frequently than normal:

■ Frequently occurring events, such as changes in resource consumption, can trigger alerts more often than events from most other virtual components.

■ You can get "spam" alerts from VMs or hosts that are no longer in use or that have been decommissioned.

■ Not properly tuning threshold levels can lead to a sudden spike in alerts.

Having intelligent alerting processes helps ensure that irrelevant alerts are not generated. This gives virtual admins time to look at "real" alerts and fix the underlying problems. Here's what you can do to avoid alerting errors:

■ Set up alerts for specific VMs that you think are really going to impact your users or your business.

■ Leverage dynamic thresholds based on historical baseline trends whenever possible to set more realistic thresholds for your clusters, hosts, VMs, and datastores.

■ Establish well-defined threshold settings. This way you can optimize the kind of alerts you receive during the day and ensure that you're not bothered after work hours.

■ Set the right dependencies to significantly lower the number of alerts you receive.

■ Forward specific alerts to the teams responsible for them, since they understand the severity of the alert and can fix the problem right away.
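
As an illustration of the dependency point above, here is a minimal Python sketch of how a monitoring script might suppress VM alerts when the host those VMs run on is already alerting. The `Alert` class, the `parent_of` mapping, and the `suppress_dependent_alerts` function are hypothetical names invented for this example. They do not come from any particular monitoring product, which will usually offer its own way to define dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str   # hypothetical: name of the VM, host, or datastore that raised the alert
    message: str


def suppress_dependent_alerts(alerts, parent_of):
    """Keep only alerts whose parent chain contains no other alerting object.

    parent_of maps a child object to the object it depends on,
    for example a VM to its host (an assumed, illustrative structure).
    """
    alerting = {a.source for a in alerts}
    kept = []
    for alert in alerts:
        node = parent_of.get(alert.source)
        seen = set()          # guards against accidental cycles in the mapping
        suppressed = False
        while node is not None and node not in seen:
            if node in alerting:
                suppressed = True
                break
            seen.add(node)
            node = parent_of.get(node)
        if not suppressed:
            kept.append(alert)
    return kept


# Illustrative data only: two VMs on host-1, one VM on host-2.
parent_of = {"vm-a": "host-1", "vm-b": "host-1", "vm-c": "host-2"}
alerts = [
    Alert("host-1", "host unreachable"),
    Alert("vm-a", "VM unreachable"),
    Alert("vm-b", "VM unreachable"),
    Alert("vm-c", "high CPU usage"),
]
remaining = suppress_dependent_alerts(alerts, parent_of)
for alert in remaining:
    print(alert.source, "-", alert.message)
```

In this sketch, the two VM alerts under the failed host are folded into the single host alert, while the unrelated alert on `vm-c` still gets through.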

Determine What to Monitor and Why

Most admins have to monitor hundreds of virtual appliances, which means you're probably dealing with plenty of alerts. Under these circumstances you'll have to determine a few things:

■ Go over each host to see if all VMs under the host must be monitored or if only a few critical VMs need to be monitored for alerts.

■ Talk to your business groups or users to understand what the impact of a problem will be. This will give you a sense of how many VMs and datastores have to be set up for alerts. They may have mission-critical applications running inside them, and problems there may affect business performance.

Statistical Thresholds: A Better Way to Set Baseline Values for Your Virtual Environment

Normally, you would have to monitor the performance of hosts, VMs, and datastores for several weeks to learn what the ideal baseline is before setting warning and critical thresholds. However, integrated virtualization management tools can automatically calculate the performance of clusters, hosts, VMs, and datastores and determine the baseline values.

Statistical thresholds allow you to do the following:

■ Applying thresholds to clusters, hosts, VMs, and datastores.

■ Understanding baseline statistics using standard deviation calculations for day and night system performance.

■ Gaining statistical insights into performance metrics and how they vary over time. Look at how stats are collected for higher and lower threshold values for individual VMs and hosts.

■ Calculating thresholds from historical performance data, which saves time in adjusting thresholds and provides more intelligent alerts.

■ Setting the right threshold values using the built-in baseline calculator. This calculates and applies the recommended threshold values for warning and critical stages for clusters, hosts, VMs, and datastores.
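
To make the idea of a statistical baseline more concrete, the following Python sketch derives warning and critical thresholds from historical samples using the mean and standard deviation, calculated separately for day and night. It is only an illustration under stated assumptions: the function names, the 8:00 to 20:00 definition of "day", and the multipliers of 2 and 3 standard deviations are choices made for this example, not the formula used by any specific baseline calculator.

```python
from statistics import mean, stdev


def baseline_thresholds(samples, warning_sigmas=2.0, critical_sigmas=3.0):
    """Derive thresholds from historical samples (needs at least two values).

    The sigma multipliers are assumed defaults for illustration only.
    """
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation
    return {
        "mean": mu,
        "stddev": sigma,
        "warning": mu + warning_sigmas * sigma,
        "critical": mu + critical_sigmas * sigma,
    }


def day_night_thresholds(readings, day_start=8, day_end=20):
    """Split (hour, value) readings into day and night and baseline each.

    The day window of 08:00 to 20:00 is an assumption for this sketch.
    """
    day = [value for hour, value in readings if day_start <= hour < day_end]
    night = [value for hour, value in readings if not (day_start <= hour < day_end)]
    return {
        "day": baseline_thresholds(day),
        "night": baseline_thresholds(night),
    }


# Made-up CPU usage percentages for one VM, as (hour of day, value) pairs.
readings = [(9, 40), (12, 50), (15, 60), (1, 10), (3, 20), (23, 30)]
thresholds = day_night_thresholds(readings)
for period, values in thresholds.items():
    print(period, "warning:", values["warning"], "critical:", values["critical"])
```

Because the night baseline is computed from quieter hours, a level of usage that is normal at noon can still raise a warning at 3:00 a.m., which is the kind of behavior a single static threshold cannot express.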

While this won't completely eliminate "spam" alerts, it will quickly get you down to a much smaller set for administrators to deal with. In turn, it will let them spend more time and attention on striking the balance between monitoring VM usage and hypervisor performance and setting the right threshold values.

Karthik Ramachandran is Product Marketing Specialist at SolarWinds.
