As we know, virtual environments consist of many moving pieces and are generally complex to set up. Depending on the size of the organization, an IT environment can have anywhere from a handful of VMs to several hundred. For such virtual infrastructure deployments, it helps to monitor VM performance and VM usage. It's equally important to keep the health of your virtual appliances in check and to know immediately when something goes wrong.
What you really don't want is to have alerts paging you 24/7, especially when the situations aren't critical. Alert management can be a subtle but dangerous activity. Manually setting alert thresholds can be an extremely time-consuming task. On the other hand, static thresholds that don't reflect real performance problems often result in alert storms, and administrators stop watching alerts carefully. This is where the "dangerous" part comes in: truly critical alerts can be lost in the noise and missed. As a result, intelligent, dynamic alerting can be critical for both staff efficiency and system reliability.
False Alerts: Reasons Why You Get Them and How to Avoid Them
Here are a few examples of why your virtual environment may trigger alerts more frequently than normal:
■ Frequently occurring events, such as high resource consumption, can trigger alerts more often than other conditions in the virtual environment.
■ You can get "spam" alerts from VMs or hosts that are no longer in use or that have been decommissioned.
■ Not properly tuning threshold levels can lead to a sudden spike in alerts.
Having intelligent alerting processes helps ensure that irrelevant alerts are not generated. This gives virtual admins time to look at "real" alerts and fix the underlying problems. Here's what you can do to avoid alerting errors:
■ Set up alerts for specific VMs that you think are really going to impact your users or your business.
■ Leverage dynamic thresholds based on historical baseline trends whenever possible to set more realistic thresholds for your clusters, hosts, VMs, and datastores.
■ Establish well-defined threshold settings. This way you can optimize the kind of alerts you receive during the day and ensure that you're not bothered after work hours.
■ Set the right dependencies to significantly lower the number of alerts you receive.
■ Forward specific alerts to the defined teams, since they understand the severity of the alert and can fix it right away.
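The dependency idea in the list above can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring product implements it: the function name `suppress_dependent_alerts`, the alert dictionaries with a `source` key, and the `parent_of` mapping (VM to host, host to cluster) are all assumptions made for the example. The sketch drops an alert when something the source depends on is already alerting, so a failed host produces one alert instead of one per VM.

```python
def suppress_dependent_alerts(alerts, parent_of):
    """Keep only alerts whose ancestors are not already alerting.

    alerts    -- list of dicts, each with a "source" key (hypothetical shape)
    parent_of -- dict mapping a component to what it depends on,
                 e.g. VM -> host, host -> cluster (hypothetical shape)
    """
    alerting = {alert["source"] for alert in alerts}
    kept = []
    for alert in alerts:
        parent = parent_of.get(alert["source"])
        visited = set()  # guards against accidental cycles in the mapping
        suppressed = False
        # Walk up the dependency chain looking for an alerting ancestor.
        while parent is not None and parent not in visited:
            if parent in alerting:
                suppressed = True
                break
            visited.add(parent)
            parent = parent_of.get(parent)
        if not suppressed:
            kept.append(alert)
    return kept


# Example: host-a is down, so alerts from its VMs add no new information.
parent_of = {
    "vm-1": "host-a",
    "vm-2": "host-a",
    "vm-3": "host-b",
    "host-a": "cluster-1",
    "host-b": "cluster-1",
}
alerts = [
    {"source": "host-a", "message": "host unreachable"},
    {"source": "vm-1", "message": "VM unreachable"},
    {"source": "vm-2", "message": "VM unreachable"},
    {"source": "vm-3", "message": "CPU usage high"},
]
remaining = suppress_dependent_alerts(alerts, parent_of)
```

In this example only the `host-a` and `vm-3` alerts remain, since `vm-1` and `vm-2` depend on a host that is already alerting.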
Determine What to Monitor and Why
Most admins have to monitor hundreds of virtual appliances, which means you're probably dealing with plenty of alerts. Under these circumstances you'll have to determine a few things:
■ Go over each host to see if all VMs under the host must be monitored or if only a few critical VMs need to be monitored for alerts.
■ Talk to your business groups or users to understand what the impact of a problem would be. This will give you a sense of how many VMs and datastores have to be set up for alerts. Some may be running mission-critical applications, and problems with those can affect business performance.
Statistical Thresholds: A Better Way to Set Baseline Values for your Virtual Environment
Normally, you would have to monitor the performance of hosts, VMs, and datastores for several weeks in order to know what the ideal or optimum baseline is to set warning and critical thresholds. However, integrated virtualization management tools can automatically calculate performance of clusters, hosts, VMs, and datastores and determine the baseline values.
Statistical thresholds allow you to look at the following processes:
■ Applying thresholds to clusters, hosts, VMs, and datastores.
■ Understanding baseline statistics using standard deviation calculation for day and night system performance.
■ Gaining statistical insights into performance metrics and how they vary over time. Look at how stats are collected for higher and lower threshold values for individual VMs and hosts.
■ Calculating thresholds from historical performance data, which saves time in adjusting thresholds and provides more intelligent alerts.
■ Setting the right threshold values using the built-in baseline calculator. This calculates and applies the recommended threshold values for warning and critical stages for clusters, hosts, VMs, and datastores.
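To make the standard deviation idea above concrete, here is a minimal Python sketch of one common way to derive thresholds from historical data: set the warning level at the mean plus two standard deviations and the critical level at the mean plus three, calculated separately for day and night. The multipliers, the 08:00 to 18:00 definition of "day", and the function names are assumptions chosen for illustration. They are not taken from any specific product's baseline calculator.

```python
from statistics import mean, stdev


def baseline_thresholds(samples, warn_k=2.0, crit_k=3.0):
    """Return (warning, critical) as mean + k * sample standard deviation.

    warn_k and crit_k are illustrative defaults, not vendor values.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a standard deviation")
    mu = mean(samples)
    sigma = stdev(samples)
    return mu + warn_k * sigma, mu + crit_k * sigma


def day_night_thresholds(readings, day_hours=range(8, 18)):
    """Compute separate thresholds for day and night.

    readings  -- iterable of (hour_of_day, value) pairs
    day_hours -- hours counted as "day" (assumed 08:00 to 17:59 here)
    """
    readings = list(readings)
    day = [value for hour, value in readings if hour in day_hours]
    night = [value for hour, value in readings if hour not in day_hours]
    return {
        "day": baseline_thresholds(day),
        "night": baseline_thresholds(night),
    }


# Example with made-up CPU usage percentages for one VM.
history = [(9, 40), (10, 50), (11, 60), (1, 10), (2, 20), (3, 30)]
thresholds = day_night_thresholds(history)
```

Because night-time load in this sample is lower, the night thresholds come out lower than the day thresholds, so unusual overnight activity is flagged without making daytime alerts noisier. With real percentage metrics you would probably also cap the result at 100.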
While this won't completely eliminate "spam" alerts, it will quickly reduce them to a much smaller set for the administrator to deal with. In turn, administrators can spend more time and attention on striking the balance between monitoring VM usage and hypervisor performance and setting the right threshold values.
Karthik Ramachandran is Product Marketing Specialist at SolarWinds.