As we know, virtual environments consist of many moving pieces and are generally complex to setup. Typically, IT environments, depending on the size of the organization, can have several hundred VMs down to a handful of VMs. For such virtual infrastructure deployments, it helps to monitor the performance of VM and VM usage. It's also equally important to ensure the health of your virtual appliances are always in check and to immediately know when something goes wrong.
What you really don't want is to have alerts paging you 24/7, especially when they're not critical situations. Alert management can be a subtle, but dangerous activity. Additionally, manually setting alert thresholds can be an extremely time consuming task. Alternatively, using static thresholds that don't reflect real performance problems often result in alert storms, where administrators stop watching alerts carefully. This is where the "dangerous" part comes in and often true critical alerts can be lost in the noise and missed. As a result, intelligent, dynamic alerting can be critical for both staff efficiency and system reliability.
False Alerts: Reasons Why You Get Them and How to Avoid Them
Here are a few examples why your virtual environment may trigger alerts more frequently than normal:
■ Events that frequently occur, such as resource consumption can trigger alerts more often than most other virtual components.
■ You can get "spam" alerts from VMs or hosts that are no longer in use or that have been discharged.
■ Not properly tuning threshold levels can lead to a sudden spike in alerts.
Having intelligent alerting processes help ensure irrelevant alerts are not generated. This gives virtual admins time to look at "real" alerts and fix them. Here's what you can do to avoid alerting errors:
■ Set up alerts for specific VMs that you think are really going to impact your users or your business.
■ Leverage dynamic thresholds based on historical baseline trends whenever possible to set more realistic thresholds for your clusters, hosts, VMs, and datastore.
■ Establish well-defined threshold settings—this way you can optimize the kind of alerts you receive during the day and ensure that you're not bothered after work hours.
■ Set the right dependencies to significantly lower the amount of alerts you receive.
■ Forward specific alerts to the defined teams, since they understand the severity of the alert and can fix it right away.
Determine What to Monitor and Why
Most admins have to monitor hundreds of virtual appliances, which means you're probably dealing with plenty of alerts. Under these circumstances you'll have to determine a few things:
■ Go over each host to see if all VMs under the host must be monitored or if only a few critical VMs need to be monitored for alerts.
■ Talk to your business groups or users and understand what the impact will be. This will give you a sense of how many VMs and datastores have to be setup for alerts. They may have mission critical applications running inside them, which may affect business performance.
Statistical Thresholds: A Better Way to Set Baseline Values for your Virtual Environment
Normally, you would have to monitor the performance of hosts, VMs, and datastores for several weeks in order to know what the ideal or optimum baseline is to set warning and critical thresholds. However, integrated virtualization management tools can automatically calculate performance of clusters, hosts, VMs, and datastores and determine the baseline values.
IStatistical thresholds allow you to look at the following processes:
■ Applying thresholds to clusters, hosts, VMs, and datastores.
■ Understanding baseline statistics using standard deviation calculation for day and night system performance.
■ Gaining statistical insights into performance metrics and how they vary over time. Look at how stats are collected for higher and lower threshold values for individual VMs and hosts.
■ Calculating thresholds from historical performance data saves time in adjusting thresholds and provides more intelligent alerts.
■ Setting the right threshold values using the built-in baseline calculator. This calculates and applies the recommended threshold values for warning and critical stages for clusters, hosts, VMs, and datastores.
While this won't completely eliminate "spam" alerts, it will quickly let you get to a much smaller set for the administrator to deal with. In turn, it will let them spend more time and attention on striking that balance between monitoring your VM usage and hypervisor performance, and setting the right threshold values.
Karthik Ramachandran is Product Marketing Specialist at SolarWinds.
Organizations face major infrastructure and security challenges in supporting multi-cloud and edge deployments, according to new global survey conducted by Propeller Insights for Volterra ...
Developers spend roughly 17.3 hours each week debugging, refactoring and modifying bad code — valuable time that could be spent writing more code, shipping better products and innovating. The bottom line? Nearly $300B (US) in lost developer productivity every year ...
While remote work policies have been gaining steam for the better part of the past decade across the enterprise space — driven in large part by more agile and scalable, cloud-delivered business solutions — recent events have pushed adoption into overdrive ...
Time-critical, unplanned work caused by IT disruptions continues to plague enterprises around the world, leading to lost revenue, significant employee morale problems and missed opportunities to innovate, according to the State of Unplanned Work Report 2020, conducted by Dimensional Research for PagerDuty ...
In today's iterative world, development teams care a lot more about how apps are running. There's a demand for fixing actionable items. Developers want to know exactly what's broken, what to fix right now, and what can wait. They want to know, "Do we build or fix?" This trade-off between building new features versus fixing bugs is one of the key factors behind the adoption of Application Stability management tools ...
With the rise of mobile apps and iterative development releases, Application Stability has answered the widespread need to monitor applications in a new way, shifting the focus from servers and networks to the customer experience. The emergence of Application Stability has caused some consternation for diehard APM fans. However, these two solutions embody very distinct monitoring focuses, which leads me to believe there's room for both tools, as well as different teams for both ...
The 2019 State of E-Commerce Infrastructure Report, from Webscale, analyzes findings from a comprehensive survey of more than 450 ecommerce professionals regarding how their online stores performed during the 2019 holiday season. Some key insights from the report include ...
Robinhood is a unicorn startup that has been disrupting the way by which many millennials have been investing and managing their money for the past few years. For Robinhood, the burden of proof was to show that they can provide an infrastructure that is as scalable, reliable and secure as that of major banks who have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when the market volatility brought about a set of edge cases that brought Robinhood's trading app to its knees ...
Application backend monitoring is the key to acquiring visibility across the enterprise's application stack, from the application layer and underlying infrastructure to third-party API services, web servers and databases, be they on-premises, in a public or private cloud, or in a hybrid model. By tracking and reporting performance in real time, IT teams can ensure applications perform at peak efficiency — and guarantee a seamless customer experience. How can IT operations teams improve application backend monitoring? By embracing artificial intelligence for operations — AIOps ...
In 2020, DevOps teams will face heightened expectations for higher speed and frequency of code delivery, which means their IT environments will become even more modular, ephemeral and dynamic — and significantly more complicated to monitor. As a result, AIOps will further cement its position as the most effective technology that DevOps teams can use to see and control what's going on with their applications and their underlying infrastructure, so that they can prevent outages. Here I outline five key trends to watch related to how AIOps will impact DevOps in 2020 and beyond ...