Weathering Event Storms and Alert Floods
Actionable Alerting for the Cloud and Dynamic Datacenter
August 23, 2011
Floyd Strimling
Share this

It seems that everyone in IT has caught “Cloud-fever,” as Enterprises and Service Providers alike race to revamp their architectures and offerings to take advantage of this great IT inflection point. However, lost within the technology is the reality that someone is responsible for keeping the Cloud up and running. That someone is usually Operations personnel along with their fellow Systems, Network, Storage, and Security Engineers. The lifeline of these dedicated individuals is a unified monitoring and eventing system with a goal of providing relevant, functional, and timely alerts.

To accomplish this goal, IT Operations must have the ability to effectively monitor the entire datacenter, and to provide high-quality data to the eventing system. As the saying goes, “garbage-in, garbage-out,” and no degree of filtering or pre-processing will alleviate this problem. In the end, the monitoring data that is collected is turned into events that are processed by the eventing system independent of the alerting mechanisms. This allows common techniques such as correlation, filtering, and suppression to take place prior to an alert being issued.

Herein lies the first challenge. How do you take an event storm with tens, hundreds, or even thousands of events and turn it into a single relevant event and subsequent alert? Rules-based correlation engines of the past cannot keep pace with the high rate of change within the dynamic datacenter. Instead, a new approach is needed that views the infrastructure as services instead of individually monitored components, and provides a service assurance layer to IT Operations and other business stakeholders. Assuming that the first challenge is overcome, it is time to design an alerting solution.

Careful consideration must be made to the purpose of the alert being processed. For example, is it an informational alert to the customer regarding a service issue, or is it an operational alert to a system administrator to fix an issue? Are any automated actions being used such as restarting a Windows service or Linux process? Is there integration to a service desk such as ServiceNow? Is the alert a high priority issue for revenue generation such as a customer issue or an internal issue?

Herein lies the second challenge -- alert floods. Alert floods fill your pager/email/phone with alerts that have either already been acknowledged or are irrelevant. Perhaps there is nothing more frustrating than getting an alert from a device that you are in the process of working on or have placed into maintenance. Many Operations personnel have a special folder or rule to take care of this, but this may actually cause them to miss relevant alerts. Operations personnel must trust that the alerts they receive are valid and require their immediate attention.

To accomplish this, only an intelligent solution that provides granular control over the alerts will eliminate this issue. Unlike the event storms discussed earlier, alerting lends itself to granular filtering, time-based policies, and escalation rules. The key is to have an eventing system that provides well-formed events that can be filtered against via a set of flexible and powerful rules. For example, an alert is only sent out if the automated action failed and the event has not been acknowledged for ten minutes. If the subsequent alert is not cleared within another ten minutes, the alert is resent only this time it goes to operations management. Finally, alerts should have the ability to be subscribed to and shared among your IT staff.

Alerting for the Cloud and dynamic datacenter requires IT organizations to re-examine how they deliver, monitor, and alert on vital services. IT Operations has minutes to respond to issues that could take down tens, hundreds, or thousands of virtual servers, impacting the business in ways we have never seen before. Accepting a console full of “Red” or a pager/phone/email full of useless alerts is a recipe for disaster. However, with proper planning and re-evaluation of your current People, Process, and Solutions, IT Operations will be able to meet demands and keep the Cloud running.

About Floyd Strimling

Floyd Strimling is a Technology Evangelist at Zenoss, who enjoys creating, debating, and following technology trends with the goal of making them a reality. Strimling’s unique background spans both hardware and software environments with experience in Cloud Computing/Autonomic Computing, Datacenter Automation, Virtualization, Networking and Security.

Related Links:

Zenoss Service Dynamics Now Supports IPv6

Share this

The Latest

October 21, 2021

Scaling DevOps and SRE practices is critical to accelerating the release of high-quality digital services. However, siloed teams, manual approaches, and increasingly complex tooling slow innovation and make teams more reactive than proactive, impeding their ability to drive value for the business, according to a new report from Dynatrace, Deep Cloud Observability and Advanced AIOps are Key to Scaling DevOps Practices ...

October 20, 2021

Over three quarters (79%) of database professionals are now using either a paid-for or in-house monitoring tool, according to a new survey from Redgate Software ...

October 19, 2021

Gartner announced the top strategic technology trends that organizations need to explore in 2022. With CEOs and Boards striving to find growth through direct digital connections with customers, CIOs' priorities must reflect the same business imperatives, which run through each of Gartner's top strategic tech trends for 2022 ...

October 18, 2021

Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years ...

October 14, 2021

Businesses are embracing artificial intelligence (AI) technologies to improve network performance and security, according to a new State of AIOps Study, conducted by ZK Research and Masergy ...