Weathering Event Storms and Alert Floods
Actionable Alerting for the Cloud and Dynamic Datacenter
August 23, 2011
Floyd Strimling
Share this

It seems that everyone in IT has caught “Cloud-fever,” as Enterprises and Service Providers alike race to revamp their architectures and offerings to take advantage of this great IT inflection point. However, lost within the technology is the reality that someone is responsible for keeping the Cloud up and running. That someone is usually Operations personnel along with their fellow Systems, Network, Storage, and Security Engineers. The lifeline of these dedicated individuals is a unified monitoring and eventing system with a goal of providing relevant, functional, and timely alerts.

To accomplish this goal, IT Operations must have the ability to effectively monitor the entire datacenter, and to provide high-quality data to the eventing system. As the saying goes, “garbage-in, garbage-out,” and no degree of filtering or pre-processing will alleviate this problem. In the end, the monitoring data that is collected is turned into events that are processed by the eventing system independent of the alerting mechanisms. This allows common techniques such as correlation, filtering, and suppression to take place prior to an alert being issued.

Herein lies the first challenge. How do you take an event storm with tens, hundreds, or even thousands of events and turn it into a single relevant event and subsequent alert? Rules-based correlation engines of the past cannot keep pace with the high rate of change within the dynamic datacenter. Instead, a new approach is needed that views the infrastructure as services instead of individually monitored components, and provides a service assurance layer to IT Operations and other business stakeholders. Assuming that the first challenge is overcome, it is time to design an alerting solution.

Careful consideration must be made to the purpose of the alert being processed. For example, is it an informational alert to the customer regarding a service issue, or is it an operational alert to a system administrator to fix an issue? Are any automated actions being used such as restarting a Windows service or Linux process? Is there integration to a service desk such as ServiceNow? Is the alert a high priority issue for revenue generation such as a customer issue or an internal issue?

Herein lies the second challenge -- alert floods. Alert floods fill your pager/email/phone with alerts that have either already been acknowledged or are irrelevant. Perhaps there is nothing more frustrating than getting an alert from a device that you are in the process of working on or have placed into maintenance. Many Operations personnel have a special folder or rule to take care of this, but this may actually cause them to miss relevant alerts. Operations personnel must trust that the alerts they receive are valid and require their immediate attention.

To accomplish this, only an intelligent solution that provides granular control over the alerts will eliminate this issue. Unlike the event storms discussed earlier, alerting lends itself to granular filtering, time-based policies, and escalation rules. The key is to have an eventing system that provides well-formed events that can be filtered against via a set of flexible and powerful rules. For example, an alert is only sent out if the automated action failed and the event has not been acknowledged for ten minutes. If the subsequent alert is not cleared within another ten minutes, the alert is resent only this time it goes to operations management. Finally, alerts should have the ability to be subscribed to and shared among your IT staff.

Alerting for the Cloud and dynamic datacenter requires IT organizations to re-examine how they deliver, monitor, and alert on vital services. IT Operations has minutes to respond to issues that could take down tens, hundreds, or thousands of virtual servers, impacting the business in ways we have never seen before. Accepting a console full of “Red” or a pager/phone/email full of useless alerts is a recipe for disaster. However, with proper planning and re-evaluation of your current People, Process, and Solutions, IT Operations will be able to meet demands and keep the Cloud running.

About Floyd Strimling

Floyd Strimling is a Technology Evangelist at Zenoss, who enjoys creating, debating, and following technology trends with the goal of making them a reality. Strimling’s unique background spans both hardware and software environments with experience in Cloud Computing/Autonomic Computing, Datacenter Automation, Virtualization, Networking and Security.

Related Links:

Zenoss Service Dynamics Now Supports IPv6

Share this

The Latest

September 29, 2020

More than 80% of organizations have experienced a significant increase in pressure on digital services since the start of the COVID-19 pandemic, according to a new study conducted by PagerDuty ...

September 28, 2020

In Episode 9, Sean McDermott, President, CEO and Founder of Windward Consulting Group, joins the AI+ITOPS Podcast to discuss how the pandemic has impacted IT and is driving the need for AIOps ...

September 25, 2020

Michael Olson on the AI+ITOPS Podcast: "I really see AIOps as being a core requirement for observability because it ... applies intelligence to your telemetry data and your incident data ... to potentially predict problems before they happen."

September 24, 2020

Enterprise ITOM and ITSM teams have been welcoming of AIOps, believing that it has the potential to deliver great value to them as their IT environments become more distributed, hybrid and complex. Not so with DevOps teams. It's safe to say they've kept AIOps at arm's length, because they don't think it's relevant nor useful for what they do. Instead, to manage the software code they develop and deploy, they've focused on observability ...

September 23, 2020

The post-pandemic environment has resulted in a major shift on where SREs will be located, with nearly 50% of SREs believing they will be working remotely post COVID-19, as compared to only 19% prior to the pandemic, according to the 2020 SRE Survey Report from Catchpoint and the DevOps Institute ...