Weathering Event Storms and Alert Floods
Actionable Alerting for the Cloud and Dynamic Datacenter
August 23, 2011
Floyd Strimling

It seems that everyone in IT has caught “Cloud-fever,” as Enterprises and Service Providers alike race to revamp their architectures and offerings to take advantage of this great IT inflection point. However, lost within the technology is the reality that someone is responsible for keeping the Cloud up and running. That someone is usually Operations personnel along with their fellow Systems, Network, Storage, and Security Engineers. The lifeline of these dedicated individuals is a unified monitoring and eventing system with a goal of providing relevant, functional, and timely alerts.

To accomplish this goal, IT Operations must be able to monitor the entire datacenter effectively and feed high-quality data to the eventing system. As the saying goes, “garbage-in, garbage-out,” and no amount of filtering or pre-processing will alleviate this problem. In the end, the collected monitoring data is turned into events, which the eventing system processes independently of the alerting mechanisms. This allows common techniques such as correlation, filtering, and suppression to take place before an alert is issued.

Herein lies the first challenge. How do you take an event storm with tens, hundreds, or even thousands of events and turn it into a single relevant event and subsequent alert? Rules-based correlation engines of the past cannot keep pace with the high rate of change within the dynamic datacenter. Instead, a new approach is needed that views the infrastructure as services instead of individually monitored components, and provides a service assurance layer to IT Operations and other business stakeholders. Assuming that the first challenge is overcome, it is time to design an alerting solution.
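The service-oriented view described above can be illustrated with a minimal Python sketch. It is not how any particular product works; the `Event` fields, the 1-to-5 severity scale, and the `SERVICE_OF` mapping are all hypothetical, chosen only to show a storm of component events collapsing into one event per service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    component: str  # device or process that raised the event
    severity: int   # hypothetical scale: 1 (info) to 5 (critical)
    summary: str


# Hypothetical mapping from monitored components to the service they support.
SERVICE_OF = {
    "web-01": "storefront",
    "web-02": "storefront",
    "db-01": "storefront",
    "mail-01": "email",
}


def correlate_by_service(events):
    """Collapse a storm of component events into one event per service."""
    grouped = defaultdict(list)
    for event in events:
        # Components with no known service are grouped under "unmapped".
        service = SERVICE_OF.get(event.component, "unmapped")
        grouped[service].append(event)

    service_events = {}
    for service, members in grouped.items():
        # The service inherits the severity of its worst component event.
        worst = max(members, key=lambda e: e.severity)
        service_events[service] = {
            "service": service,
            "severity": worst.severity,
            "summary": f"{len(members)} component events; worst: {worst.summary}",
        }
    return service_events
```

In a dynamic datacenter the mapping itself changes constantly, so in practice it would have to be discovered and maintained automatically rather than hard-coded as it is here.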

Careful consideration must be given to the purpose of the alert being processed. For example, is it an informational alert to the customer regarding a service issue, or is it an operational alert to a system administrator to fix an issue? Are any automated actions being used, such as restarting a Windows service or Linux process? Is there integration with a service desk such as ServiceNow? Is the alert a high-priority issue that affects revenue, such as a customer-facing problem, or is it an internal issue?

Herein lies the second challenge: alert floods. Alert floods fill your pager/email/phone with alerts that have either already been acknowledged or are irrelevant. Perhaps nothing is more frustrating than getting an alert from a device that you are in the process of working on or have placed into maintenance. Many Operations personnel have a special folder or rule to take care of this, but such workarounds may actually cause them to miss relevant alerts. Operations personnel must trust that the alerts they receive are valid and require their immediate attention.
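The two sources of noise named above, acknowledged events and devices in maintenance, are simple to suppress once the eventing system tracks them. The sketch below is illustrative only; `DeviceEvent`, `should_alert`, and the maintenance set are hypothetical names, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class DeviceEvent:
    device: str
    summary: str
    acknowledged: bool = False  # True once someone has taken ownership


def should_alert(event, devices_in_maintenance):
    """Return True only if the event deserves someone's attention."""
    if event.acknowledged:
        # Someone is already working on it; do not page again.
        return False
    if event.device in devices_in_maintenance:
        # Planned work is under way; noise from this device is expected.
        return False
    return True
```

Suppressing at the source like this is what makes the catch-all mail folder unnecessary: whatever does reach the pager has already passed both checks.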

Earning that trust requires an intelligent solution that provides granular control over alerts. Unlike the event storms discussed earlier, alerting lends itself to granular filtering, time-based policies, and escalation rules. The key is to have an eventing system that provides well-formed events that can be filtered by a set of flexible and powerful rules. For example, an alert is sent only if the automated action failed and the event has gone unacknowledged for ten minutes. If that alert is not cleared within another ten minutes, it is resent, only this time it goes to operations management. Finally, IT staff should be able to subscribe to alerts and share them with one another.
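The ten-minute escalation example can be written down as a small rule. This is a sketch under stated assumptions: the `EventState` fields, the recipient names, and the `next_recipient` function are hypothetical, and a real eventing system would express the same policy in its own rule language.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=10)    # wait for an acknowledgment
CLEAR_TIMEOUT = timedelta(minutes=10)  # wait for the alert to clear


@dataclass
class EventState:
    first_seen: datetime
    automated_action_failed: bool
    acknowledged: bool = False
    cleared: bool = False
    alerted_at: Optional[datetime] = None  # set by the caller after the first alert
    escalated: bool = False                # set by the caller after escalation


def next_recipient(state: EventState, now: datetime) -> Optional[str]:
    """Return who should be notified at time `now`, or None for nobody."""
    if state.cleared or not state.automated_action_failed:
        return None
    if state.alerted_at is None:
        # First alert: only if nobody has acknowledged within ten minutes.
        if not state.acknowledged and now - state.first_seen >= ACK_TIMEOUT:
            return "on_call_engineer"
        return None
    if not state.escalated and now - state.alerted_at >= CLEAR_TIMEOUT:
        # Still not cleared ten minutes after the first alert: escalate.
        return "operations_management"
    return None
```

The function only decides; the caller is expected to send the notification and record `alerted_at` or `escalated` afterward, which keeps the policy easy to test in isolation.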

Alerting for the Cloud and dynamic datacenter requires IT organizations to re-examine how they deliver, monitor, and alert on vital services. IT Operations has minutes to respond to issues that could take down tens, hundreds, or thousands of virtual servers, impacting the business in ways we have never seen before. Accepting a console full of “Red” or a pager/phone/email full of useless alerts is a recipe for disaster. However, with proper planning and re-evaluation of your current People, Process, and Solutions, IT Operations will be able to meet demands and keep the Cloud running.

About Floyd Strimling

Floyd Strimling is a Technology Evangelist at Zenoss who enjoys creating, debating, and following technology trends with the goal of making them a reality. Strimling’s unique background spans both hardware and software environments, with experience in Cloud Computing/Autonomic Computing, Datacenter Automation, Virtualization, Networking, and Security.

