It seems that everyone in IT has caught “Cloud-fever,” as Enterprises and Service Providers alike race to revamp their architectures and offerings to take advantage of this great IT inflection point. However, lost within the technology is the reality that someone is responsible for keeping the Cloud up and running. That someone is usually Operations personnel along with their fellow Systems, Network, Storage, and Security Engineers. The lifeline of these dedicated individuals is a unified monitoring and eventing system with a goal of providing relevant, functional, and timely alerts.
To accomplish this goal, IT Operations must have the ability to effectively monitor the entire datacenter, and to provide high-quality data to the eventing system. As the saying goes, “garbage-in, garbage-out,” and no degree of filtering or pre-processing will alleviate this problem. In the end, the monitoring data that is collected is turned into events that are processed by the eventing system independent of the alerting mechanisms. This allows common techniques such as correlation, filtering, and suppression to take place prior to an alert being issued.
Herein lies the first challenge. How do you take an event storm with tens, hundreds, or even thousands of events and turn it into a single relevant event and subsequent alert? Rules-based correlation engines of the past cannot keep pace with the high rate of change within the dynamic datacenter. Instead, a new approach is needed that views the infrastructure as services instead of individually monitored components, and provides a service assurance layer to IT Operations and other business stakeholders. Assuming that the first challenge is overcome, it is time to design an alerting solution.
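To make the service-oriented view concrete, here is a minimal sketch of rolling component events up into one event per service. It is an illustration under stated assumptions, not the method of any particular product: the names `ComponentEvent`, `roll_up_by_service`, the severity labels, and the component-to-service map are all hypothetical, and a real service assurance layer would need to track dependencies that change over time.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2, "critical": 3}


@dataclass(frozen=True)
class ComponentEvent:
    component: str  # e.g. a host, switch, or volume
    severity: str   # one of the keys in SEVERITY_ORDER
    message: str


def roll_up_by_service(events, service_map):
    """Collapse many component events into one summary per service.

    service_map is an assumed lookup from component name to service name;
    components missing from it are grouped under "unmapped".
    """
    grouped = defaultdict(list)
    for ev in events:
        grouped[service_map.get(ev.component, "unmapped")].append(ev)

    summary = {}
    for service, evs in grouped.items():
        # The service takes the severity of its worst component event.
        worst = max(evs, key=lambda e: SEVERITY_ORDER[e.severity])
        summary[service] = {
            "severity": worst.severity,
            "event_count": len(evs),
            "components": sorted({e.component for e in evs}),
        }
    return summary
```

With a mapping that places two web servers and a database under one service, three component events become a single service-level event carrying the worst severity, which is the unit an alert would then be raised against.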
Careful consideration must be given to the purpose of the alert being processed. For example, is it an informational alert to the customer regarding a service issue, or an operational alert asking a system administrator to fix an issue? Are any automated actions being used, such as restarting a Windows service or Linux process? Is there integration with a service desk such as ServiceNow? Is the alert a high-priority, revenue-affecting issue such as a customer-facing problem, or is it an internal issue?
Herein lies the second challenge: alert floods. Alert floods fill your pager/email/phone with alerts that have either already been acknowledged or are irrelevant. Perhaps nothing is more frustrating than getting an alert from a device that you are in the process of working on or have placed into maintenance. Many Operations personnel have a special folder or rule to take care of this, but this may actually cause them to miss relevant alerts. Operations personnel must trust that the alerts they receive are valid and require their immediate attention.
Only an intelligent solution that provides granular control over alerts will eliminate this issue. Unlike the event storms discussed earlier, alerting lends itself to granular filtering, time-based policies, and escalation rules. The key is an eventing system that produces well-formed events that can be filtered by a set of flexible and powerful rules. For example, an alert is sent only if the automated action failed and the event has gone unacknowledged for ten minutes. If that alert is not cleared within another ten minutes, it is resent, this time to operations management. Finally, IT staff should be able to subscribe to alerts and share them with one another.
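The ten-minute rule in the example above can be sketched as a small routing function. This is a minimal sketch, not any product's rule engine: the `Event` fields, the `route_alert` function, and the recipient names "on-call" and "operations-management" are all hypothetical, and the maintenance check is an added assumption drawn from the alert-flood discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Both waiting periods in the example are ten minutes.
ESCALATION_DELAY = timedelta(minutes=10)


@dataclass
class Event:
    device: str
    summary: str
    first_seen: datetime
    automated_action_failed: bool = False
    acknowledged: bool = False
    cleared: bool = False
    in_maintenance: bool = False  # assumption: maintenance suppresses alerts
    first_alert_sent: Optional[datetime] = None
    escalated: bool = False


def route_alert(event: Event, now: datetime) -> Optional[str]:
    """Return the group to notify right now, or None to stay silent."""
    # Cleared events and devices under maintenance never alert.
    if event.cleared or event.in_maintenance:
        return None
    # Alert only when the automated fix was tried and failed.
    if not event.automated_action_failed:
        return None

    if event.first_alert_sent is None:
        # First alert: event has gone unacknowledged for ten minutes.
        if not event.acknowledged and now - event.first_seen >= ESCALATION_DELAY:
            event.first_alert_sent = now
            return "on-call"
        return None

    # Escalation: still not cleared ten minutes after the first alert.
    if not event.escalated and now - event.first_alert_sent >= ESCALATION_DELAY:
        event.escalated = True
        return "operations-management"
    return None
```

Calling `route_alert` on each evaluation cycle yields at most two notifications per event: one to the on-call group and, if the event is still open ten minutes later, one to management. Keeping the state on the event itself is a simplification; a production system would more likely persist it in the eventing system.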
Alerting for the Cloud and dynamic datacenter requires IT organizations to re-examine how they deliver, monitor, and alert on vital services. IT Operations has minutes to respond to issues that could take down tens, hundreds, or thousands of virtual servers, impacting the business in ways we have never seen before. Accepting a console full of “Red” or a pager/phone/email full of useless alerts is a recipe for disaster. However, with proper planning and re-evaluation of your current People, Process, and Solutions, IT Operations will be able to meet demands and keep the Cloud running.
About Floyd Strimling
Floyd Strimling is a Technology Evangelist at Zenoss, who enjoys creating, debating, and following technology trends with the goal of making them a reality. Strimling’s unique background spans both hardware and software environments with experience in Cloud Computing/Autonomic Computing, Datacenter Automation, Virtualization, Networking and Security.