Can event management help foster a curiosity for innovative possibilities to make application performance better? Blue-sky thinkers may not want to deal with the myriad of details on how to manage the events being generated operationally, but could learn something from this exercise.
Consider the major system failures in your organization over the last 12 to 18 months. What if you had a system or process in place to capture those failures and proactively prevent them from recurring? How much better off would you be if you could avoid the proverbial “Groundhog Day” of system outages? The argument that system monitoring is just a nice-to-have, and not really a core requirement for operational readiness, dissipates quickly when a critical application goes down with no warning.
Starting with the event management and incident management processes may seem like a reactive approach when implementing an Application Performance Management (APM) solution, but is it really? If “Rome is burning,” wouldn’t the most prudent action be to extinguish the fire first and then come up with a proactive approach for prevention? Managing the operational noise can calm the environment, allowing you to focus on APM strategy more effectively.
Asking the right questions during a post-mortem review will help generate dialog, outlining options for alerting and prevention. This will direct your thinking towards a new horizon of continual improvement that will help galvanize proactive monitoring as an operational requirement.
Here are three questions that build on each other as you work to mature your solution:
1. Did we alert on it when it went down, or did the user community call us?
2. Can we get a proactive alert on it before it goes down (e.g. a failed power supply in a server with dual power supplies)?
3. Can we trend on the event to create a predictive alert before it is escalated (e.g. disk space utilization triggering minor at 90%, major at 95%, critical at 98%)?
These three questions map, respectively, to the following categories: Reactive, Proactive, and Predictive.
Reactive – Alerts that Occur at Failure
Multiple events can occur before a system failure; eventually an alert will come in notifying you that an application is down. It will come either from users calling the Service Desk to report an issue or from a system-generated event that corresponds with the application failure.
Proactive – Alerts that Occur Before Failure
These alerts will most likely come from proactive monitoring, telling you that there are component failures that need attention but have not yet affected overall application availability (e.g. a failed power supply in a server with dual power supplies).
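As a rough illustration of the proactive case, the sketch below raises a warning when a redundant component has lost some of its units while the service is still up. The `RedundantComponent` class, its field names, and the alert wording are hypothetical, chosen only for this example; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedundantComponent:
    """Hypothetical model of a component with redundant units, e.g. power supplies."""
    name: str
    units_total: int
    units_healthy: int


def proactive_alert(component: RedundantComponent) -> Optional[str]:
    """Return an alert message if redundancy is degraded, otherwise None."""
    if component.units_healthy == 0:
        # Nothing left: this is already an outage, i.e. the reactive case.
        return f"CRITICAL: {component.name} has failed; service is affected"
    if component.units_healthy < component.units_total:
        # Some units are gone but the service is still running: the proactive case.
        lost = component.units_total - component.units_healthy
        return (f"WARNING: {component.name} lost {lost} of "
                f"{component.units_total} units; service still up")
    return None


if __name__ == "__main__":
    print(proactive_alert(RedundantComponent("power supply", 2, 1)))
```

The point of the sketch is the middle branch: the alert fires while there is still time to replace the failed unit, before availability is affected.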
Predictive – Alerts that Trend on a Possible Failure
These alerts are usually set up in parallel with trending reports that help predict subtle changes in the environment (e.g. trending on memory usage or disk utilization before running out of resources).
Once you build awareness in the organization that you have a bird’s-eye view of the technical landscape and can monitor the ecosystem of each application (as an ecologist would), people become more meticulous when introducing new elements into the environment. They know that you are watching, taking samples, and trending on overall health and stability, leaving you free to focus on the strategic side of APM without distraction.
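To make the predictive idea concrete, here is a minimal sketch that uses the disk space thresholds from the third question (minor at 90%, major at 95%, critical at 98%) and a simple linear trend to estimate how long until the critical threshold is reached. The function names, the hourly sample format, and the choice of a least-squares straight line are assumptions made for illustration; a real deployment might well use a different trending method.

```python
from typing import List, Optional, Tuple

# Thresholds taken from the disk space example: minor 90%, major 95%, critical 98%.
THRESHOLDS = [(98.0, "critical"), (95.0, "major"), (90.0, "minor")]


def severity(percent_used: float) -> Optional[str]:
    """Map a utilization percentage to a severity label, or None if below all thresholds."""
    for limit, label in THRESHOLDS:
        if percent_used >= limit:
            return label
    return None


def hours_until(samples: List[Tuple[float, float]], limit: float = 98.0) -> Optional[float]:
    """Estimate hours from the last sample until `limit` percent is reached.

    `samples` is a list of (hour, percent_used) pairs. The growth rate is the
    least-squares slope over all samples; the estimate extrapolates from the
    most recent sample at that rate. Returns None if usage is not growing or
    there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    _, last_y = samples[-1]
    if last_y >= limit:
        return 0.0
    return (limit - last_y) / slope


if __name__ == "__main__":
    readings = [(0, 80.0), (1, 82.0), (2, 84.0)]  # hypothetical hourly readings
    print(severity(readings[-1][1]), hours_until(readings))
```

With readings like these, no threshold has been crossed yet, but the trend already indicates roughly how much time remains, which is what allows a predictive alert to be raised ahead of the minor, major, and critical escalations.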
ABOUT Larry Dragich
Larry Dragich, a regular blogger and contributor on APMdigest, has 23 years of IT experience, and has been in an IT leadership role at the Auto Club Group (ACG) for the past ten years. He serves as Director of Enterprise Application Services (EAS) at the Auto Club Group, with overall accountability for optimizing the capability of the IT infrastructure to deliver high availability and optimal performance. Dragich is actively involved with industry leaders, sharing knowledge of APM technologies ranging from best practices and technical workflows to resource allocation and approaches for implementing APM strategies.
You can contact Larry on LinkedIn
For a high-level view of a much broader technology space, refer to the slide show on BrightTALK.com, which describes “The Anatomy of APM - webcast” in more context.
For more information on the critical success factors in APM adoption and how this centers around the End-User-Experience (EUE), read The Anatomy of APM and the corresponding blog APM’s DNA – Event to Incident Flow.