Root-Cause Analysis of Application Performance Problems
November 12, 2013

Charley Rich
Nastel Technologies

Share this

I first came upon the term root-cause analysis (RCA) while working at a network management startup. The concept was to determine why a problem occurred so that repair could happen sooner and service restored. To do this required a discovery of the topology of a network and its devices in order to understand where a problem could occur and the relationship between the various parts. Monitoring was necessary in order to identify that a failure occurred and provide notification.

However, the challenge in doing this was that many failure events are received in seemingly random order; thus, it is very difficult to differentiate which events signified symptoms of the problem and which event represented the actual cause. To resolve this, some solutions constructed elaborate causality chains in the hope you could follow them backwards in time to the "root-cause". This is akin to following smoke and having it lead you to the fire. Well it does work, if you do it fast enough and before the whole forest is in flames.

The obvious next thing to do was apply this to applications. It certainly seemed like a good idea at the time ... but it turned out to be much harder than expected. Why harder? Applications are far more complex than networks with many more variations in behavior and relationship. So, instead monitoring systems were applied to the various silos of application architecture such as web servers, application servers, middleware, databases and others.

For many years the focus of APM was on making the application server run better. And from that perspective, it was successful. However, while the application server became more reliable and ran faster, the two key features IT Operations management desire: getting alerted to problems before the end user is affected and being pointed in the right direction have not improved much.

Part of the difficulty in this sort of multiplicity of monitoring tools world is that there so many sources of events and so many moving parts. Is the cause capacity, a stuck message, configuration issues or even worse a misunderstanding of business requirements? Perhaps, the application is running just fine with all indicators green, but the results aren't what the business expected. Or it works fine for users in one group, but not for another. These are very difficult problems to unravel.

An approach Forrester Research suggests is to bring the events from the various sources to a single pain of glass and perform a root-cause analysis. The suggestion is made to use a technology called Complex Event Processing (CEP) to search in real-time for patterns based on events from multiple sources that together describe a problem.

CEP is very good at identifying situations spanning multiple event streams, correlating the individual events together into the "big picture", the situation. Analogous to this is the concept in QA of test cases. Think of situations as the test cases that occur spontaneously in production. APM is not for the faint of heart.

CEP can tie the seemingly unrelated events together into a picture that tells a story, what happened and what triggered it. CEP, using rules is of course dependent on the quality and completeness of those rules. But, that is something that grows ever better over time. A new situation can be described and prevented from ever causing harm again. Without the relationship between the events from the various sources, that would not be possible. We would just be fixing the web server or the database or the application server. With this approach, we are fixing the problem.

CEP represents an actionable form of analytics. You can add CEP analytics to your APM including your currently deployed monitoring solutions as it is inherently a multi-source approach. Utilizing this and delivering root-cause analysis can improve your incident management process. It can help you achieve the IT Ops goals of: getting alerted to problems before the end user is affected and being pointed in the right direction.

Charley Rich is VP Product Management and Marketing at Nastel Technologies.

Charley Rich is VP Product Management and Marketing at Nastel Technologies and has over 28 years of technical, hands-on experience working with large-scale customers to meet their application and systems management requirements. Prior to joining Nastel, Charley was Product Manager for IBM's Tivoli Application Dependency Discovery Manager software, where he co-authored an IBM Redbook, charted the product roadmap, managed an agile requirements process and was recognized for his accomplishments by winning the Tivoli General Manager's Award. Recently, Charley was granted a patent for an Application Discovery and Monitoring process.
Share this

The Latest

July 22, 2019

Many organizations are unsure where to begin with AIOps, but should seriously consider adopting an AIOps strategy and solution. To get started, it's important to identify the key capabilities of AIOps that are needed to realize maximum value from your investments ...

July 18, 2019

Organizations that are working with artificial intelligence (AI) or machine learning (ML) have, on average, four AI/ML projects in place, according to a recent survey by Gartner, Inc. Of all respondents, 59% said they have AI deployed today ...

July 17, 2019

The 11th anniversary of the Apple App Store frames a momentous time period in how we interact with each other and the services upon which we have come to rely. Even so, we continue to have our in-app mobile experiences marred by poor performance and instability. Apple has done little to help, and other tools provide little to no visibility and benchmarks on which to prioritize our efforts outside of crashes ...

July 16, 2019

Confidence in artificial intelligence (AI) and its ability to enhance network operations is high, but only if the issue of bias is tackled. Service providers (68%) are most concerned about the bias impact of "bad or incomplete data sets," since effective AI requires clean, high quality, unbiased data, according to a new survey of communication service providers ...

July 15, 2019

Every internet connected network needs a visibility platform for traffic monitoring, information security and infrastructure security. To accomplish this, most enterprise networks utilize from four to seven specialized tools on network links in order to monitor, capture and analyze traffic. Connecting tools to live links with TAPs allow network managers to safely see, analyze and protect traffic without compromising network reliability. However, like most networking equipment it's critical that installation and configuration are done properly ...

July 11, 2019

The Democratic presidential debates are likely to have many people switching back-and-forth between live streams over the coming months. This is going to be especially true in the days before and after each debate, which will mean many office networks are likely to see a greater share of their total capacity going to streaming news services than ever before ...

July 10, 2019

Monitoring of heating, ventilation and air conditioning (HVAC) infrastructures has become a key concern over the last several years. Modern versions of these systems need continual monitoring to stay energy efficient and deliver satisfactory comfort to building occupants. This is because there are a large number of environmental sensors and motorized control systems within HVAC systems. Proper monitoring helps maintain a consistent temperature to reduce energy and maintenance costs for this type of infrastructure ...

July 09, 2019

Shoppers won’t wait for retailers, according to a new research report titled, 2019 Retailer Website Performance Evaluation: Are Retail Websites Meeting Shopper Expectations? from Yottaa ...

June 27, 2019

Customer satisfaction and retention were the top concerns for a majority (58%) of IT leaders when suffering downtime or outages, according to a survey of top IT leaders conducted by AIOps Exchange. The effect of service interruptions on customers outweighed other concerns such as loss of revenue, brand reputation, negative press coverage, or the impact on IT Ops teams.

June 26, 2019

It is inevitable that employee productivity and the quality of customer experiences suffer as a consequence of the poor performance of O365. The quick detection and rapid resolution of problems associated with O365 are top of mind for any organization to keep its business humming ...