The 4 Building Blocks of Root Cause Analysis
May 06, 2014
David Hayward

With every minute you can shave off root cause analysis, you get a minute closer to restoring the performance or availability of a process that's important to your business. But the plethora of monitoring tools used throughout your organization, each with its own root cause perspective about the IT environment, can lead to confusion, dysfunction and drawn-out debate when things go wrong. To get the most business value from these diverse views, you need to understand how they can work together.

Think of root cause analysis as a software stack: the higher a layer sits in the stack, the more meaningful it is from a business perspective. For example, in the Open Systems Interconnection (OSI) model, understanding layer 1, the physical layer, is vital, but layer 7, the application layer, is more meaningful to the business.

Each layer in the root cause analysis stack is provided by unique monitoring functions, analytics and visualization. Here they are, top down:

- Business Service Root Cause Analysis

- Application-Driven Root Cause Analysis

- Network Fault Root Cause Analysis

- Device Root Cause Analysis

Think of adding each layer in terms of a geometrical analogy of human awareness cleverly explained by the Russian philosopher P. D. Ouspensky in his book Tertium Organum. As he explained, if you were one-dimensional, a point, you couldn't think of a line. If you were a line, you couldn't perceive two dimensions: a square. If you were a square, you couldn't understand a cube. If you were a cube, you couldn't understand motion.

Let's see how each layer has legitimate root cause analysis and how each successive layer up the stack adds awareness and greater business value.

1. Device Root Cause Analysis

The device layer is the foundation. It lets you know whether a server, storage device, switch, router, load balancer or other device is simply up or down, fast or slow. If it's pingable, you know it has a power source, and diagnostics can tell you which subcomponent has the fault causing the outage. For the root cause of performance issues, you'll be relying on your monitoring tools' visual correlation of time series data and threshold alerts to see if the CPU, memory, disk, ports, etc. are degraded and why.
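As a rough illustration of the threshold alerts mentioned above, here is a minimal Python sketch. The metric names, threshold values and the `Sample` type are assumptions made for the example, not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    device: str
    metric: str   # hypothetical metric name, e.g. "cpu_pct"
    value: float  # utilization in percent


# Hypothetical thresholds; real values depend on the device and workload.
THRESHOLDS = {"cpu_pct": 90.0, "mem_pct": 85.0, "disk_pct": 95.0}


def threshold_alerts(samples, thresholds=THRESHOLDS):
    """Return (device, metric, value) for every sample above its threshold."""
    return [
        (s.device, s.metric, s.value)
        for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


samples = [
    Sample("srv-01", "cpu_pct", 97.2),
    Sample("srv-01", "mem_pct", 61.0),
    Sample("sw-03", "cpu_pct", 12.5),
]
alerts = threshold_alerts(samples)
print(alerts)
```

A check like this only says that a component is degraded. Working out why still depends on correlating the alert with the other time series for the same device.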

But if servers or network devices aren't reachable, how do you know for sure if they are down or if there's an upstream network root cause? To see this, you need to add a higher layer of monitoring and analytics.

2. Network Fault Root Cause Analysis

The next layer is Network Fault Root Cause Analysis. This is partly based on a mechanism called inductive modeling, which discovers relationships between networked devices by reading the port connections and the routing and configuration tables in each device.

When an outage occurs, inference, a related Network Root Cause Analysis mechanism, uses known network relationships to determine which devices are downstream from the one that is down. So instead of drowning in a sea of red alerts for all the unreachable devices, you get one upstream network root cause alert. This can also be applied to virtual servers and their underlying physical hosts, as well as network configuration issues.
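The inference step can be sketched in a few lines of Python. This is a simplified illustration, not how any specific product implements it: the topology map and device names are hypothetical, and it assumes each device has a single upstream parent as seen from the monitoring station.

```python
# Hypothetical discovered topology: upstream device -> directly downstream devices.
TOPOLOGY = {
    "core-rtr": ["dist-sw-1", "dist-sw-2"],
    "dist-sw-1": ["srv-a", "srv-b"],
    "dist-sw-2": ["srv-c"],
}


def upstream_of(topology):
    """Invert the downstream map into child -> parent."""
    return {
        child: parent
        for parent, children in topology.items()
        for child in children
    }


def root_causes(unreachable, topology):
    """Keep only unreachable devices whose upstream parent is still reachable.

    Devices with no known parent are also kept, since nothing upstream
    can explain their outage.
    """
    parent = upstream_of(topology)
    return sorted(d for d in unreachable if parent.get(d) not in unreachable)


def suppressed(unreachable, topology):
    """Unreachable devices explained by an upstream outage."""
    return sorted(set(unreachable) - set(root_causes(unreachable, topology)))


unreachable = {"dist-sw-1", "srv-a", "srv-b"}
print(root_causes(unreachable, TOPOLOGY))
print(suppressed(unreachable, TOPOLOGY))
```

Here three unreachable devices collapse into one root cause alert for `dist-sw-1`, and the two servers behind it are treated as symptoms. Networks with redundant paths would need a check across all upstream paths rather than a single parent.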

3. Application-Driven Root Cause Analysis

Next up is the application-driven layer, based on Application-Driven Network Performance Management, which includes two monitoring technologies: network flow analysis and end-to-end application delivery analysis.

The first mechanism, network flow analysis, lets you see which applications are running on your network segments and how much bandwidth each is using. When users complain that an application service is slow, this can tell you whether a bandwidth-monopolizing application is the root cause. Visualization includes stacked protocol charts, top hosts, top talkers, etc.
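To make the flow analysis idea concrete, the following Python sketch totals bytes per application and per host and ranks them by share of traffic. The flow records and their three-field layout are invented for illustration; real flow exports carry many more fields.

```python
from collections import Counter

# Hypothetical flow records: (source_host, application, bytes_transferred)
flows = [
    ("10.0.0.5", "backup", 800_000_000),
    ("10.0.0.7", "http", 50_000_000),
    ("10.0.0.5", "backup", 400_000_000),
    ("10.0.0.9", "voip", 20_000_000),
    ("10.0.0.7", "http", 30_000_000),
]


def bandwidth_share(flows, key_index):
    """Rank keys (host or application) by their share of total bytes.

    key_index selects the grouping field: 0 for host, 1 for application.
    """
    totals = Counter()
    for record in flows:
        totals[record[key_index]] += record[2]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(key, count / grand_total) for key, count in totals.most_common()]


by_app = bandwidth_share(flows, key_index=1)   # top applications
by_host = bandwidth_share(flows, key_index=0)  # top talkers
for app, share in by_app:
    print(f"{app}: {share:.1%}")
```

In this made-up data the backup job accounts for most of the bytes on the segment, which is the kind of bandwidth-monopolizing application the text describes.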

The second mechanism in this layer, end-to-end application delivery analysis, shows you the components of application response time: network round trip, retransmission, data transfer and server response. Together in a stacked graph, these reveal whether the network, the server or the application itself is slowing response. To see the detailed root cause in the offending domain, you drill down into a lower layer (e.g., into a network flow analysis, device monitoring or an application forensic tool).
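The stacked-graph reading can be approximated in code by finding which timing component takes the largest share of total response time. This is a minimal Python sketch; the component keys mirror the four timings named above, and the millisecond values are invented.

```python
def dominant_component(timings_ms):
    """Return (component, share) for the largest part of total response time."""
    total = sum(timings_ms.values())
    if total == 0:
        return None, 0.0
    name = max(timings_ms, key=timings_ms.get)
    return name, timings_ms[name] / total


# Hypothetical measurements for one transaction, in milliseconds.
sample = {
    "network_round_trip": 40.0,
    "retransmission": 10.0,
    "data_transfer": 50.0,
    "server_response": 300.0,
}

component, share = dominant_component(sample)
print(component, f"{share:.0%}")
```

A result pointing at server response suggests drilling into device or application tooling, while a large retransmission share would point back toward the network layers.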

4. Business Service Root Cause Analysis

The best practice is to unify the three layers described above into a single infrastructure management dashboard, so you can visually correlate all three levels of analytics in an efficient workflow. This is ideal for technical Level 2 Operations specialists and administrators.

But there's one more level at the top of the stack: Business Service Root Cause Analysis. This gives IT Operations Level 1 staff the greatest insight into how infrastructure is impacting business processes.

Examples of business processes include: Concept To Product, Product To Launch, Opportunity To Order, Order To Cash, Request To Service, Design To Build, Manufacturing To Distribution, Build To Order, Build To Stock, Requisition To Payables and so on.

At this layer of the stack, you monitor application and infrastructure components in groups that support each business process. This allows you to monitor each business process as you would an IT infrastructure service, and a mechanism called service impact analysis rates the relative impact each component has on the service performance. From there you can drill down into a lower layer in the stack to see the technical root cause details of the service impact (network outage, not enough bandwidth, server memory degradation, packet loss, not enough host resources for a virtual server, application logic error, etc.).
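The article does not say how service impact analysis computes its ratings, so the following Python sketch uses one simple assumed scheme: each component gets a weight reflecting its importance to the business process, and its impact is that weight multiplied by a severity score for its current state. The component names, weights and severity values are all hypothetical.

```python
# Assumed severity scores per health state.
SEVERITY = {"ok": 0.0, "degraded": 0.5, "down": 1.0}


def service_impact(components):
    """Rank components by their impact on a business service.

    components: iterable of (name, weight, state) tuples, where weight is
    the assumed importance of the component to the service (0.0 to 1.0).
    Returns (name, impact) pairs with non-zero impact, highest first.
    """
    scored = [(name, weight * SEVERITY[state]) for name, weight, state in components]
    impacting = [item for item in scored if item[1] > 0]
    return sorted(impacting, key=lambda item: item[1], reverse=True)


# Hypothetical component group supporting an Order To Cash process.
order_to_cash = [
    ("web-frontend", 0.2, "ok"),
    ("order-db", 0.5, "degraded"),
    ("wan-link", 0.3, "down"),
]

ranking = service_impact(order_to_cash)
print(ranking)
```

The top entry in the ranking is where you would start drilling down into the lower layers for the technical root cause.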

Once you have a clear understanding of this architecture, and a way to unify the information into a smooth workflow for triage, you can put the human processes in place to realize its business value.

ABOUT David Hayward

David Hayward is Senior Principal Manager, Solutions Marketing at CA Technologies. Hayward specializes in integrated network, systems and application performance management, and his research, writing and speaking engagements focus on IT operations maturity challenges, best practices and IT management software return on investment. He began his career in 1979 as an editor at the groundbreaking BYTE computer magazine and has since held senior marketing positions in tier one and startup computer system, networking, data warehousing, VoIP and security solution vendors.
