The 4 Building Blocks of Root Cause Analysis

May 06, 2014


With every minute you can shave off root cause analysis, you get a minute closer to restoring the performance or availability of a process that's important to your business. But the plethora of monitoring tools used throughout your organization, each with its own root cause perspective about the IT environment, can lead to confusion, dysfunction and drawn-out debate when things go wrong. To get the most business value from these diverse views, you need to understand how they can work together.

Think of root cause analysis as a software stack: the higher a layer sits in the stack, the more meaningful it is from a business perspective. For example, in the Open Systems Interconnection (OSI) model, understanding layer 1, the physical layer, is vital, but layer 7, the application layer, is more meaningful to the business.

Each layer in the root cause analysis stack is provided by unique monitoring functions, analytics and visualization. Here they are, top down:

- Business Service Root Cause Analysis

- Application-Driven Root Cause Analysis

- Network Fault Root Cause Analysis

- Device Root Cause Analysis

Think of adding each layer in terms of a geometrical analogy of human awareness cleverly explained by the Russian philosopher P.D. Ouspensky in his book Tertium Organum. As he explained, if you were one-dimensional, a point, you couldn't think of a line. If you were a line, you couldn't perceive two dimensions: a square. If you were a square, you couldn't understand a cube. If you were a cube, you couldn't understand motion.

Let's see how each layer has legitimate root cause analysis and how each successive layer up the stack adds awareness and greater business value.

1. Device Root Cause Analysis

The device layer is the foundation. It tells you whether a server, storage device, switch, router, load balancer or other device is up or down, fast or slow. If it responds to a ping, you know it has a power source, and diagnostics can tell you which subcomponent has the fault causing an outage. For the root cause of performance issues, you'll rely on your monitoring tools' visual correlation of time series data and threshold alerts to see whether the CPU, memory, disk, ports, etc. are degraded and why.
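As a rough illustration of a threshold alert on device time series data, here is a minimal Python sketch. The metric names, the threshold values and the "three consecutive samples" rule are assumptions made for the example, not settings from any particular monitoring product.

```python
# Hypothetical utilization thresholds in percent; real values depend on the device.
THRESHOLDS = {"cpu": 90.0, "memory": 85.0, "disk": 95.0}


def breached_metrics(samples: dict[str, list[float]], min_consecutive: int = 3) -> list[str]:
    """Return the metrics whose most recent samples all exceed their threshold.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    breached = []
    for metric, values in samples.items():
        limit = THRESHOLDS.get(metric)
        # Skip metrics with no threshold or too little history to judge.
        if limit is None or len(values) < min_consecutive:
            continue
        if all(value > limit for value in values[-min_consecutive:]):
            breached.append(metric)
    return sorted(breached)
```

Calling `breached_metrics` with recent samples for one device returns only the metrics that stayed above their limit, which is the starting point for asking why that resource is degraded.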

But if servers or network devices aren't reachable, how do you know for sure if they are down or if there's an upstream network root cause? To see this, you need to add a higher layer of monitoring and analytics.

2. Network Fault Root Cause Analysis

The next layer is Network Fault Root Cause Analysis. This is partly based on a mechanism called inductive modeling, which maps the relationships between networked devices by discovering port connections and reading the routing and configuration tables in each device.

When an outage occurs, inference, a related Network Root Cause Analysis mechanism, uses known network relationships to determine which devices are downstream from the one that is down. So instead of drowning in a sea of red alerts for all the unreachable devices, you get one upstream network root cause alert. This can also be applied to virtual servers and their underlying physical hosts, as well as network configuration issues.
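The idea of collapsing many unreachable-device alerts into one upstream alert can be sketched in a few lines of Python. This is a simplified illustration, not how any specific product implements inference: the topology dictionary, the `monitor` node and the device names are hypothetical, and the sketch assumes the topology has already been discovered.

```python
from collections import deque


def root_cause_alerts(topology, monitor, unreachable):
    """Split unreachable devices into root causes and suppressed symptoms.

    topology:    dict mapping each device to the devices it connects to
    monitor:     the device the monitoring station polls from
    unreachable: set of devices that failed their reachability check

    A device is a root cause if it is unreachable but adjacent to a device
    that can still be reached from the monitor. Every other unreachable
    device sits behind a root cause and is treated as a symptom.
    """
    roots = set()
    seen = {monitor}
    queue = deque([monitor])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if neighbor in unreachable:
                roots.add(neighbor)      # stop here: do not walk past a failed device
            else:
                queue.append(neighbor)   # healthy device: keep walking downstream
    suppressed = set(unreachable) - roots
    return roots, suppressed
```

Because the walk only continues through reachable devices, a device with a redundant healthy path is still visited, so it is never wrongly suppressed behind a failed neighbor.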

3. Application-Driven Root Cause Analysis

Next up is Application-Driven Root Cause Analysis, which draws on two application-driven network performance management technologies: network flow analysis and end-to-end application delivery analysis.

The first mechanism lets you see which applications are running on your network segments and how much bandwidth each is using. When users are complaining that an application service is slow, this can let you know when a bandwidth-monopolizing application is the root cause. Visualization includes stacked protocol charts, top hosts, top talkers, etc.
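To make the flow analysis idea concrete, here is a small Python sketch that totals bytes per application and per source host. The record fields (`src`, `app`, `bytes`) are assumptions chosen for the example; real flow records exported by network devices carry many more fields and need a collector to parse them.

```python
from collections import Counter


def bandwidth_by_application(flows):
    """Total the bytes seen for each application across all flow records."""
    totals = Counter()
    for flow in flows:
        totals[flow["app"]] += flow["bytes"]
    return totals


def top_talkers(flows, n=3):
    """Return the n source hosts that sent the most bytes, largest first."""
    totals = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return totals.most_common(n)
```

If one application accounts for most of the bytes on a congested segment, it is a candidate for the bandwidth-monopolizing root cause, and the top talkers list shows which hosts are generating that traffic.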

The second mechanism in this layer shows you end-to-end application response timing: network round trip, retransmission, data transfer and server response. Together in a stacked graph, this reveals if the network, the server or the application itself is impacting response. To see the detailed root cause in the offending domain, you drill down into a lower layer (e.g., into a network flow analysis, device monitoring or an application forensic tool).
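A minimal Python sketch of reading such a response time breakdown follows. The component names mirror the four timings listed above, but the function, its inputs and the millisecond values in any call are illustrative assumptions, not output from a real tool.

```python
def dominant_component(timing_ms):
    """Return the component with the largest share of total response time.

    timing_ms maps a component name (for example "server_response") to the
    time in milliseconds attributed to it for one transaction or interval.
    Returns (name, shares), where shares maps each component to its fraction
    of the total. Returns (None, {}) when there is no measured time.
    """
    total = sum(timing_ms.values())
    if total <= 0:
        return None, {}
    shares = {name: value / total for name, value in timing_ms.items()}
    worst = max(shares, key=shares.get)
    return worst, shares
```

The component with the largest share points to the domain to drill into next; the stacked graph described above shows the same shares over time.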

4. Business Service Root Cause Analysis

The best practice is to unify the three layers into a single infrastructure management dashboard, so you can visually correlate all three levels of analytics in an efficient workflow. This is ideal for technical Level 2 Operations specialists and administrators.

But there's one more level at the top of the stack: Business Service Root Cause Analysis. This gives IT Operations Level 1 staff the greatest insight into how infrastructure is impacting business processes.

Examples of business processes include: Concept To Product, Product To Launch, Opportunity To Order, Order To Cash, Request To Service, Design To Build, Manufacturing To Distribution, Build To Order, Build To Stock, Requisition To Payables and so on.

At this layer of the stack, you monitor application and infrastructure components in groups that support each business process. This allows you to monitor each business process as you would an IT infrastructure service. A mechanism called service impact analysis rates the relative impact each component has on the service's performance. From there you can drill down into a lower layer in the stack to see the technical root cause details of the service impact (network outage, not enough bandwidth, server memory degradation, packet loss, not enough host resources for a virtual server, application logic error, etc.).
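One simple way to picture service impact analysis is a weighted score per component. The Python sketch below is an assumption-laden illustration: the weights, the 0-to-1 health values and the `weight * (1 - health)` formula are invented for the example and are not the scoring method of any specific product.

```python
def rank_service_impact(components):
    """Rank the components of one business service by their estimated impact.

    components: iterable of (name, weight, health) tuples, where
      weight is how much the service depends on the component, and
      health is 1.0 for fully healthy and 0.0 for down.
    Impact is estimated as weight * (1 - health), highest first.
    """
    scored = [(name, weight * (1.0 - health)) for name, weight, health in components]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The component at the top of the ranking is where to start the drill-down into the lower layers of the stack.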

Once you have a clear understanding of this architecture, and a way to unify the information into a smooth workflow for triage, you can put the human processes in place to realize its business value.

ABOUT David Hayward

David Hayward is Senior Principal Manager, Solutions Marketing at CA Technologies. Hayward specializes in integrated network, systems and application performance management, and his research, writing and speaking engagements focus on IT operations maturity challenges, best practices and IT management software return on investment. He began his career in 1979 as an editor at the groundbreaking BYTE computer magazine and has since held senior marketing positions at tier one and startup vendors of computer systems, networking, data warehousing, VoIP and security solutions.
