Optimizing Root Cause Analysis to Reduce MTTR
October 11, 2012

Ariel Gordon

Detecting and resolving problems efficiently is essential to keep business services running, minimize the impact on them, and limit any financial losses.

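The goal is to turn the tables on IT problems. Typically, about 80 percent of the time goes into root cause analysis and only 20 percent into actually fixing the problem; the aim is to reverse that ratio, so that finding the cause takes 20 percent of the time and the fix gets the other 80 percent.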

When resolving an issue, communication is critical for bringing different expert groups together toward a common goal. Because each team has a narrow view limited to its own domain and expertise, there is always a danger that the "big picture" will be missed. You don't want a lack of communication to turn into blame games and finger pointing.

Some problem detection methods include:

- Infrastructure Monitoring: Monitoring the utilization of specific resources such as disk, memory, and CPU is effective for identifying availability failures, sometimes even heading them off before they happen.

- Domain or Application Tools: These help, but overall problem detection remains a game of hide-and-seek: a manually intensive effort carried out under pressure to deliver a fix as quickly as possible.

- Dependency Mapping: Tools that map business services and applications to infrastructure components can generate a topology map that improves your root cause analysis process, for the following reasons:

1. Connect Symptoms to Problems: A single map that relates a business service (the user's point of view) to its configuration items helps you detect problems faster.

2. Common Ground: The map ties all elements together so that different groups can focus on a shared, cross-domain effort.

3. High-Level, Cross-Domain View: Teams can view problems not only in the context of their own domain, but also in the wider context of all network components. For example, a database administrator analyzing slow database performance can examine the topology map to see how networking components affect the database.
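To make the topology idea concrete, here is a minimal Python sketch, assuming the dependency map is available as a simple dictionary from each component to the components it depends on. The service and CI names (`online-banking`, `orders-db`, `switch-a`, and so on) and both helper functions are hypothetical, not taken from any real dependency mapping product. The sketch walks from a business service down to its configuration items and flags alerting CIs whose own dependencies are all healthy as likely root causes, treating the remaining alerts as downstream symptoms.

```python
from collections import deque

# Hypothetical topology: each node lists the components it depends on.
# All names are illustrative only.
TOPOLOGY: dict[str, list[str]] = {
    "online-banking": ["web-app"],             # business service (user's view)
    "web-app": ["app-server-1", "orders-db"],
    "app-server-1": ["switch-a"],
    "orders-db": ["storage-array", "switch-a"],
    "storage-array": [],
    "switch-a": [],                            # network component
}


def configuration_items(node: str, topology: dict[str, list[str]]) -> list[str]:
    """Return every CI the node depends on, directly or indirectly, in BFS order."""
    seen = {node}
    order: list[str] = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for dep in topology.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def probable_root_causes(
    service: str, topology: dict[str, list[str]], alerting: set[str]
) -> list[str]:
    """Alerting CIs under the service whose own (transitive) dependencies are healthy.

    An alert on a CI that depends on another alerting CI is treated as a
    downstream symptom rather than a root cause.
    """
    candidates = []
    for ci in configuration_items(service, topology):
        if ci in alerting:
            upstream_alerting = any(
                dep in alerting for dep in configuration_items(ci, topology)
            )
            if not upstream_alerting:
                candidates.append(ci)
    return candidates


if __name__ == "__main__":
    alerts = {"web-app", "orders-db", "switch-a"}
    print(probable_root_causes("online-banking", TOPOLOGY, alerts))
```

With alerts on `web-app`, `orders-db`, and `switch-a`, only `switch-a` is reported, which mirrors the database example: the slow database is a symptom of a networking problem. A real product would likely also weigh alert timing and severity; this sketch relies only on the shape of the map.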

Root cause analysis is complex, and no single tool or approach will give you full coverage. The idea is to plan a portfolio of tools that together deliver the most impact for your organization.

For instance, if you do not have a central event management console, consider implementing a topology-based event management solution. If most of your applications involve online transactions, look for a transaction management product that covers the technology stack common in your environment. Put differently, select the combination of tools that is right for your environment.
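A topology-based event management console can use the same kind of dependency map in the opposite direction: instead of walking down from a service to its components, it asks which business services each incoming event affects, so related events land together. The Python sketch below is illustrative only; it assumes events arrive as dictionaries with a hypothetical `ci` field, and the topology, service names, and functions are all made up for the example rather than drawn from any product.

```python
from collections import defaultdict

# Hypothetical dependency map: node -> components it depends on.
TOPOLOGY: dict[str, list[str]] = {
    "online-banking": ["web-app"],
    "payroll": ["hr-app"],
    "web-app": ["orders-db"],
    "hr-app": ["hr-db"],
    "orders-db": ["storage-array"],
    "hr-db": ["storage-array"],   # shared component used by both services
    "storage-array": [],
}
BUSINESS_SERVICES = ["online-banking", "payroll"]

# Hypothetical raw events as they might reach a central console.
EVENTS = [
    {"ci": "storage-array", "msg": "high I/O latency"},
    {"ci": "orders-db", "msg": "slow queries"},
    {"ci": "hr-app", "msg": "request timeouts"},
]


def depends_on(node: str, target: str, topology: dict[str, list[str]]) -> bool:
    """True if target is node itself or reachable from node (iterative DFS)."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(topology.get(current, []))
    return False


def impacted_services(ci: str, topology: dict[str, list[str]], services: list[str]) -> list[str]:
    """Business services whose dependency tree includes the given CI."""
    return [svc for svc in services if depends_on(svc, ci, topology)]


def group_events_by_service(events, topology, services) -> dict[str, list[dict]]:
    """Attach each event to every business service it could affect."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        for svc in impacted_services(event["ci"], topology, services):
            grouped[svc].append(event)
    return dict(grouped)


if __name__ == "__main__":
    for service, evts in group_events_by_service(EVENTS, TOPOLOGY, BUSINESS_SERVICES).items():
        print(service, [e["msg"] for e in evts])
```

Here the storage array event shows up under both services, which signals that a shared component may explain problems in two otherwise unrelated areas. Whether to show the event once or under each service is a presentation choice that a real console would make for you.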

Once you have assessed which tools provide the most value, implement them in descending order of value so that you get the biggest impact first.

Ariel Gordon is VP Products and Co-Founder of Neebula.
