Optimizing Root Cause Analysis to Reduce MTTR
October 11, 2012

Ariel Gordon

Share this

Efficiently detecting and resolving problems is essential, of course, to continue supporting - and minimizing impact on - business services, as well as minimizing any financial impacts.

The goal is to turn the tables on IT problems so that 80 percent of the time is spent on the root cause analysis versus 20 percent on the actual problem fixing.

In resolving the issue, communication is a critical factor for integrating different expert groups towards a common goal. Because each team holds a narrow view of its own domain and expertise, there is always the danger lurking that the "big picture" angle will be missing. You don't want lack of communication to result in blame games and finger pointing.

Some problem detection methods include:

- Infrastructure Monitoring: specific resource utilization like disk, memory, CPU are effective for identifying availability failures – sometimes even heading those off before they happen.

- Domain or Application Tools: These help, but leave the issue that overall problem detection is still a game of hide-and-seek, a manually-intensive effort that comes under the pressure of needing a fix as quickly as possible.

- Dependency mapping tools, which map business services and applications to infrastructure components, can help you generate a topology map that will improve your root cause analysis process for the following reasons:

1. Connect Symptoms to Problems: A single map that relates a business service (user point of view) to its configuration items, will help you detect problems faster.

2. Common Ground: The map ties in all elements so that different groups can focus on a cross-domain effort.

3. High-Level, Cross-Domain View: Teams can view problems not only in the context of their domain, but in a wider view of all network components. For example, a database administrator analyzing a slow database performance problem can examine the topology map to see the effect of networking components on the database.

Root cause is a complex issue, so that no single tool or approach will provide you with full coverage. The idea is to plan a portfolio of tools that together deliver the most impact for your organization.

For instance, if you do not have a central event management console, then consider implementing a topology-based event management solution. If most of your applications involve online transactions, try to look for a transaction management product that covers the technology stack that is common in your environment. Put differently, select a combination of tools that are right for your environment.

Once you assess the tools that provide the most value, implement them in ascending order of value so that you get the biggest impact first.

Ariel Gordon is VP Products and Co-Founder of Neebula.

Share this

The Latest

September 23, 2022

In order to properly sort through all monitoring noise and identify true problems, their causes, and to prioritize them for response by the IT team, they have created and built a revolutionary new system using a meta-cognitive model ...

September 22, 2022

As we shift further into a digital-first world, where having a reliable online experience becomes more essential, Site Reliability Engineers remain in-demand among organizations of all sizes ... This diverse set of skills and values can be difficult to interview for. In this blog, we'll get you started with some example questions and processes to find your ideal SRE ...

September 21, 2022

US government agencies are bringing more of their employees back into the office and implementing hybrid work schedules, but federal workers are worried that their agencies' IT architectures aren't built to handle the "new normal." They fear that the reactive, manual methods used by the current systems in dealing with user, IT architecture and application problems will degrade the user experience and negatively affect productivity. In fact, according to a recent survey, many federal employees are concerned that they won't work as effectively back in the office as they did at home ...

September 20, 2022

Users today expect a seamless, uninterrupted experience when interacting with their web and mobile apps. Their expectations have continued to grow in tandem with their appetite for new features and consistent updates. Mobile apps have responded by increasing their release cadence by up to 40%, releasing a new full version of their app every 4-5 days, as determined in this year's SmartBear State of Software Quality | Application Stability Index report ...

September 19, 2022

In this second part of the blog series, we look at how adopting AIOps capabilities can drive business value for an organization ...

September 16, 2022

ITOPS and DevOps is in the midst of a surge of innovation. New devices and new systems are appearing at an unprecedented rate. There are many drivers of this phenomenon, from virtualization and containerization of applications and services to the need for improved security and the proliferation of 5G and IOT devices. The interconnectedness and the interdependencies of these technologies also greatly increase systems complexity and therefore increase the sheer volume of things that need to be integrated, monitored, and maintained ...

September 15, 2022

IT talent acquisition challenges are now heavily influencing technology investment decisions, according to new research from Salesforce's MuleSoft. The 2022 IT Leaders Pulse Report reveals that almost three quarters (73%) of senior IT leaders agree that acquiring IT talent has never been harder, and nearly all (98%) respondents say attracting IT talent influences their organization's technology investment choices ...

September 14, 2022

The findings of the 2022 Observability Forecast offer a detailed view of how this practice is shaping engineering and the technologies of the future. Here are 10 key takeaways from the forecast ...

September 13, 2022

Data professionals are spending 40% of their time evaluating or checking data quality and that poor data quality impacts 26% of their companies' revenue, according to The State of Data Quality 2022, a report commissioned by Monte Carlo and conducted by Wakefield Research ...

September 12, 2022
Statistically speaking, every year, slow-loading websites cost a staggering $2.6 billion in losses to their owners. Also, about 53% of the website visitors on mobile are likely to abandon the site if it takes more than 3 seconds to load. This is the reason why performance testing should be conducted rigorously on a website or web application in the SDLC before deployment ...