The team-of-experts approach to incident response was effective when network problems were less complex and everyone was part of the same organization. However, in recent years the process required for Root Cause Analysis (RCA) of network events and business application performance issues has become more difficult, obscured by infrastructural cloudiness and stakeholders residing in disparate departments, companies and geographies.
For many organizations, the task of quickly identifying root cause has become paramount to meeting Service Level Agreements (SLAs) and preventing customer churn. Yet, according to the Emulex Visibility Study, 79 percent of organizations have had events attributed to the wrong IT group, adding confusion and delays to the resolution of these issues.
This two-part series will explain a more fact-based, packet-analysis driven approach to Fault Domain Isolation (FDI), which is helping organizations troubleshoot and resolve network and application performance incidents.
Outsourcing Takes Over
It was hard enough getting visibility into what was actually happening when the entire infrastructure was owned and controlled by a single organization. With the rapid expansion of outsourcing, there are a growing number of blind spots developing throughout end-to-end business applications. When an entire technology tier is outsourced, what you have is a massive blind spot keeping you from performing root cause analysis within that technology domain. To accommodate outsourced technology, organizations must clearly define the purpose and requirements of the Fault Domain Isolation stage of the incident response workflow compared to the Root Cause Analysis stage.
Understanding FDI
The motivation behind FDI is easy to understand because anyone who’s gone to the doctor has seen it in action. An “incident investigation” in healthcare typically starts with a process that is essentially FDI. A general practitioner performs an initial assessment, orders diagnostic tests, and evaluates the results. The patient is sent to a specialist for additional diagnosis and treatment only if there is sufficient evidence to justify it. Facts, not guesswork, drive the diagnostic process.
Organizations that deploy FDI seek to minimize the number and type of technology experts involved in each incident, which is why FDI should precede RCA. The goal is to identify exactly one suspect technology tier before starting the deep dive search for root cause.
Why isolate by technology? Because that is how departments (and outsourcing) are typically organized, and how you quickly reduce the number of people involved. By implicating just one fault domain, you eliminate entire departments and external organizations from being tied up in the investigation; just as you wouldn’t pull in a neurosurgeon to examine a broken toe!
A key goal of FDI is to stop the “passing the buck” phenomenon in its tracks. For FDI to be effective it must provide irrefutable evidence that root cause lies in the “suspect” sub-system or technology tier, and just as importantly, that the same evidence confirms root cause is highly unlikely to lie anywhere else. This is especially important when the fault domain lies in an outsourced technology.
When handing the problem over to the responsible team or service provider, effective FDI also provides technology-specific, actionable data. It supplies the context, symptoms, and information needed for the technology team to immediately begin their deep dive search for root cause within the system for which they are responsible.
Exactly One Set of Facts
In order to be efficient and effective, FDI requires its analysis to be based on the actual packet data exchanged between the technology tiers. Packets don’t lie, nor do they obscure the critical details in averages or statistics. And having the underlying packets as evidence ensures the FDI process assigns irrefutable responsibility to the faulty technology tier.
Primary FDI – the act of assigning the incident to a specific technology team or outsourced service provider – is exceedingly cost effective to implement because its goal is relatively modest: to allocate incidents among a handful of departments or teams, plus any outsourced services. In practice, it involves relatively few technology tiers, a manageable number of tap points in the network, and a few network recorders monitoring between each technology tier.
Jeff Brown is Global Director of Training, NVP at Emulex.