Fault Domain Isolation Key to Avoiding Network Blame Game - Part 1
April 13, 2015

Jeff Brown
Emulex


The team-of-experts approach to incident response was effective when network problems were less complex and everyone was part of the same organization. In recent years, however, Root Cause Analysis (RCA) of network events and business application performance issues has become more difficult. It is obscured by cloud-based infrastructure and by stakeholders spread across disparate departments, companies and geographies.
 
For many organizations, the task of quickly identifying root cause has become paramount to meeting Service Level Agreements (SLAs) and preventing customer churn. Yet, according to the Emulex Visibility Study, 79 percent of organizations have had events attributed to the wrong IT group, adding confusion and delays to the resolution of these issues.
 
This two-part series will explain a more fact-based, packet-analysis driven approach to Fault Domain Isolation (FDI), which is helping organizations troubleshoot and resolve network and application performance incidents.

Outsourcing Takes Over

It was hard enough getting visibility into what was actually happening when the entire infrastructure was owned and controlled by a single organization. With the rapid expansion of outsourcing, a growing number of blind spots is developing throughout end-to-end business applications. When an entire technology tier is outsourced, it becomes a massive blind spot that keeps you from performing root cause analysis within that technology domain. To accommodate outsourced technology, organizations must clearly define the purpose and requirements of the Fault Domain Isolation stage of the incident response workflow, as distinct from the Root Cause Analysis stage.

Understanding FDI

The motivation behind FDI is easy to understand because anyone who’s gone to the doctor has seen it in action. An “incident investigation” in healthcare typically starts with a process that is essentially FDI. A general practitioner performs an initial assessment, orders diagnostic tests, and evaluates the results. The patient is sent to a specialist for additional diagnosis and treatment only if there is sufficient evidence to justify it. Facts, not guesswork, drive the diagnostic process.

Organizations that deploy FDI seek to minimize the number and type of technology experts involved in each incident, which is why FDI should precede RCA. The goal is to identify exactly one suspect technology tier before starting the deep dive search for root cause.

Why isolate by technology? Because that is how departments (and outsourcing) are typically organized, and it is how you quickly reduce the number of people involved. By implicating just one fault domain, you free entire departments and external organizations from being tied up in the investigation, just as you wouldn’t pull in a neurosurgeon to examine a broken toe.

A key goal of FDI is to stop the “passing the buck” phenomenon in its tracks. For FDI to be effective it must provide irrefutable evidence that root cause lies in the “suspect” sub-system or technology tier, and just as importantly, that the same evidence confirms root cause is highly unlikely to lie anywhere else. This is especially important when the fault domain lies in an outsourced technology.

When handing the problem over to the responsible team or service provider, effective FDI also provides technology-specific, actionable data. It supplies the context, symptoms, and information needed for the technology team to immediately begin their deep dive search for root cause within the system for which they are responsible.

Exactly One Set of Facts

In order to be efficient and effective, FDI requires its analysis to be based on the actual packet data exchanged between the technology tiers. Packets don’t lie, nor do they obscure the critical details in averages or statistics. And having the underlying packets as evidence ensures the FDI process assigns irrefutable responsibility to the faulty technology tier.

Primary FDI – the act of assigning the incident to a specific technology team or outsourced service provider – is exceedingly cost effective to implement because its goal is relatively modest: to allocate incidents among a handful of departments or teams, plus any outsourced services. In practice, it involves relatively few technology tiers, a manageable number of tap points in the network, and a few network recorders monitoring between each technology tier.
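
As a rough illustration of the idea above, the following Python sketch attributes the elapsed time of one transaction to the tier behind each tap point and names a suspect only when exactly one tier stands out. It is a minimal sketch under stated assumptions, not a description of any particular product: the tap and tier names, the timestamps, the baselines and the tolerance factor are all hypothetical, and it assumes the recorders have already matched the request and response packets of the same transaction at each tap.

```python
from dataclasses import dataclass


@dataclass
class TapObservation:
    tap: str            # hypothetical tap-point name
    request_ts: float   # seconds: request packet seen at this tap
    response_ts: float  # seconds: matching response packet seen at this tap


def time_per_tier(observations, tier_names):
    """Split elapsed time among tiers.

    `observations` is ordered from the client side inward, and
    `tier_names[i]` is the tier sitting directly behind tap i.
    The time attributed to a tier is the elapsed time seen at its
    tap minus the elapsed time seen at the next tap downstream.
    """
    elapsed = [o.response_ts - o.request_ts for o in observations]
    per_tier = {}
    for i, tier in enumerate(tier_names):
        downstream = elapsed[i + 1] if i + 1 < len(elapsed) else 0.0
        per_tier[tier] = elapsed[i] - downstream
    return per_tier


def isolate_fault_domain(per_tier, baseline, tolerance=2.0):
    """Return the single tier exceeding `tolerance` x its baseline.

    Returns None when zero or several tiers exceed it, because the
    evidence then does not point at exactly one fault domain.
    """
    suspects = [t for t, v in per_tier.items() if v > tolerance * baseline[t]]
    return suspects[0] if len(suspects) == 1 else None


# Hypothetical measurements for one slow transaction.
observations = [
    TapObservation("client-web", 0.000, 1.250),
    TapObservation("web-app", 0.010, 1.230),
    TapObservation("app-db", 0.030, 1.200),
]
tiers = ["web", "application", "database"]
baseline = {"web": 0.02, "application": 0.05, "database": 0.10}

per_tier = time_per_tier(observations, tiers)
suspect = isolate_fault_domain(per_tier, baseline)
```

With these made-up numbers, nearly all of the delay falls behind the last tap, so only the database tier exceeds its baseline and is handed the incident. The same per-tier figures also show that the web and application tiers behaved normally, which is the second half of the evidence FDI needs.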

Read Part 2 of this blog, which identifies some of the hang-ups of adopting FDI, as well as best practices.

Jeff Brown is Global Director of Training, NVP at Emulex.
