The team-of-experts approach to incident response was effective when network problems were less complex and everyone was part of the same organization. However, in recent years the process required for Root Cause Analysis (RCA) of network events and business application performance issues has become more difficult, obscured by infrastructural cloudiness and stakeholders residing in disparate departments, companies and geographies.
For many organizations, the task of quickly identifying root cause has become paramount to meeting Service Level Agreements (SLAs) and preventing customer churn. Yet, according to the Emulex Visibility Study, 79 percent of organizations have had events attributed to the wrong IT group, adding confusion and delays to the resolution of these issues.
This two-part series will explain a more fact-based, packet-analysis-driven approach to Fault Domain Isolation (FDI), which is helping organizations troubleshoot and resolve network and application performance incidents.
Outsourcing Takes Over
It was hard enough getting visibility into what was actually happening when the entire infrastructure was owned and controlled by a single organization. With the rapid expansion of outsourcing, a growing number of blind spots are developing throughout end-to-end business applications. When an entire technology tier is outsourced, it becomes one massive blind spot that keeps you from performing root cause analysis within that technology domain. To accommodate outsourced technology, organizations must clearly define the purpose and requirements of the Fault Domain Isolation stage of the incident response workflow, and how they differ from those of the Root Cause Analysis stage.
Understanding FDI
The motivation behind FDI is easy to understand because anyone who’s gone to the doctor has seen it in action. An “incident investigation” in healthcare typically starts with a process that is essentially FDI. A general practitioner performs an initial assessment, orders diagnostic tests, and evaluates the results. The patient is sent to a specialist for additional diagnosis and treatment only if there is sufficient evidence to justify it. Facts, not guesswork, drive the diagnostic process.
Organizations that deploy FDI seek to minimize the number and type of technology experts involved in each incident, which is why FDI should precede RCA. The goal is to identify exactly one suspect technology tier before starting the deep-dive search for root cause.
Why isolate by technology? Because that is how departments (and outsourcing) are typically organized, and it is how you quickly reduce the number of people involved. By implicating just one fault domain, you keep entire departments and external organizations from being tied up in the investigation, just as you wouldn’t pull in a neurosurgeon to examine a broken toe!
A key goal of FDI is to stop the “passing the buck” phenomenon in its tracks. For FDI to be effective it must provide irrefutable evidence that root cause lies in the “suspect” sub-system or technology tier, and just as importantly, that the same evidence confirms root cause is highly unlikely to lie anywhere else. This is especially important when the fault domain lies in an outsourced technology.
When handing the problem over to the responsible team or service provider, effective FDI also provides technology-specific, actionable data. It supplies the context, symptoms, and information needed for the technology team to immediately begin their deep dive search for root cause within the system for which they are responsible.
Exactly One Set of Facts
To be efficient and effective, FDI must base its analysis on the actual packet data exchanged between the technology tiers. Packets don’t lie, nor do they obscure the critical details in averages or statistics. And having the underlying packets as evidence ensures the FDI process assigns irrefutable responsibility to the faulty technology tier.
Primary FDI – the act of assigning the incident to a specific technology team or outsourced service provider – is highly cost-effective to implement because its goal is relatively modest: to allocate incidents among a handful of departments or teams, plus any outsourced services. In practice, it involves relatively few technology tiers, a manageable number of tap points in the network, and a few network recorders monitoring between each technology tier.
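To make the idea of tap points between tiers concrete, here is a minimal Python sketch of how timestamps recorded at each tap could be used to attribute one slow transaction to a single tier. It is an illustration only, not a method described in this article or a feature of any particular product. The tap names, the per-tier baselines, and the tolerance multiplier are all hypothetical, and the sketch assumes the request and response packets for one transaction have already been matched at every tap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapObservation:
    """Timestamps (in seconds) for one transaction as seen at one tap point."""
    tap: str            # hypothetical name of the tier sitting behind this tap
    request_ts: float   # when the request packet passed the tap
    response_ts: float  # when the matching response packet passed the tap


def time_per_tier(observations):
    """Attribute elapsed time to each tier.

    `observations` must be ordered from the client side inward. The time a
    tier is responsible for is the elapsed time seen at its own tap minus
    the elapsed time seen at the next tap further in. The innermost tier
    keeps all of the time seen at its tap.
    """
    elapsed = [(o.tap, o.response_ts - o.request_ts) for o in observations]
    per_tier = {}
    for i, (tap, total) in enumerate(elapsed):
        downstream = elapsed[i + 1][1] if i + 1 < len(elapsed) else 0.0
        per_tier[tap] = total - downstream
    return per_tier


def isolate_fault_domain(per_tier, baselines, tolerance=2.0):
    """Return the one suspect tier, or None if the evidence is ambiguous.

    A tier is a suspect when its time exceeds `tolerance` times its
    baseline (both values are assumptions chosen for illustration).
    Returning None when zero or several tiers qualify reflects the goal
    of implicating exactly one fault domain.
    """
    suspects = [tier for tier, secs in per_tier.items()
                if secs > tolerance * baselines[tier]]
    return suspects[0] if len(suspects) == 1 else None


# Hypothetical slow transaction observed at three taps.
observations = [
    TapObservation("web", 0.000, 2.450),
    TapObservation("app", 0.010, 2.430),
    TapObservation("db", 0.030, 2.400),
]
baselines = {"web": 0.05, "app": 0.10, "db": 0.20}  # assumed normal seconds

per_tier = time_per_tier(observations)
suspect = isolate_fault_domain(per_tier, baselines)
```

In this made-up example almost all of the delay is observed behind the innermost tap, so the same set of timestamps both implicates that tier and shows the other two behaving normally. A real deployment would work from many transactions rather than one and would need to account for clock differences between recorders.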
Jeff Brown is Global Director of Training, NVP at Emulex.