Observability Is No Place for Tunnel Vision
November 15, 2023

Jeremy Burton
Observe

Share this

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Siloes

Observability has traditionally been about correlating the "Three Pillars," machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action on the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Modern distributed systems with siloed legacy tools are about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe
Share this

The Latest

December 05, 2023

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2024. Part 2 covers more on Observability ...

December 04, 2023

The Holiday Season means it is time for APMdigest's annual list of Application Performance Management (APM) predictions, covering IT performance topics. Industry experts — from analysts and consultants to the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM, observability, AIOps and related technologies will evolve and impact business in 2024. Part 1 covers APM and Observability ...

November 30, 2023

To help you stay on top of the ever-evolving tech scene, Automox IT experts shake the proverbial magic eight ball and share their predictions about tech trends in the coming year. From M&A frenzies to sustainable tech and automation, these forecasts paint an exciting picture of the future ...

November 29, 2023
The past few years have presented numerous challenges for businesses: a pandemic, rising interest rates, supply chain disruptions, and geopolitical conflict that sent shockwaves across the global economy. But change may finally be on the horizon. According to a recent report by Endava ... a majority of executives confirmed they are feeling optimistic about the current business climate, and as a result, are forecasting larger IT budgets, increased technology funding and rollout, and prioritized innovation in the coming year ...
November 28, 2023

Incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents ...

November 27, 2023

Today, in the world of enterprise technology, the challenges posed by legacy Virtual Desktop Infrastructure (VDI) systems have long been a source of concern for IT departments. In many instances, this promising solution has become an organizational burden, hindering progress, depleting resources, and taking a psychological and operational toll on employees ...

November 22, 2023

Within retail organizations across the world, IT teams will be bracing themselves for a hectic holiday season ... While this is an exciting opportunity for retailers to boost sales, it also intensifies severe risk. Any application performance slipup will cause consumers to turn their back on brands, possibly forever. Online shoppers will be completely unforgiving to any retailer who doesn't deliver a seamless digital experience ...

November 21, 2023

Black Friday is a time when consumers can cash in on some of the biggest deals retailers offer all year long ... Nearly two-thirds of consumers utilize a retailer's web and mobile app for holiday shopping, raising the stakes for competitors to provide the best online experience to retain customer loyalty. Perforce's 2023 Black Friday survey sheds light on consumers' expectations this time of year and how developers can properly prepare their applications for increased online traffic ...

November 20, 2023

This holiday shopping season, the stakes for online retailers couldn't be higher ... Even an hour or two of downtime for a digital storefront during this critical period can cost millions in lost revenue and has the potential to damage brand credibility. Savvy retailers are increasingly investing in observability to help ensure a seamless, omnichannel customer experience. Just ahead of the holiday season, New Relic released its State of Observability for Retail report, which offers insight and analysis on the adoption and business value of observability for the global retail/consumer industry ...

November 16, 2023

As organizations struggle to find and retain the talent they need to manage complex cloud implementations, many are leaning toward hybrid cloud as a solution ... While it's true that using the cloud is not a "one size fits all" proposition, it is clear that both large and small companies prefer a hybrid cloud model ...