Skip to main content

Observability Is No Place for Tunnel Vision

Jeremy Burton
Observe

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Siloes

Observability has traditionally been about correlating the "Three Pillars," machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action on the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Modern distributed systems with siloed legacy tools are about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe

The Latest

I've spent a lot of time in the channel, and one thing I keep coming back to is this: a partner program is only as good as what it looks like in the field. Many programs look great on paper, but when a partner is in front of a customer navigating a complex hybrid environment or trying to make the case for AI-powered observability, the gap between what a vendor promises and what it actually delivers becomes very clear, very fast ...

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...

Observability Is No Place for Tunnel Vision

Jeremy Burton
Observe

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Siloes

Observability has traditionally been about correlating the "Three Pillars," machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action on the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Modern distributed systems with siloed legacy tools are about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe

The Latest

I've spent a lot of time in the channel, and one thing I keep coming back to is this: a partner program is only as good as what it looks like in the field. Many programs look great on paper, but when a partner is in front of a customer navigating a complex hybrid environment or trying to make the case for AI-powered observability, the gap between what a vendor promises and what it actually delivers becomes very clear, very fast ...

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...