Skip to main content

Observability Is No Place for Tunnel Vision

Jeremy Burton
Observe

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Siloes

Observability has traditionally been about correlating the "Three Pillars," machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action on the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Modern distributed systems with siloed legacy tools are about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe

The Latest

While 87% of manufacturing leaders and technical specialists report that ROI from their AIOps initiatives has met or exceeded expectations, only 37% say they are fully prepared to operationalize AI at scale, according to The Future of IT Operations in the AI Era, a report from Riverbed ...

Many organizations rely on cloud-first architectures to aggregate, analyze, and act on their operational data ... However, not all environments are conducive to cloud-first architectures ... There are limitations to cloud-first architectures that render them ineffective in mission-critical situations where responsiveness, cost control, and data sovereignty are non-negotiable; these limitations include ...

For years, cybersecurity was built around a simple assumption: protect the physical network and trust everything inside it. That model made sense when employees worked in offices, applications lived in data centers, and devices rarely left the building. Today's reality is fluid: people work from everywhere, applications run across multiple clouds, and AI-driven agents are beginning to act on behalf of users. But while the old perimeter dissolved, a new one quietly emerged ...

For years, infrastructure teams have treated compute as a relatively stable input. Capacity was provisioned, costs were forecasted, and performance expectations were set based on the assumption that identical resources behaved identically. That mental model is starting to break down. AI infrastructure is no longer behaving like static cloud capacity. It is increasingly behaving like a market ...

Resilience can no longer be defined by how quickly an organization recovers from an incident or disruption. The effectiveness of any resilience strategy is dependent on its ability to anticipate change, operate under continuous stress, and adapt confidently amid uncertainty ...

Mobile users are less tolerant of app instability than ever before. According to a new report from Luciq, No Margin for Error: What Mobile Users Expect and What Mobile Leaders Must Deliver in 2026, even minor performance issues now result in immediate abandonment, lost purchases, and long-term brand impact ...

Artificial intelligence (AI) has become the dominant force shaping enterprise data strategies. Boards expect progress. Executives expect returns. And data leaders are under pressure to prove that their organizations are "AI-ready" ...

Agentic AI is a major buzzword for 2026. Many tech companies are making bold promises about this technology, but many aren't grounded in reality, at least not yet. This coming year will likely be shaped by reality checks for IT teams, and progress will only come from a focus on strong foundations and disciplined execution ...

AI systems are still prone to hallucinations and misjudgments ... To build the trust needed for adoption, AI must be paired with human-in-the-loop (HITL) oversight, or checkpoints where humans verify, guide, and decide what actions are taken. The balance between autonomy and accountability is what will allow AI to deliver on its promise without sacrificing human trust ...

More data center leaders are reducing their reliance on utility grids by investing in onsite power for rapidly scaling data centers, according to the Data Center Power Report from Bloom Energy ...

Observability Is No Place for Tunnel Vision

Jeremy Burton
Observe

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Siloes

Observability has traditionally been about correlating the "Three Pillars," machine-generated logs, metrics and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and cut and paste what they see to an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action on the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Modern distributed systems with siloed legacy tools are about as effective as summing up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe

The Latest

While 87% of manufacturing leaders and technical specialists report that ROI from their AIOps initiatives has met or exceeded expectations, only 37% say they are fully prepared to operationalize AI at scale, according to The Future of IT Operations in the AI Era, a report from Riverbed ...

Many organizations rely on cloud-first architectures to aggregate, analyze, and act on their operational data ... However, not all environments are conducive to cloud-first architectures ... There are limitations to cloud-first architectures that render them ineffective in mission-critical situations where responsiveness, cost control, and data sovereignty are non-negotiable; these limitations include ...

For years, cybersecurity was built around a simple assumption: protect the physical network and trust everything inside it. That model made sense when employees worked in offices, applications lived in data centers, and devices rarely left the building. Today's reality is fluid: people work from everywhere, applications run across multiple clouds, and AI-driven agents are beginning to act on behalf of users. But while the old perimeter dissolved, a new one quietly emerged ...

For years, infrastructure teams have treated compute as a relatively stable input. Capacity was provisioned, costs were forecasted, and performance expectations were set based on the assumption that identical resources behaved identically. That mental model is starting to break down. AI infrastructure is no longer behaving like static cloud capacity. It is increasingly behaving like a market ...

Resilience can no longer be defined by how quickly an organization recovers from an incident or disruption. The effectiveness of any resilience strategy is dependent on its ability to anticipate change, operate under continuous stress, and adapt confidently amid uncertainty ...

Mobile users are less tolerant of app instability than ever before. According to a new report from Luciq, No Margin for Error: What Mobile Users Expect and What Mobile Leaders Must Deliver in 2026, even minor performance issues now result in immediate abandonment, lost purchases, and long-term brand impact ...

Artificial intelligence (AI) has become the dominant force shaping enterprise data strategies. Boards expect progress. Executives expect returns. And data leaders are under pressure to prove that their organizations are "AI-ready" ...

Agentic AI is a major buzzword for 2026. Many tech companies are making bold promises about this technology, but many aren't grounded in reality, at least not yet. This coming year will likely be shaped by reality checks for IT teams, and progress will only come from a focus on strong foundations and disciplined execution ...

AI systems are still prone to hallucinations and misjudgments ... To build the trust needed for adoption, AI must be paired with human-in-the-loop (HITL) oversight, or checkpoints where humans verify, guide, and decide what actions are taken. The balance between autonomy and accountability is what will allow AI to deliver on its promise without sacrificing human trust ...

More data center leaders are reducing their reliance on utility grids by investing in onsite power for rapidly scaling data centers, according to the Data Center Power Report from Bloom Energy ...