Observability Is No Place for Tunnel Vision

Jeremy Burton
Observe

Imagine you're blindfolded and dropped into the Marina District of San Francisco. Upon removing your blindfold, you would probably first look around to get your bearings. You might see the Golden Gate Bridge to the northwest, the Transamerica Pyramid to the southeast and Golden Gate Park to the southwest. Based on your perspective, you'd probably be able to deduce your approximate location by correlating multiple familiar data points.

Now imagine you're dropped into an entirely unfamiliar city and given two photos to help you figure out where you are. The photos are grainy and dark images of places you've never seen before. With so little to go on, your chances of success are next to zero.

The second scenario is an apt analogy for the way most site reliability engineering teams operate today. They use a collection of dis-integrated tools to try to diagnose problems they've never seen before. Each day is a new unknown city or unfamiliar neighborhood.

In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability — the practice of collecting data from various aspects of a computer system, application, or infrastructure to understand its performance and identify and resolve issues — requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards.

Busting Silos

Observability has traditionally been about correlating the "Three Pillars": machine-generated logs, metrics, and traces. Over the years, vendors of observability suites have pieced together point tools to measure these elements, often through acquisitions and siloed development projects. The result is a mishmash of isolated data points connected loosely through dashboards and broken up into more than a dozen discrete practices.

Each tool is designed to operate on a specific type of data, and the tools often don't communicate well with each other. For example, a spike in error logs can tell you that something is wrong, but it won't necessarily give you the contextual information to understand the root cause of the issue. Humans must do that.

In a typical observability scenario, site reliability engineers (SREs), DevOps engineers and administrators pore over their tool of choice and copy and paste what they see into an incident channel on Slack. Then, a person with a big brain — every company has one — tries to connect the dots across multiple screenshots to get at the root cause.

This is madness. Cloud-native applications are composed of independently built and deployed microservices that change daily or even multiple times per day. Many of the problems SREs wrestle with have never been seen before. There is no dashboard or alert for an "unknown" problem, just symptoms with little context. Troubleshooting has never been harder.

To investigate unknown problems, SREs must be able to quickly correlate data points for symptoms they are seeing. Traditional methods of correlating data, such as tagging, simply don't work with complex distributed architectures. Tags are not maintainable at any kind of scale and, even if they were, cardinality issues quickly ensue when, for example, customer counts reach tens or hundreds of thousands. This typically breaks any traditional tooling based on in-memory databases or, even if it doesn't, causes tooling costs to explode.
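
To make the scale problem concrete, here is a back-of-the-envelope sketch in Python. The figures are purely illustrative assumptions, but they show why adding a per-customer tag to a label-based metrics model multiplies the number of time series a traditional in-memory tool must track:

# Hypothetical illustration of the cardinality explosion described above.
# In a label-based metrics model, every unique combination of label values
# becomes its own time series that the backend must hold and index.

services = 50          # microservices emitting a request-latency metric (assumed)
endpoints = 20         # endpoints per service (assumed)
status_codes = 5       # 2xx, 3xx, 4xx, 5xx, timeout (assumed)

base_series = services * endpoints * status_codes
print(f"series without a customer tag: {base_series:,}")              # 5,000

customers = 100_000    # now add a customer_id label and the math explodes
print(f"series with a customer tag:    {base_series * customers:,}")  # 500,000,000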

That's why, despite the $17 billion organizations pour into monitoring, logging and application performance management tools each year, the average mean time to resolution (MTTR) has barely budged.

Beyond the Obvious

The whole point of observability is to investigate unknown issues by seeing non-obvious relationships between data elements. You can't do that with siloed data, even if you have the requisite logs, metrics and traces.

To use our tunnel-vision analogy, a tranquil day in Golden Gate Park doesn't explain why there's a traffic jam on the Golden Gate Bridge. The two may be related, but looking at one in isolation doesn't reveal the root cause. The gridlock may be caused by a breakdown on Highway 101 three miles downstream, a protest march, a fog bank, or police action in the Presidio. Identifying the root cause of such a complex problem requires collecting more than just data about known traffic patterns. In the same way, troubleshooting outages and performance problems in complex IT environments requires collecting non-traditional data, such as which customers are affected, what's going on elsewhere in the company, and how consequential the problem is to the business. Those seemingly unrelated variables need to be integrated with the Three Pillars and presented in a comprehensive view.

Traditional observability suites don't deliver the integrated view organizations need to see the big picture of their application and infrastructure estates. However, modern data lakes and elastic compute engines make it possible at a fraction of the cost of just a few years ago.

More Than Three Pillars

Organizations need to think beyond the traditional framework and adopt a more holistic approach to observability. A unified observability offering breaks down silos by integrating logs, metrics and traces in a single platform. But it doesn't stop there. Using a modern data lake, it can incorporate any information that may be relevant to troubleshooting teams and even fold in non-obvious contextual data such as user behavior, business metrics, and code deployments.
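
As a rough illustration of what that kind of correlation can look like, the following Python sketch uses pandas as a stand-in for a data-lake query engine. The tables, columns, and figures are hypothetical, not any particular product's schema:

# Minimal sketch, assuming hypothetical tables for error logs, code
# deployments, and business impact already landed in a shared data lake.
import pandas as pd

errors = pd.DataFrame({
    "service": ["checkout", "checkout", "search"],
    "minute": pd.to_datetime(["2025-06-01 10:03", "2025-06-01 10:04", "2025-06-01 10:04"]),
    "error_count": [120, 340, 8],
})

deploys = pd.DataFrame({
    "service": ["checkout"],
    "deployed_at": pd.to_datetime(["2025-06-01 10:02"]),
    "version": ["v2.41.0"],
})

business = pd.DataFrame({
    "service": ["checkout", "search"],
    "affected_customers": [1800, 12],
    "revenue_at_risk_usd": [52000, 300],
})

# Join the error spike with the most recent deployment and the business impact,
# so one view answers "what broke", "what changed", and "how much it matters".
impact = (errors.merge(deploys, on="service", how="left")
                .merge(business, on="service", how="left"))
print(impact.sort_values("error_count", ascending=False))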

Cloud-native solutions adapt as environments grow and change. Real-time data collection ensures that engineers always have access to the latest version of the truth. Generative AI simplifies queries and can dynamically generate "next steps" that should be taken to investigate and resolve incidents.

Troubleshooting modern distributed systems with siloed legacy tools is about as effective as trying to sum up the grandeur of a world-class city in a few snapshots. Success means widening your aperture, stepping back, and taking in a panoramic view.

Jeremy Burton is CEO of Observe
