
Observability Gaps Are Costing You. Here's How to Fix Them - Fast

Mehdi Daoudi
Catchpoint

It's 7 in the morning. You get an alert from your team. A critical service is down. Yet, your monitoring systems show no critical alerts. Where is the problem? You are considering calling for a war room. It will be a massive distraction for the best people on your team, but what option do you have?

Your next thought may be: why was this not caught with our APM tools? We spend a fortune on them. These incidents are happening in places you cannot see. From global SaaS disruptions to regional ISP failures, to APIs your systems rely on and cloud services in multiple availability zones, the Internet has become a critical extension of enterprise infrastructure. And yet, many teams are still relying on legacy observability strategies that were never built for internet-centric dependencies. The result? Ongoing blind spots, impatient users, and increasing operational costs.

And then there are the micro-outages your team might not even be aware of: regional incidents, small hiccups, and process failures that often go undetected and unreported. Like paper cuts, they cut into user satisfaction, damage the business, and erode trust. In fact, our annual Internet Resilience Report found that, in 2025, one in eight businesses loses over $10 million a month to disruptions, and half lose over $1 million a month.

It's clear a new approach is needed: one that complements APM's detailed view into code, infrastructure, and events with a broad view of the internet stack and more user-centric monitoring. Let's break down how organizations are taking this approach to close the gaps, and why the cost of ignoring them is only getting higher.

Why Observability Is Falling Short

Observability tools focus mainly on internal systems (servers, containers, microservices) and on the metrics, events, logs, and traces (MELT) those systems emit. But the modern enterprise is no longer built only on custom applications running on infrastructure it manages itself. Cloud apps, SaaS platforms, APIs, and third-party services are now integral to delivering digital experiences. And all of them rely on the health of the Internet: DNS, SSL, BGP, routing, ISPs, etc.

That's where APM alone starts to fall short. APM tools were not built to monitor massively distributed, service-oriented, multi-party applications. They offer insufficient insight into the routes, external services, internet protocols, and regional performance that determine whether users can access your app at all.

In fact, what really matters isn't backend-system health; it's real-world user experience. The customer waiting at the rental car counter doesn't care that your servers are humming along at 72% CPU utilization. They care that they need to get to a meeting and the person on the other side of the counter says, "Sorry, my computer is slow today." And if you can't tell whether the root cause is your code, your cloud provider, the local internet, DNS resolution times, latency for an API, or a BGP routing issue in some part of the world, you don't have the visibility you need.
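To make the "which layer is it?" question concrete, here is a minimal sketch of an outside-in check that times each phase of a single HTTPS request separately. It uses only the Python standard library. The names `PhaseTimings`, `probe`, and `likely_layer`, and the 500 ms threshold, are illustrative assumptions, not part of any APM or IPM product, and a real IPM platform would run checks like this from many vantage points rather than one machine.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhaseTimings:
    """Milliseconds spent in each phase of one check (None = phase not completed)."""
    dns_ms: Optional[float]
    connect_ms: Optional[float]
    tls_ms: Optional[float]
    first_byte_ms: Optional[float]


def _elapsed_ms(start: float) -> float:
    return (time.perf_counter() - start) * 1000


def probe(host: str, path: str = "/", timeout: float = 5.0) -> PhaseTimings:
    """Time DNS lookup, TCP connect, TLS handshake, and first response byte."""
    t = PhaseTimings(None, None, None, None)
    sock = None
    try:
        start = time.perf_counter()
        sockaddr = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)[0][4]
        t.dns_ms = _elapsed_ms(start)

        start = time.perf_counter()
        sock = socket.create_connection(sockaddr[:2], timeout=timeout)
        t.connect_ms = _elapsed_ms(start)

        start = time.perf_counter()
        sock = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
        t.tls_ms = _elapsed_ms(start)

        start = time.perf_counter()
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # block until the first byte of the response arrives
        t.first_byte_ms = _elapsed_ms(start)
    except OSError:
        pass  # DNS, socket, and TLS errors all land here; later phases stay None
    finally:
        if sock is not None:
            sock.close()
    return t


def likely_layer(t: PhaseTimings, slow_ms: float = 500.0) -> str:
    """Return the first phase that failed or exceeded the (assumed) threshold."""
    phases = [
        ("dns", t.dns_ms),
        ("network", t.connect_ms),
        ("tls", t.tls_ms),
        ("application", t.first_byte_ms),
    ]
    for name, ms in phases:
        if ms is None:
            return f"{name}: failed"
        if ms > slow_ms:
            return f"{name}: slow"
    return "ok"
```

Calling `likely_layer(probe("example.com"))` from a user's region gives a first hint about whether to look at DNS, the network path, TLS, or the application itself. It does not replace BGP or ISP-level data, which a single-host check cannot see.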

APM + IPM = End-to-End Visibility

To solve this, forward-looking enterprises are closing their visibility gap by complementing APM tools with Internet Performance Monitoring (IPM). On one side, APM delivers the inside-out view, including instrumentation, tracing, and system health. On the other, IPM offers the outside-in perspective, including real user experience, the health of the global Internet, and proactive testing of everything that may impact a user, including first- and third-party dependencies, from APIs to cloud services to VPNs to database timeouts.

Together, they provide true end-to-end observability, a model that is already proving invaluable for global enterprises like SAP, IKEA, and Akamai. APM tools paired with IPM are delivering the unified view of performance that teams need, from the application code to the end user's screen, wherever in the world that user is.

With this approach, teams move faster and resolve issues more rapidly. They also align better with business outcomes by making customer experience KPIs the primary objective of observability teams. For instance, they can measure the impact of outages on customer satisfaction and revenue, not just uptime and latency.
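As a rough illustration of tying an outage to revenue rather than to uptime alone, the sketch below estimates revenue at risk from failed user sessions. The function name, the formula, and every input figure are hypothetical assumptions chosen for the example; they are not taken from the Internet Resilience Report or from any customer data.

```python
def revenue_at_risk(
    sessions_per_min: float,
    failure_rate: float,
    conversion_rate: float,
    avg_order_value: float,
    outage_minutes: float,
) -> float:
    """Estimate revenue lost to sessions that failed during an incident.

    Assumes failed sessions would have converted at the normal rate,
    which is a simplification.
    """
    failed_sessions = sessions_per_min * outage_minutes * failure_rate
    return failed_sessions * conversion_rate * avg_order_value


# Hypothetical regional incident: 200 sessions/min, 25% of them failing,
# 2% normal conversion, $80 average order, lasting 30 minutes.
estimate = revenue_at_risk(200, 0.25, 0.02, 80.0, 30)
print(round(estimate))  # 2400
```

Even a coarse estimate like this lets a team rank a 30-minute regional degradation against other incidents by business impact, which uptime percentages alone do not show.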

The Role of OpenTelemetry

If APM and IPM are the two sides of the observability coin, OpenTelemetry is the glue that binds them. OTel has emerged as the de facto standard for integrating monitoring data, including traces, logs, and metrics, from multiple components of an ecosystem. Its adoption is accelerating because it helps teams break vendor lock-in, standardize data collection, and reduce the cost of managing multiple tools.

In fact, most enterprises now require OTel support as a prerequisite for any observability solution. The best outcomes happen when OpenTelemetry is part of a broader strategy that includes governance, platform selection, and integration with both APM and IPM tools.

As an example, an OTel SDK in a native mobile application could feed telemetry to both APM and IPM systems, and both of these could in turn feed a central system with a unified dashboard, an alerting system, or an AIOps system. What is possible with OTel is growing and becoming more practical over time.
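The fan-out idea in that example can be sketched in a few lines. The code below is a conceptual stand-in written in plain Python, not the OpenTelemetry SDK: `Span`, `InMemoryExporter`, and `FanOutPipeline` are hypothetical names, and the in-memory lists take the place of network exports to real APM and IPM backends. In an actual OTel deployment the same shape is usually achieved by attaching more than one exporter to a single pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class Span:
    """A simplified telemetry record (real OTel spans carry far more context)."""
    name: str
    duration_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Stand-in for a backend; a real exporter would send data over the network."""

    def __init__(self, backend: str) -> None:
        self.backend = backend
        self.received: List[Span] = []

    def export(self, span: Span) -> None:
        self.received.append(span)


class FanOutPipeline:
    """Deliver every span emitted by the app to all registered exporters."""

    def __init__(self, exporters: Iterable[Exporter]) -> None:
        self._exporters = list(exporters)

    def emit(self, span: Span) -> None:
        for exporter in self._exporters:
            exporter.export(span)


# The app instruments once; both backends see the same record.
apm = InMemoryExporter("apm")
ipm = InMemoryExporter("ipm")
pipeline = FanOutPipeline([apm, ipm])
pipeline.emit(Span("checkout", 412.0, {"region": "eu-west"}))
```

The point of the design is that instrumentation happens once in the application, while the choice of backends stays a configuration decision. That separation is what reduces vendor lock-in.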

Centralized Observability Is on the Rise

With greater complexity and greater stakes, enterprises are shifting observability decisions to centralized teams. These groups, sometimes part of architecture, sometimes under operations, are tasked with standardizing vendors, enforcing best practices, and ensuring observability aligns with business needs.

This trend is a direct response to tool sprawl and rising costs. According to a recent Elastic survey, many organizations are actively consolidating their observability stacks to improve collaboration and reduce licensing and training expenses.

Centralized observability teams are also the ones most likely to invest in IPM, recognizing that the user's path through the Internet is as important as the path through the code. EMA research recently confirmed this, noting that "Internet Performance Monitoring tools have become just as important as application performance management, if not more so."

Real Results from Modern Observability

Enterprises that embrace this model (APM + IPM + OTel, led by a centralized team) are already seeing results. They include:

  • Faster time to resolution: By monitoring beyond the firewall, teams spot and diagnose issues more quickly.
  • Cost savings: Fewer tools, better data, less duplication.
  • Improved user experience: Outages that used to take hours to triage now take minutes to fix.
  • Greater alignment with business goals: IT teams can tie observability metrics to user impact and revenue risk.

By integrating Internet Performance Monitoring alongside APM, adopting OpenTelemetry for data consistency, and empowering centralized observability teams to lead the way, enterprises can close their performance blind spots and deliver better digital experiences faster and more reliably.

In 2025, observability isn't just about keeping the lights on. It's about creating resilience, reducing cost, and proving the value of IT across the business. And that starts with seeing the whole picture, inside and out.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint

