Observability Gaps Are Costing You. Here's How to Fix Them - Fast

Mehdi Daoudi
Catchpoint

It's 7 in the morning. You get an alert from your team: a critical service is down. Yet your monitoring systems show no critical alerts. Where is the problem? You consider calling a war room. It will be a massive distraction for the best people on your team, but what other option do you have?

Your next thought may be: why wasn't this caught by our APM tools? We spend a fortune on them. The answer is that these incidents are happening in places you cannot see. From global SaaS disruptions to regional ISP failures, to APIs your systems rely on, to cloud services spread across multiple availability zones, the Internet has become a critical extension of enterprise infrastructure. And yet many teams still rely on legacy observability strategies that were never built for Internet-centric dependencies. The result? Ongoing blind spots, impatient users, and rising operational costs.

And then there are the micro-outages your team might not even be aware of: regional incidents, small hiccups, and process failures that are likely happening right now, often undetected and unreported. Like paper cuts, they cut into user satisfaction, damage the business, and erode trust. In fact, our annual Internet Resilience Report found that, in 2025, one in eight businesses now lose over $10 million a month to disruptions, and half lose more than $1 million a month.

Thus, it's clear a new approach is needed — one that complements APM's detailed view into code, infrastructure, and events with a broad view of the internet stack and more user-centric monitoring. Let's break down how organizations are taking this approach to close the gaps and why the cost of ignoring them is only getting higher.

Why Observability Is Falling Short

Observability tools focus mainly on monitoring internal systems: servers, containers, and microservices, via metrics, events, logs, and traces (MELT). But the modern enterprise is no longer built solely on custom applications running on infrastructure it manages. Cloud apps, SaaS platforms, APIs, and third-party services are now integral to delivering digital experiences. And all of them rely on the health of the Internet: DNS, SSL, BGP, routing, ISPs, and more.

That's where APM alone starts to fail. APM tools were not built to monitor massively distributed, service-oriented, multi-party applications. They offer insufficient insight into the routes, external services, Internet protocols, and regional performance that determine whether users can access your app at all.

What really matters isn't backend-system health; it's real-world user experience. The customer waiting at the rental car counter doesn't care that your servers are humming along at 72% CPU utilization. They care that they need to get to a meeting and the person on the other side of the counter says, "Sorry, my computer is slow today." And if you can't tell whether the root cause is your code, your cloud provider, the local internet, DNS resolution times, API latency, or a BGP routing issue in some part of the world, you simply don't have the visibility you need.
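To make that distinction concrete, here is a minimal sketch of the kind of outside-in check that separates those failure modes. It is illustrative only (the hostname is a placeholder, not a prescribed tool): it times DNS resolution, TCP connect, and the TLS handshake separately, so a slow stage points at the responsible layer rather than at your application code.

```python
import socket
import ssl
import time

def probe(hostname: str, port: int = 443) -> dict:
    """Time each network stage separately so a failure can be attributed
    to DNS, the network path, or the TLS layer rather than to app code."""
    t0 = time.perf_counter()
    ip = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
    dns_ms = (time.perf_counter() - t0) * 1000  # DNS resolution time

    t1 = time.perf_counter()
    with socket.create_connection((ip, port), timeout=5) as sock:
        tcp_ms = (time.perf_counter() - t1) * 1000  # TCP connect time

        t2 = time.perf_counter()
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            tls_ms = (time.perf_counter() - t2) * 1000   # TLS handshake time
            expiry = tls.getpeercert()["notAfter"]       # certificate expiry

    return {"dns_ms": dns_ms, "tcp_ms": tcp_ms, "tls_ms": tls_ms,
            "cert_expires": expiry}

# Placeholder target; in practice such probes run from many global vantage points.
print(probe("example.com"))
```

A real IPM platform runs checks like this continuously from vantage points around the world; the point here is simply that the measurement happens outside your stack, where APM agents cannot see.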

APM + IPM = End-to-End Visibility

To solve this, forward-looking enterprises are closing the gap by complementing the visibility they get from APM tools with Internet Performance Monitoring (IPM). On one side, APM delivers the inside-out view: instrumentation, tracing, and system health. On the other, IPM offers the outside-in perspective: real user experience, the health of the global Internet, and proactive testing of everything that may impact a user, including first- and third-party dependencies — from APIs to cloud services to VPNs to database timeouts.
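As a sketch of what "proactive testing" of a third-party dependency can look like (the URL and latency budget below are illustrative assumptions, not a prescribed setup), a scheduled check like this exercises an external API the way a user-facing transaction would and records whether it met its budget:

```python
import time
import urllib.request

def check_dependency(url: str, budget_ms: float = 500.0) -> dict:
    """Hit an external dependency on a schedule and record status and latency,
    so a degrading third-party API is caught before users report it."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            resp.read()  # include body transfer in the measurement
            status = resp.status
    except Exception as exc:  # DNS failure, timeout, TLS error, HTTP 5xx, ...
        return {"url": url, "ok": False, "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"url": url, "ok": status == 200 and latency_ms <= budget_ms,
            "status": status, "latency_ms": round(latency_ms, 1)}

# Hypothetical third-party endpoint; real checks would run from multiple regions.
print(check_dependency("https://api.example.com/health"))
```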

Together, they provide true end-to-end observability, a model that is already proving invaluable for global enterprises like SAP, IKEA, and Akamai. APM tools paired with IPM deliver the unified view of performance that teams need, from the application code to the end user's screen, wherever in the world that user is.

With this approach, teams resolve issues faster and align more closely with business outcomes by making customer experience KPIs the primary objective of observability teams. For instance, they can measure the impact of outages on customer satisfaction and revenue, not just uptime and latency, as the sketch below illustrates.
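One simple way to express outage impact in revenue terms is a back-of-the-envelope model like the following; every input here is a hypothetical example value, not data from the report:

```python
def revenue_at_risk(outage_minutes: float, sessions_per_minute: float,
                    conversion_rate: float, avg_order_value: float) -> float:
    """Estimate revenue exposure: sessions that arrived during the outage,
    times how many would normally convert, times average order value."""
    blocked_sessions = outage_minutes * sessions_per_minute
    return blocked_sessions * conversion_rate * avg_order_value

# Hypothetical inputs: a 30-minute outage at 200 sessions/min,
# 2% conversion, $80 average order -> $9,600 at risk.
print(revenue_at_risk(30, 200, 0.02, 80.0))
```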

The Role of OpenTelemetry

If APM and IPM are the two sides of the observability coin, OpenTelemetry is the glue that binds them. OTel has emerged as the de facto standard for integrating monitoring data, including traces, logs, and metrics, from multiple components of an ecosystem. Its adoption is accelerating because it helps teams break vendor lock-in, standardize data collection, and reduce the cost of managing multiple tools.

In fact, many enterprises now treat OTel support as a prerequisite for any observability solution. The best outcomes happen when OpenTelemetry is part of a broader strategy that includes governance, platform selection, and integration with both APM and IPM tools.

For example, an OTel SDK in a native mobile application could feed telemetry to both APM and IPM systems, and both of those could in turn feed a central platform offering a unified dashboard, alerting, or AIOps capabilities. What is possible with OTel keeps growing and becoming more practical over time.
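As a minimal sketch of that fan-out pattern (the backend endpoints and service name below are hypothetical placeholders, and a production setup would more likely route through an OpenTelemetry Collector), the OTel Python SDK lets one TracerProvider export the same spans to two destinations:

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One TracerProvider, two exporters: the same spans fan out to an
# APM-oriented backend and an IPM-oriented backend in parallel.
provider = TracerProvider()
for endpoint in ("apm-backend.internal:4317",
                 "ipm-backend.internal:4317"):  # hypothetical hosts
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True))
    )
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("mobile-checkout")
with tracer.start_as_current_span("checkout"):
    pass  # application work; this span is delivered to both systems
```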

Centralized Observability Is on the Rise

With greater complexity and greater stakes, enterprises are shifting observability decisions to centralized teams. These groups, sometimes part of architecture, sometimes under operations, are tasked with standardizing vendors, enforcing best practices, and ensuring observability aligns with business needs.

This trend is a direct response to tool sprawl and rising costs. According to a recent Elastic survey, many organizations are actively consolidating their observability stacks to improve collaboration and reduce licensing and training expenses.

Centralized observability teams are also the ones most likely to invest in IPM, recognizing that the user's path through the Internet is as important as the path through the code. EMA research recently confirmed this, noting that "Internet Performance Monitoring tools have become just as important as application performance management, if not more so."

Real Results from Modern Observability

Enterprises that embrace this model (APM + IPM + OTel, led by a centralized team) are already seeing results, including:

  • Faster time to resolution: By monitoring beyond the firewall, teams spot and diagnose issues more quickly.
  • Cost savings: Fewer tools, better data, less duplication.
  • Improved user experience: Outages that used to take hours to triage now take minutes to fix.
  • Greater alignment with business goals: IT teams can tie observability metrics to user impact and revenue risk.

By integrating Internet Performance Monitoring alongside APM, adopting OpenTelemetry for data consistency, and empowering centralized observability teams to lead the way, enterprises can close their performance blind spots and deliver better digital experiences faster and more reliably.

In 2025, observability isn't just about keeping the lights on. It's about creating resilience, reducing cost, and proving the value of IT across the business. And that starts with seeing the whole picture, inside and out.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint
