
What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills.

The world is now all about end-users. That focus did not exist a few years ago, when backend metrics generally revolved around uptime, SLAs, latency, and the like. DevOps teams always pitched and presented the metrics they thought were the most correlated to the end-user experience. But let's be blunt: Unless there was an egregious fire, those correlations were loose at best and entirely false at worst.

Instead, your teams should prioritize alerts, monitoring, and work based on impact to the end-user, as it directly affects your business. And your developers and DevOps teams should collect data, monitor, prioritize, and resolve issues accordingly.

The agent-based RUM problem

"Agents" are a mechanism that does not work in the current end-user-centric world. They were born out of shoehorning the principles of the backend into mobile, web, and the myriad other ways users interact with the world.

Let's compare user environments and backend environments:

■ User environments are open, unstructured, and uncontrollable, as they are unowned devices and browsers with the central figure being an unpredictable user.
■ Backend environments are closed, structured, and controlled, as they are composed of relatively homogeneous physical and cloud applications.

With closed systems that have fewer external variables, agents focus on a known set of errors to monitor and to trigger data collection for resolution. However, monitoring systems outside of the backend is complex because there are many types of errors beyond crashes, error logs, network traces, and API errors.

In an observability world, real user monitoring is about collecting "all" the data for every session, good or bad, and not just a sampled set based on predefined error types. Only by collecting the entirety of every session can the best vendors analyze the data and provide the most value to your teams. These vendors have evolved beyond agents to surface every type of user-impacting issue, help resolve each one by comparing against good sessions, and prioritize overall impact across the complete set of issue types. For example, the same crash for two different users could have different root causes because of differences in environments, third-party SDKs, and API timeout parameters.

To drive the difference home, watch a developer outside of DevOps open a RUM dashboard from a vendor that uses the agent-based approach. The core dashboard will have the following:

■ A geographical map laying out the incidents
■ A generic list of error logs and crashes
■ Some sort of mapping of network errors
■ A single health score

The developer reviewing this dashboard will not come back to it regularly, or at all. And it's not hard to see why. The dashboard does not tell them which users are affected, where to prioritize their efforts, or the types of bugs and optimizations that they should care most about. It's not built for them, from the data collected to how that data is organized and displayed. There is a reason why these developers always implement and use other vendors, even for simple concepts like error logging and crashes, alongside those application performance monitoring vendors.

Let's dive into the core differences between these approaches and explore what a true real user monitoring methodology looks like. That way, you will know it when you see it and can create the best experience for your end-users as well as your developers and DevOps team.
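To make "which users are affected" and "where to prioritize" concrete, here is a minimal Python sketch that ranks issue types by the number of distinct users who hit them. It is an illustration only, not any vendor's implementation: the `Session` shape, the issue labels, and the function name `rank_issues_by_user_impact` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Session:
    """One captured user session (hypothetical shape, for illustration only)."""
    user_id: str
    issues: list[str] = field(default_factory=list)  # empty list = a good session


def rank_issues_by_user_impact(sessions: list[Session]) -> list[tuple[str, int, float]]:
    """Return (issue, affected_users, share_of_all_users), most impactful first."""
    all_users = {s.user_id for s in sessions}
    users_by_issue: dict[str, set[str]] = defaultdict(set)
    for s in sessions:
        for issue in s.issues:
            users_by_issue[issue].add(s.user_id)
    ranked = [
        (issue, len(users), len(users) / len(all_users))
        for issue, users in users_by_issue.items()
    ]
    # Sort by affected users (descending), then by issue name for a stable order.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Sample data (made up): one user crashes twice, two users see a slow startup,
# and one user has a good session with no issues.
sessions = [
    Session("u1", ["crash"]),
    Session("u1", ["crash"]),
    Session("u2", ["slow_startup"]),
    Session("u3", ["slow_startup"]),
    Session("u4", []),
]
for issue, users, share in rank_issues_by_user_impact(sessions):
    print(f"{issue}: {users} users ({share:.0%})")
```

In this made-up data the crash occurs twice but affects a single user, while the slow startup affects two users, so it ranks first. Counting distinct users rather than raw events is one simple way to express impact; a real system would likely weigh severity and business value as well.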

The spider web problem

To illustrate the core implication of an agent mentality, let's focus on the "spider webs." You know the ones I'm talking about. You've seen the cool demos with a picture connecting nodes across your systems to demonstrate "visibility" across all the apps running on your servers and machines. Everything is connected by an ever-expanding spider web of nodes and lines: every app, compute instance, API call, etc.

Oh, it's very pretty to see all the apps and API calls going to and from each other. It's also a nice source of confidence that the agents are collecting the data required to monitor, identify, and resolve potential issues.

However, the very nature of this spider web mental model is that it assumes all the issues occur on the lines between the nodes or on the nodes themselves:

■ An increase in network latency means you should look at the connected database, server, or service calls.
■ An increase in downtime means you should look at the connected servers to see if they're under heavy load.
■ An increase in transaction failures means you should look at the connected service calls for a point of failure.

The paradigm of agents is one of looking for a closed set of known symptoms for broken apps, failing processes, and poorly designed code. To help resolve these symptoms, the agents collect samples of app and process information, so that when an API throws an error or a process has downtime, the agent collects the corresponding data in reaction to the error.

And this approach works … on the backend, for a known set of errors, in a controlled environment, with little external pressure from the outside world.

But when applied to the client side of web and mobile, what happens when the complexity explodes? What happens when there are an infinite number of unknown pressures from the users, the devices, the operating systems, the app versions, the network conditions, and the other apps running? How do you truly understand your team's effectiveness when the biggest issues are not related to downtime or following individual service calls throughout a distributed system?
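The gap between reacting to a closed set of known errors and capturing every session can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular agent works: the `KNOWN_ERRORS` set, the event names, and the `completed_goal` flag (a stand-in for "was the user impacted?") are all hypothetical.

```python
from dataclasses import dataclass

# Closed set of symptoms an agent-style collector is configured to react to
# (hypothetical names).
KNOWN_ERRORS = {"crash", "api_error"}


@dataclass
class Session:
    session_id: str
    events: list[str]     # everything that happened in the session, in order
    completed_goal: bool  # assumed proxy: did the user finish what they came to do?


def error_triggered_capture(sessions):
    """Agent-style: keep a session only if it contains a known error."""
    return [s for s in sessions if KNOWN_ERRORS & set(s.events)]


def user_impacted_sessions(sessions):
    """Full capture: every session is available, so impact can be judged from
    the user's outcome rather than from a predefined error list."""
    return [s for s in sessions if not s.completed_goal]


# Sample data (made up).
sessions = [
    Session("s1", ["launch", "api_error", "retry", "purchase"], True),
    Session("s2", ["launch", "frozen_screen", "force_quit"], False),
    Session("s3", ["launch", "browse", "purchase"], True),
]

captured = {s.session_id for s in error_triggered_capture(sessions)}
impacted = {s.session_id for s in user_impacted_sessions(sessions)}
print("captured by known-error trigger:", sorted(captured))
print("user-impacted sessions:", sorted(impacted))
print("impacted but never captured:", sorted(impacted - captured))
```

In this sample, the known-error trigger captures `s1`, where the user recovered and completed a purchase, and never records `s2`, where the screen froze and the user force-quit. That is the failure mode the questions above point at: the issue that hurt the user was not in the predefined list, so no data was collected for it.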

The problem with uncontrolled environments

Uncontrolled environments are any digital experiences that run outside of data centers. Beyond just smartphones and web browsers, they include point-of-sale systems, VR and AR devices, tablets in the field, and smart cars. And the world is increasingly one of uncontrolled environments for business-critical touchpoints.

The most effective developer and DevOps teams monitor these client-side environments with early warning systems to determine when users are impacted so they can triage and resolve issues. They flip the traditional application monitoring paradigm.

■ Traditional application monitoring: Sample data by looking for a known set of errors, then gather context around them.
■ Modern application monitoring: Gather data without knowing its full value, correlate those data points to user impact from the end-user vantage point, then determine the error, measure the impact in order to prioritize it, and route it accordingly.

In order to collect, identify, and resolve errors correctly, DevOps teams must understand the challenges that come along with running apps in these types of uncontrolled environments. After all, the assumptions about where failure points can happen are vastly different.

Continue with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2

Eric Futoran is CEO of Embrace

