Skip to main content

What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills. Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Connectivity and API error assumptions

The best apps run without the assumption of constant network connectivity. They are running compiled code on a device out in the wild, not just a series of always connected services. When an API experiences an error or timeout, these apps should provide a seamless way to continue to operate. The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating. An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not at the threshold where such an issue should matter.

The "user is just a property" assumption

When understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hair problem, the first step is to look up the user. The goal is to see the behavioral and technical data in context to that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity). The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably: ■ Query the user ID and look for traces that contain it. ■ Search across sampled sets of data and hope that the transactions in question were collected. ■ Connect multiple data sources together to track what the user did. ■ Hit a deadend when the data is incomplete, sampled, non-existent … or spread across multiple vendors. The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.

Mobile device error assumptions

Android and iOS do not run like other operating systems. Failures can happen beyond just logs, crashes, and network errors. ■ There are freezes and application not responding errors (ANRs). ■ The UI can stutter from heavy paints and animations. ■ The app can be killed for exceeding memory limits. ■ The app can be killed in the background to free up resources for other apps. ■ The app can be killed by the operating system for taking too long to complete a task. ■ Users can rage tap and force quit apps that cause them frustration. ■ Low power mode can cause unanticipated app performance problems. Vendors must think broader about how to collect telemetry to drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.

Cardinality and data assumptions

While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge when working on the client-side, especially for mobile. Cardinality is an assumption that breaks the data collection approach of an agent. After all, every user is unique, and every experience is different. The same stack trace could stem from different root causes. A broken purchase may have different steps leading up to it, and the developer will need to see all these sets of clues to solve it. To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including: ■ Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deeplinks or push notifications? ■ Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.

RUM should look like in today's world

If developers knew to instrument every log, metric and trace, they would be godlike in their cognizant powers. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work efforts. What developers need is a user-focused approach to web and mobile telemetry. Look up any user Developers need the ability to look up any user and see how they experienced the app. This includes every session, including when the app is backgrounded. Apps don't stop running when backgrounded. Users expect to be able to jump back in immediately, so seamless experiences must persist from foreground to background. See every session Developers need access to every session, not just the ones an agent would collect. If you only collect data when an obligatory error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exceeding system resources. If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact. Surface the user impact Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they should spend time investigating traces through backend services to understand what caused the issues your business cares about. In other words, they shouldn't be starting from DevOps errors and just fix them as they come up. That way, the work is prioritized from a user impact perspective as opposed to just going off technical signals. Solve with complete data The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, limited data types are collected — often not even metrics, logs, traces and events. Teams are not able to get the full picture of the user experience without lots of manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, you can instantly reproduce every issue by seeing exactly what caused it.

Closing thoughts

When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment. The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, connectivities, user journeys, and more. The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.

Eric Futoran is CEO of Embrace

The Latest

I've spent a lot of time in the channel, and one thing I keep coming back to is this: a partner program is only as good as what it looks like in the field. Many programs look great on paper, but when a partner is in front of a customer navigating a complex hybrid environment or trying to make the case for AI-powered observability, the gap between what a vendor promises and what it actually delivers becomes very clear, very fast ...

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...

What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills. Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Connectivity and API error assumptions

The best apps run without the assumption of constant network connectivity. They are running compiled code on a device out in the wild, not just a series of always connected services. When an API experiences an error or timeout, these apps should provide a seamless way to continue to operate. The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating. An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not at the threshold where such an issue should matter.

The "user is just a property" assumption

When understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hair problem, the first step is to look up the user. The goal is to see the behavioral and technical data in context to that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity). The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably: ■ Query the user ID and look for traces that contain it. ■ Search across sampled sets of data and hope that the transactions in question were collected. ■ Connect multiple data sources together to track what the user did. ■ Hit a deadend when the data is incomplete, sampled, non-existent … or spread across multiple vendors. The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.

Mobile device error assumptions

Android and iOS do not run like other operating systems. Failures can happen beyond just logs, crashes, and network errors. ■ There are freezes and application not responding errors (ANRs). ■ The UI can stutter from heavy paints and animations. ■ The app can be killed for exceeding memory limits. ■ The app can be killed in the background to free up resources for other apps. ■ The app can be killed by the operating system for taking too long to complete a task. ■ Users can rage tap and force quit apps that cause them frustration. ■ Low power mode can cause unanticipated app performance problems. Vendors must think broader about how to collect telemetry to drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.

Cardinality and data assumptions

While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge when working on the client-side, especially for mobile. Cardinality is an assumption that breaks the data collection approach of an agent. After all, every user is unique, and every experience is different. The same stack trace could stem from different root causes. A broken purchase may have different steps leading up to it, and the developer will need to see all these sets of clues to solve it. To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including: ■ Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deeplinks or push notifications? ■ Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.

RUM should look like in today's world

If developers knew to instrument every log, metric and trace, they would be godlike in their cognizant powers. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work efforts. What developers need is a user-focused approach to web and mobile telemetry. Look up any user Developers need the ability to look up any user and see how they experienced the app. This includes every session, including when the app is backgrounded. Apps don't stop running when backgrounded. Users expect to be able to jump back in immediately, so seamless experiences must persist from foreground to background. See every session Developers need access to every session, not just the ones an agent would collect. If you only collect data when an obligatory error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exceeding system resources. If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact. Surface the user impact Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they should spend time investigating traces through backend services to understand what caused the issues your business cares about. In other words, they shouldn't be starting from DevOps errors and just fix them as they come up. That way, the work is prioritized from a user impact perspective as opposed to just going off technical signals. Solve with complete data The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, limited data types are collected — often not even metrics, logs, traces and events. Teams are not able to get the full picture of the user experience without lots of manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, you can instantly reproduce every issue by seeing exactly what caused it.

Closing thoughts

When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment. The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, connectivities, user journeys, and more. The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.

Eric Futoran is CEO of Embrace

The Latest

I've spent a lot of time in the channel, and one thing I keep coming back to is this: a partner program is only as good as what it looks like in the field. Many programs look great on paper, but when a partner is in front of a customer navigating a complex hybrid environment or trying to make the case for AI-powered observability, the gap between what a vendor promises and what it actually delivers becomes very clear, very fast ...

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...