
What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills.

Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Connectivity and API error assumptions

The best apps run without the assumption of constant network connectivity. They are running compiled code on a device out in the wild, not just a series of always connected services. When an API experiences an error or timeout, these apps should provide a seamless way to continue to operate.

The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating.

An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not at the threshold where such an issue should matter.
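Both scenarios can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `ApiCall` record, its `flow_completed` field, and the `classify` function are hypothetical names, and the sketch assumes the client can tell whether the user's flow (for example, a purchase or sign-on) actually finished. The point is that a call is judged by its effect on the user rather than by status code alone.

```python
from dataclasses import dataclass


@dataclass
class ApiCall:
    endpoint: str
    status: int           # HTTP status code returned to the client
    flow_completed: bool  # hypothetical: did the user's flow still finish?


def classify(call: ApiCall) -> str:
    """Label a call by user impact instead of by status code alone."""
    is_http_error = call.status >= 400
    if is_http_error and call.flow_completed:
        # The backend sees an error spike, but the user never noticed.
        return "error_no_user_impact"
    if is_http_error:
        return "error_user_impact"
    if not call.flow_completed:
        # No 400 or 500 was thrown, yet the user was blocked.
        return "silent_user_impact"
    return "ok"


calls = [
    ApiCall("/purchase", 500, True),   # purchase completed despite the error
    ApiCall("/sign-on", 200, False),   # failed sign-on with no error status
    ApiCall("/feed", 200, True),
]
labels = [classify(c) for c in calls]
```

In this sketch, only the second and third labels would ever reach a developer's queue as user-facing work; the first is noise from the user's point of view.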

The "user is just a property" assumption

When understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hairy problem, the first step is to look up the user.

The goal is to see the behavioral and technical data in the context of that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity).

The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably:

■ Query the user ID and look for traces that contain it.

■ Search across sampled sets of data and hope that the transactions in question were collected.

■ Connect multiple data sources together to track what the user did.

■ Hit a dead end when the data is incomplete, sampled, non-existent … or spread across multiple vendors.

The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.
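By contrast, a user-centric view starts from the user and reconstructs each session in order. The following is a rough sketch under stated assumptions: the `Event` record and the `sessions_for_user` function are hypothetical, and it presumes every event was collected (unsampled) and tagged with a user ID and session ID on the client.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str
    session_id: str
    timestamp: float  # seconds since epoch
    kind: str         # e.g., "tap", "network_error", "anr", "background"


def sessions_for_user(events, user_id):
    """Return {session_id: [events in time order]} for a single user."""
    sessions = defaultdict(list)
    for event in events:
        if event.user_id == user_id:
            sessions[event.session_id].append(event)
    for session_events in sessions.values():
        session_events.sort(key=lambda e: e.timestamp)
    return dict(sessions)


events = [
    Event("u1", "s1", 3.0, "network_error"),
    Event("u2", "s9", 1.0, "tap"),
    Event("u1", "s1", 1.0, "tap"),
    Event("u1", "s2", 5.0, "anr"),
]
timeline = sessions_for_user(events, "u1")
```

Here the lookup is a single pivot on the user rather than a search for traces that happen to carry the user ID, which is why the data has to be complete for it to work.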

Mobile device error assumptions

Android and iOS do not run like other operating systems. Failures can happen beyond just logs, crashes, and network errors.

■ There are freezes and application not responding errors (ANRs).

■ The UI can stutter from heavy paints and animations.

■ The app can be killed for exceeding memory limits.

■ The app can be killed in the background to free up resources for other apps.

■ The app can be killed by the operating system for taking too long to complete a task.

■ Users can rage tap and force quit apps that cause them frustration.

■ Low power mode can cause unanticipated app performance problems.

Vendors must think more broadly about how to collect telemetry so that teams can drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.

Cardinality and data assumptions

While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge when working on the client-side, especially for mobile.

Cardinality breaks the data collection approach of an agent. After all, every user is unique, and every experience is different. The same stack trace could stem from different root causes. A broken purchase may have different steps leading up to it, and the developer will need to see all of these clues to solve it.

To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including:

Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deeplinks or push notifications?

Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.

What RUM should look like in today's world

If developers knew in advance which logs, metrics, and traces to instrument, they would be godlike in their foresight. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work.

What developers need is a user-focused approach to web and mobile telemetry.

Look up any user

Developers need the ability to look up any user and see how they experienced the app. This includes every session, including when the app is backgrounded. Apps don't stop running when backgrounded. Users expect to be able to jump back in immediately, so seamless experiences must persist from foreground to background.

See every session

Developers need access to every session, not just the ones an agent would collect. If you only collect data when an error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exceeding system resources.

If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact.

Surface the user impact

Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they should spend time investigating traces through backend services to understand what caused the issues your business cares about.

In other words, they shouldn't start from DevOps errors and just fix them as they come up. Starting from the user means the work is prioritized by user impact rather than by technical signals alone.
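One way to picture this prioritization is to rank issues by the number of distinct users affected rather than by raw error count. The sketch below is illustrative only: `rank_by_user_impact` and the issue names are hypothetical, and it assumes each occurrence of an issue is recorded together with the ID of the user who experienced it.

```python
from collections import defaultdict


def rank_by_user_impact(occurrences):
    """Rank issues by how many distinct users they affected.

    occurrences: iterable of (issue_id, user_id) pairs.
    Returns a list of (issue_id, distinct_user_count), most users first.
    """
    users_by_issue = defaultdict(set)
    for issue_id, user_id in occurrences:
        users_by_issue[issue_id].add(user_id)
    return sorted(
        ((issue, len(users)) for issue, users in users_by_issue.items()),
        key=lambda pair: (-pair[1], pair[0]),  # ties broken by issue name
    )


occurrences = [
    # Three errors, but all from the same user.
    ("checkout_500", "u1"), ("checkout_500", "u1"), ("checkout_500", "u1"),
    # Three occurrences spread across three different users.
    ("login_freeze", "u1"), ("login_freeze", "u2"), ("login_freeze", "u3"),
]
ranking = rank_by_user_impact(occurrences)
```

Both issues occur three times, so a raw error count would treat them as equal; counting distinct users puts the freeze first.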

Solve with complete data

The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, only limited data types are collected, often not even metrics, logs, traces, and events. Teams are not able to get the full picture of the user experience without lots of manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, they can reproduce an issue by seeing exactly what led up to it.

Closing thoughts

When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment.

The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, network conditions, user journeys, and more.

The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.

Eric Futoran is CEO of Embrace
