
What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If a vendor pitches you on installing an "agent" in your mobile or web environments, you should run for the hills.

Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Connectivity and API error assumptions

The best apps run without assuming constant network connectivity. They run compiled code on a device out in the wild, not just a series of always-connected services. When an API call errors or times out, these apps should continue to operate seamlessly.

The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating.

An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not often enough to cross the alerting threshold where such an issue would matter.
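
To make this concrete, here is a minimal sketch of that kind of graceful degradation on the client: a failed receipt upload is queued for retry instead of surfacing as a user-facing error, so a backend 4xx/5xx never breaks the purchase flow. The types and function names are hypothetical, purely for illustration.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical result type for a network call; names are illustrative only.
sealed class ApiResult {
    data class Success(val body: String) : ApiResult()
    data class Failure(val httpCode: Int) : ApiResult()
}

// Pending work we can replay later when connectivity or the backend recovers.
data class PendingReceipt(val orderId: String, val payload: String)

class ReceiptUploader(private val send: (PendingReceipt) -> ApiResult) {
    private val retryQueue = ConcurrentLinkedQueue<PendingReceipt>()

    // The purchase itself already succeeded on-device; an error from the
    // receipt endpoint should not surface as a user-facing failure.
    fun confirmPurchase(receipt: PendingReceipt): Boolean {
        return when (send(receipt)) {
            is ApiResult.Success -> true
            is ApiResult.Failure -> {
                retryQueue.add(receipt)   // degrade gracefully: retry later
                true                      // user flow continues uninterrupted
            }
        }
    }

    // Called when the network comes back or on a periodic schedule.
    fun flush() {
        while (true) {
            val next = retryQueue.peek() ?: return
            if (send(next) is ApiResult.Success) retryQueue.poll() else return
        }
    }
}
```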

The "user is just a property" assumption

When it comes to understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hairy problem, the first step is to look up the user.

The goal is to see the behavioral and technical data in the context of that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity).

The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably:

■ Query the user ID and look for traces that contain it.

■ Search across sampled sets of data and hope that the transactions in question were collected.

■ Connect multiple data sources together to track what the user did.

■ Hit a dead end when the data is incomplete, sampled, non-existent … or spread across multiple vendors.

The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.
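
For contrast, here is a minimal sketch of what "the user as a property" looks like in practice, assuming the OpenTelemetry Java API used from Kotlin (the endpoint and attribute names are illustrative). Finding everything a specific user experienced means searching every trace for this one attribute and hoping the relevant spans were sampled and retained.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry

// In an agent-centric model, the user is just another attribute on a span.
fun recordCheckout(userId: String, cartSize: Int) {
    val tracer = GlobalOpenTelemetry.getTracer("checkout-service")
    val span = tracer.spanBuilder("POST /checkout")
        .setAttribute("enduser.id", userId)           // the user, reduced to a tag
        .setAttribute("cart.size", cartSize.toLong())
        .startSpan()
    try {
        // ... handle the request ...
    } finally {
        span.end()
    }
}
```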

Mobile device error assumptions

Android and iOS do not run like other operating systems. Failures happen in ways that go beyond logs, crashes, and network errors:

■ There are freezes and application not responding (ANR) errors.

■ The UI can stutter from heavy paints and animations.

■ The app can be killed for exceeding memory limits.

■ The app can be killed in the background to free up resources for other apps.

■ The app can be killed by the operating system for taking too long to complete a task.

■ Users can rage tap and force quit apps that cause them frustration.

■ Low power mode can cause unanticipated app performance problems.

Vendors must think more broadly about how to collect telemetry that lets teams drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.
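
As one illustration (a sketch, not any vendor's implementation), Android's ApplicationExitInfo API on API level 30+ reports why the operating system terminated a process, surfacing several of the failure modes above that logs and crash reporters alone would miss:

```kotlin
import android.app.ActivityManager
import android.app.ApplicationExitInfo
import android.content.Context
import android.os.Build

// Minimal sketch: summarize recent process exits, including ANRs, low-memory
// kills, and force quits, none of which show up as an in-app crash.
fun collectExitReasons(context: Context): List<String> {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.R) return emptyList()
    val am = context.getSystemService(ActivityManager::class.java)
    // null package + pid 0 + max 0 => all recent exits for this app
    return am.getHistoricalProcessExitReasons(null, 0, 0).map { exit ->
        val label = when (exit.reason) {
            ApplicationExitInfo.REASON_ANR -> "ANR (app not responding)"
            ApplicationExitInfo.REASON_LOW_MEMORY -> "Killed: low memory"
            ApplicationExitInfo.REASON_EXCESSIVE_RESOURCE_USAGE -> "Killed: excessive resource usage"
            ApplicationExitInfo.REASON_USER_REQUESTED -> "Force quit by the user"
            ApplicationExitInfo.REASON_CRASH,
            ApplicationExitInfo.REASON_CRASH_NATIVE -> "Crash"
            else -> "Other exit (reason=${exit.reason})"
        }
        "$label at ${exit.timestamp}"
    }
}
```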

Cardinality and data assumptions

While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge on the client side, especially for mobile.

Cardinality is where the agent's data collection assumptions break down. After all, every user is unique, and every experience is different. The same stack trace can stem from different root causes. A broken purchase may have different steps leading up to it, and the developer needs to see all of these clues to solve it.

To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including:

Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deep links or push notifications?

Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.
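
A minimal sketch of a session-level record that carries both kinds of data points might look like the following; the data model here is hypothetical, not any particular SDK's schema:

```kotlin
import java.time.Instant

// Hypothetical session model: every session carries behavioral breadcrumbs
// (what the user did) alongside technical snapshots (what the device and
// network were doing), so one record answers both kinds of questions.
data class Breadcrumb(val at: Instant, val action: String)  // e.g. "tapped Buy", "deep link: /cart"
data class DeviceSnapshot(val at: Instant, val batteryPct: Int, val memUsedMb: Int, val network: String)
data class NetworkCall(val at: Instant, val url: String, val status: Int, val durationMs: Long)

data class UserSession(
    val userId: String,
    val sessionId: String,
    val breadcrumbs: MutableList<Breadcrumb> = mutableListOf(),
    val snapshots: MutableList<DeviceSnapshot> = mutableListOf(),
    val networkCalls: MutableList<NetworkCall> = mutableListOf(),
)

fun UserSession.recordAction(action: String) =
    breadcrumbs.add(Breadcrumb(Instant.now(), action))

fun UserSession.recordCall(url: String, status: Int, durationMs: Long) =
    networkCalls.add(NetworkCall(Instant.now(), url, status, durationMs))
```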

What RUM should look like in today's world

If developers knew to instrument every log, metric, and trace, they would be godlike in their powers of observation. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work efforts.

What developers need is a user-focused approach to web and mobile telemetry.

Look up any user

Developers need the ability to look up any user and see how they experienced the app. That means every session, including those where the app is backgrounded. Apps don't stop running when backgrounded, and users expect to jump back in immediately, so seamless experiences must persist from foreground to background.
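
One way to capture that foreground/background boundary on Android is with the AndroidX process lifecycle, sketched below; the session-handling callback is hypothetical:

```kotlin
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner
import androidx.lifecycle.ProcessLifecycleOwner

// Sketch: mark the current session as backgrounded/foregrounded instead of
// ending it, so later actions still attach to the same user journey.
class SessionLifecycleObserver(private val onStateChange: (inForeground: Boolean) -> Unit) :
    DefaultLifecycleObserver {

    override fun onStart(owner: LifecycleOwner) = onStateChange(true)   // app enters foreground
    override fun onStop(owner: LifecycleOwner) = onStateChange(false)   // app enters background

    companion object {
        // Call from Application.onCreate(), on the main thread.
        fun install(onStateChange: (Boolean) -> Unit) {
            ProcessLifecycleOwner.get().lifecycle
                .addObserver(SessionLifecycleObserver(onStateChange))
        }
    }
}
```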

See every session

Developers need access to every session, not just the ones an agent would collect. If you only collect data when an error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exhausted system resources.

If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact.

Surface the user impact

Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they can invest time in tracing through backend services to understand what caused the issues the business actually cares about.

In other words, they shouldn't start from DevOps errors and simply fix them as they come up. That way, work is prioritized by user impact rather than by technical signals alone.
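
A toy sketch of what impact-based prioritization means in practice: group issues by fingerprint and rank them by how many distinct users they affect rather than by raw event counts. The types here are hypothetical.

```kotlin
// Hypothetical record: one user-impacting issue observed in one session.
data class IssueEvent(val issueFingerprint: String, val userId: String)

// Rank issues by distinct users affected, not by how many times they fired.
fun prioritizeByUserImpact(events: List<IssueEvent>): List<Pair<String, Int>> =
    events.groupBy { it.issueFingerprint }
        .mapValues { (_, hits) -> hits.map { it.userId }.distinct().size }
        .toList()
        .sortedByDescending { it.second }
```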

Solve with complete data

The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, only limited data types are collected, often not even metrics, logs, traces, and events. Teams cannot get the full picture of the user experience without significant manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, you can instantly reproduce every issue by seeing exactly what caused it.

Closing thoughts

When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment.

The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, connectivity conditions, user journeys, and more.

The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.

Eric Futoran is CEO of Embrace
