
Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills.
Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1
Connectivity and API error assumptions
The best apps run without the assumption of constant network connectivity. They are running compiled code on a device out in the wild, not just a series of always connected services. When an API experiences an error or timeout, these apps should provide a seamless way to continue to operate.
The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating.
An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not at the threshold where such an issue should matter.
The "user is just a property" assumption
When understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hair problem, the first step is to look up the user.
The goal is to see the behavioral and technical data in context to that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity).
The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably:
■ Query the user ID and look for traces that contain it.
■ Search across sampled sets of data and hope that the transactions in question were collected.
■ Connect multiple data sources together to track what the user did.
■ Hit a deadend when the data is incomplete, sampled, non-existent … or spread across multiple vendors.
The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.
Mobile device error assumptions
Android and iOS do not run like other operating systems. Failures can happen beyond just logs, crashes, and network errors.
■ There are freezes and application not responding errors (ANRs).
■ The UI can stutter from heavy paints and animations.
■ The app can be killed for exceeding memory limits.
■ The app can be killed in the background to free up resources for other apps.
■ The app can be killed by the operating system for taking too long to complete a task.
■ Users can rage tap and force quit apps that cause them frustration.
■ Low power mode can cause unanticipated app performance problems.
Vendors must think broader about how to collect telemetry to drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.
Cardinality and data assumptions
While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge when working on the client-side, especially for mobile.
Cardinality is an assumption that breaks the data collection approach of an agent. After all, every user is unique, and every experience is different. The same stack trace could stem from different root causes. A broken purchase may have different steps leading up to it, and the developer will need to see all these sets of clues to solve it.
To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including:
■ Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deeplinks or push notifications?
■ Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.
RUM should look like in today's world
If developers knew to instrument every log, metric and trace, they would be godlike in their cognizant powers. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work efforts.
What developers need is a user-focused approach to web and mobile telemetry.
Look up any user
Developers need the ability to look up any user and see how they experienced the app. This includes every session, including when the app is backgrounded. Apps don't stop running when backgrounded. Users expect to be able to jump back in immediately, so seamless experiences must persist from foreground to background.
See every session
Developers need access to every session, not just the ones an agent would collect. If you only collect data when an obligatory error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exceeding system resources.
If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact.
Surface the user impact
Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they should spend time investigating traces through backend services to understand what caused the issues your business cares about.
In other words, they shouldn't be starting from DevOps errors and just fix them as they come up. That way, the work is prioritized from a user impact perspective as opposed to just going off technical signals.
Solve with complete data
The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, limited data types are collected — often not even metrics, logs, traces and events. Teams are not able to get the full picture of the user experience without lots of manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, you can instantly reproduce every issue by seeing exactly what caused it.
Closing thoughts
When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment.
The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, connectivities, user journeys, and more.
The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.