What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 2
March 27, 2024

Eric Futoran
Embrace

Agent-based approaches to real user monitoring (RUM) simply do not work. If you are pitched to install an "agent" in your mobile or web environments, you should run for the hills.

Start with: What Is Real User Monitoring in an Observability World? It Is Not APM "Agents" - Part 1

Connectivity and API error assumptions

The best apps run without the assumption of constant network connectivity. They are running compiled code on a device out in the wild, not just a series of always connected services. When an API experiences an error or timeout, these apps should provide a seamless way to continue to operate.

The agent methodology assumes all API errors are created equal. Backend teams thus use network calls as a proxy for issues. However, a backend error, like a 400 or 500, may never affect the user. For example, a purchase may complete even if the API used for the purchase flow throws an error. But if DevOps sees that spike in errors, they will raise the alarm. Developers then scramble to figure them out. This can easily result in a huge waste of time and effort, which is incredibly frustrating.

An even worse scenario is when developers see an issue with an API call, like a failed sign-on, and the backend team is never alerted because it didn't throw a 400 or 500. Or it did, but not at the threshold where such an issue should matter.
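
The gap between what the backend sees and what the user feels can be made concrete with a small sketch. It assumes a hypothetical client-side record, `ApiCall`, that pairs each response status with whether the user's flow still completed. The record, its field names, and the endpoints are illustrative assumptions, not part of any real SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """Hypothetical client-side record of one network call."""
    endpoint: str
    status: int           # HTTP status code returned to the client
    flow_completed: bool  # did the user still finish what they were doing?


def backend_errors(calls):
    """What a status-code alert sees: every 4xx/5xx response."""
    return [c for c in calls if c.status >= 400]


def user_impacting(calls):
    """What the user felt: flows that did not complete, whatever the status."""
    return [c for c in calls if not c.flow_completed]


calls = [
    ApiCall("/purchase", 500, True),   # error response, purchase still completed
    ApiCall("/purchase", 200, True),
    ApiCall("/sign-on", 200, False),   # no error status, user still locked out
]

print([c.endpoint for c in backend_errors(calls)])   # ['/purchase']
print([c.endpoint for c in user_impacting(calls)])   # ['/sign-on']
```

In this toy data the two views flag entirely different calls: the status-code view raises the alarm on a purchase that succeeded and stays silent on the failed sign-on.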

The "user is just a property" assumption

When understanding end-user experiences, the user should be the focus, not just a property on an API call. In other words, the user is the primary pivot. When a user complains, the CEO discovers an error in their own apps, or a developer is debugging a hairy problem, the first step is to look up the user.

The goal is to see the behavioral and technical data in the context of that user and identify the cause of the issue (e.g., environment, user behavior, code, third-party SDK, network connectivity).

The backend does not provide this user-centric view. Ask DevOps to track down a problem that a specific user is having, and they will probably:

■ Query the user ID and look for traces that contain it.

■ Search across sampled sets of data and hope that the transactions in question were collected.

■ Connect multiple data sources together to track what the user did.

■ Hit a dead end when the data is incomplete, sampled, non-existent … or spread across multiple vendors.

The user experience is not just a series of transactions, so this will always result in toil, guesswork, and frustration.

Mobile device error assumptions

Android and iOS do not run like other operating systems. Failures can happen beyond just logs, crashes, and network errors.

■ There are freezes and application not responding errors (ANRs).

■ The UI can stutter from heavy paints and animations.

■ The app can be killed for exceeding memory limits.

■ The app can be killed in the background to free up resources for other apps.

■ The app can be killed by the operating system for taking too long to complete a task.

■ Users can rage tap and force quit apps that cause them frustration.

■ Low power mode can cause unanticipated app performance problems.

Vendors must think more broadly about how to collect telemetry to drill down into what the user experienced. In addition, the signals that represent poor user experiences must be expanded; otherwise, problems go undetected, which undermines team effectiveness and efficiency.

Cardinality and data assumptions

While dealing with cardinality on the backend is no joke — otherwise vendors would not charge more for collecting additional properties — it is also a challenge when working on the client-side, especially for mobile.

High cardinality breaks the data collection approach of an agent. After all, every user is unique, and every experience is different. The same stack trace could stem from different root causes. A broken purchase may have different steps leading up to it, and the developer will need to see all these sets of clues to solve it.

To find any error that ultimately impacts the user, and, more importantly, prioritize that error based on how many other users are affected, every user experience must be collected, including:

Behavioral data points: You must know what actions the user took, including the non-linear paths they took through the app. What happens when users background the app and return much later to complete key actions? What happens when users launch the app through deeplinks or push notifications?

Technical data points: You must know what code is running (e.g., first-party versus third-party SDKs), what the device state is (e.g., low battery, high memory usage, high CPU usage), what network calls are firing, and what other apps are running. You also need to collect metrics, logs, traces, and events for analysis.

What RUM should look like in today's world

If developers knew in advance to instrument every log, metric and trace, they would be godlike in their cognitive powers. However, there are just too many variables to rely on these three pillars alone to identify and prioritize work efforts.

What developers need is a user-focused approach to web and mobile telemetry.

Look up any user

Developers need the ability to look up any user and see how they experienced the app. This includes every session, including when the app is backgrounded. Apps don't stop running when backgrounded. Users expect to be able to jump back in immediately, so seamless experiences must persist from foreground to background.
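
As a minimal sketch of what such a lookup could return, the example below groups one user's events into time-ordered sessions, with background transitions kept in the timeline. The `Event` record, its fields, and the event kinds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical telemetry event; all field names are illustrative."""
    user_id: str
    session_id: str
    timestamp: float   # seconds since some reference point
    kind: str          # e.g. "foreground", "background", "tap", "network_error"


def sessions_for_user(events, user_id):
    """Group one user's events by session, each session ordered by time."""
    by_session = defaultdict(list)
    for event in events:
        if event.user_id == user_id:
            by_session[event.session_id].append(event)
    return {
        sid: sorted(evts, key=lambda e: e.timestamp)
        for sid, evts in by_session.items()
    }


events = [
    Event("u1", "s1", 30.0, "background"),
    Event("u1", "s1", 10.0, "foreground"),
    Event("u2", "s9", 12.0, "tap"),
    Event("u1", "s2", 95.0, "network_error"),
]

timeline = sessions_for_user(events, "u1")
print([e.kind for e in timeline["s1"]])  # ['foreground', 'background']
```

The lookup only works if every session was collected in the first place; with sampled data, some of the sessions would simply be missing from the result.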

See every session

Developers need access to every session, not just the ones an agent would collect. If you only collect data when an obligatory error or crash happens, you'll miss many sources of poor user experience, like failing network calls, connectivity switches, freezes, and exceeding system resources.

If your investigation starts by understanding what happened to the user, then you're not wasting time looking at backend errors that might have no user impact.

Surface the user impact

Once a true user-impacting issue is identified, developers and DevOps can zoom out to see how many other users are impacted. Then, if applicable, they should spend time investigating traces through backend services to understand what caused the issues your business cares about.

In other words, they shouldn't start from DevOps errors and just fix them as they come up. Starting from the user means the work is prioritized from a user impact perspective as opposed to just going off technical signals.
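
One way to picture this kind of prioritization is to rank issues by the number of distinct users affected rather than by raw error volume. The sketch below assumes issue occurrences arrive as `(issue_name, user_id)` pairs; the issue names and user IDs are made up for the example.

```python
from collections import defaultdict


def rank_by_user_impact(occurrences):
    """Rank issues by distinct affected users, not by raw event count.

    `occurrences` is an iterable of (issue_name, user_id) pairs; both are
    hypothetical labels for this sketch.
    """
    users_per_issue = defaultdict(set)
    for issue, user_id in occurrences:
        users_per_issue[issue].add(user_id)
    # Most affected users first; ties broken by name for a stable order.
    return sorted(
        ((issue, len(users)) for issue, users in users_per_issue.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


occurrences = [
    ("checkout_timeout", "u1"),
    ("checkout_timeout", "u1"),
    ("checkout_timeout", "u1"),
    ("checkout_timeout", "u1"),
    ("login_freeze", "u2"),
    ("login_freeze", "u3"),
    ("login_freeze", "u4"),
]

print(rank_by_user_impact(occurrences))
# [('login_freeze', 3), ('checkout_timeout', 1)]
```

Here the timeout fires more often but hits a single user, so the freeze that affects three users ranks first.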

Solve with complete data

The final step is to eliminate guesswork when it comes to solving issues. With an agent approach to RUM, limited data types are collected, often not even metrics, logs, traces and events. Teams are not able to get the full picture of the user experience without lots of manual effort. Instead, developers must have a complete picture of the technical and behavioral details of every user experience. That way, they can instantly reproduce every issue by seeing exactly what caused it.

Closing thoughts

When a vendor calls their RUM integration an "agent," everyone should squirm and run. An agent is simply a broken methodology in a web or mobile environment.

The spider web approach to telemetry works for monitoring backend systems with homogeneous environments, constant connectivity, and known error types. However, client-side applications have endless variables across devices, operating systems, connectivities, user journeys, and more.

The RUM methodology must view the user as the central pivot. That way, both developers and DevOps can work hand-in-hand to get the complete picture of the health of their respective systems, and then work most effectively to move the business forward.

Eric Futoran is CEO of Embrace