Measure What You Need and No More
March 24, 2014

Tom Fleck
OC Systems

Projects collect lots of metrics that they do not need. All on this forum would agree that measurement is critical. But not all metrics are useful, and too many metrics can be confusing and obscure what's important.

Furthermore, measuring takes time and storage away from doing. As computers get faster, storage gets cheaper, metrics and logging frameworks come built in, and data analysis and display become more powerful, the temptation grows to collect everything, just in case you need it.

Here are some observations on why we collect too many metrics, and how we can avoid it.

1. If your job is collecting data, collecting more data makes you look more productive

Collecting metrics is a means to an end, not an end in itself. If you don't get paid unless you find more numbers to squeeze from an application, then your organization needs some adjustment.

Depending on your level in the organization, your job should be one of the following:

- Ask a question that a metric could answer

- Decide what metric answers a question

- Implement the collection of a requested metric

- Answer a question using the collected metric values

The end goal of metrics is either to identify a problem or to fix one.

2. Sometimes you can see "anomalies" looking at other metrics you might not think relevant

This is actually the most compelling argument for collecting a lot of metrics. But this should be done by choice, in a purposeful way, in a non-production but realistically loaded environment, and the result should be analyzed by somebody with the time and qualifications to judge the value of these metrics. Just turning on all the metrics all the time, and hoping the bug will jump out at you when you need it, is not an engineering approach.

3. It's easier to browse existing metrics than to figure out how to enable a new metric

It shouldn't be, especially if it's one of the many that you would have been collecting already. Good tools and infrastructure should make the mechanics easy, and their use is something your developers and operations people should know: How do I enable/disable specific metrics and adjust their collection frequency and persistence? Whether it's one app-server's JMX metrics or your external network bandwidth, somebody around there should know the points at which metrics are collected, how these are configured, and where the results go. If not, then that's a problem to address.

When the person who knows is explicitly asked to look at the metrics being collected, chances are they'll see some that are not used or useful. Or they might see metrics or logging that are not enabled but would have been useful in the past, and that's even better. Either way, your application's implementation and documentation should be required to make metrics collection easy to control.
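As a rough illustration of what "easy to control" could look like, here is a minimal sketch of a single place where each metric's state, collection frequency, persistence and destination can be inspected and changed. Every name in it (MetricConfig, MetricControl, the field names and the metric names in the usage note) is hypothetical and does not correspond to JMX or any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetricConfig:
    """Collection settings for one metric (hypothetical field names)."""
    enabled: bool = False
    interval_seconds: int = 60       # collection frequency
    retention_days: int = 7          # persistence
    destination: str = "local-log"   # where the results go


class MetricControl:
    """One documented place to see and change what is being collected."""

    def __init__(self) -> None:
        self._configs: Dict[str, MetricConfig] = {}

    def register(self, name: str, config: Optional[MetricConfig] = None) -> None:
        # New metrics start disabled unless a config says otherwise.
        self._configs[name] = config if config is not None else MetricConfig()

    def enable(self, name: str, interval_seconds: Optional[int] = None,
               retention_days: Optional[int] = None) -> None:
        cfg = self._configs[name]
        cfg.enabled = True
        if interval_seconds is not None:
            cfg.interval_seconds = interval_seconds
        if retention_days is not None:
            cfg.retention_days = retention_days

    def disable(self, name: str) -> None:
        self._configs[name].enabled = False

    def describe(self, name: str) -> MetricConfig:
        return self._configs[name]

    def enabled_metrics(self) -> List[str]:
        return sorted(n for n, c in self._configs.items() if c.enabled)
```

With something like this in place, answering "what are we collecting right now?" is one call to enabled_metrics(), and turning on a metric such as a made-up "jvm.heap.used" at a 10-second interval is one call to enable().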

4. It's easier to collect all the metrics than to figure out which are the right few

How do you know which few metrics you need? Of course, you don't always know in advance. This is the hardest problem and the biggest reason why we collect too much. There are two main approaches to identifying what to measure:

- negative or problem-focused

- positive or goal-focused

The negative approach might alternatively be called the House, MD approach, where we do differential diagnosis to decide which tests to run on the patient. We build a diagnostic handbook for our application by listing each problem, its symptoms, the metrics and value ranges that confirm the problem exists, and the metrics and value ranges that exclude it.

This process has the added advantage of forcing us to identify potential problems, so our QA department can test for these in advance (see The Antifragile Organization). If testing or production shows additional problems, we add each one, along with the metrics we used to identify and diagnose it, to our diagnostic handbook, and keep the collection of those useful metrics enabled, if possible.
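One way to picture such a diagnostic handbook is as plain data: each entry names a problem, its symptoms, the value ranges that confirm it and the value ranges that exclude it. The sketch below is only an illustration under that assumption. The class and function names, the example problem, the metric names and the thresholds are all invented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Range:
    """An inclusive value range for one metric."""
    metric: str
    low: float
    high: float

    def matches(self, readings: Dict[str, float]) -> bool:
        # A metric with no reading can neither confirm nor exclude anything.
        return self.metric in readings and self.low <= readings[self.metric] <= self.high


@dataclass
class Entry:
    problem: str
    symptoms: List[str]
    confirms: List[Range] = field(default_factory=list)
    excludes: List[Range] = field(default_factory=list)


def diagnose(handbook: List[Entry], readings: Dict[str, float]) -> List[str]:
    """Return problems whose confirming ranges all match and that nothing excludes."""
    found = []
    for entry in handbook:
        confirmed = bool(entry.confirms) and all(r.matches(readings) for r in entry.confirms)
        excluded = any(r.matches(readings) for r in entry.excludes)
        if confirmed and not excluded:
            found.append(entry.problem)
    return found


def metrics_needed(handbook: List[Entry]) -> Set[str]:
    """The metrics the handbook actually uses, i.e. the ones worth keeping enabled."""
    return {r.metric for e in handbook for r in e.confirms + e.excludes}


# Invented example entry; the values are placeholders, not recommendations.
HANDBOOK = [
    Entry(
        problem="connection pool exhaustion",
        symptoms=["slow checkout", "timeouts under load"],
        confirms=[Range("db.pool.active_pct", 95, 100)],
        excludes=[Range("db.pool.wait_ms", 0, 5)],
    ),
]
```

Calling diagnose(HANDBOOK, readings) with current metric values narrows the list of suspects, and metrics_needed(HANDBOOK) gives the set of metrics that have earned their place.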

The positive approach is the more familiar one: the SLA. Quantify what we want to achieve as metrics and measure that. We then use externally visible goals like the SLA to drive internal metrics, like measuring every operation comprising a transaction. Then measuring the resources used by the operations comprising a transaction. Then measuring the resources that compete with the resources that impact the operations that comprise a transaction ... And this is the trap. Everything in the entire system contributes to your SLA, so it's tempting to measure and report on everything.

However, considering both approaches together suggests a solution:

1. Measure what you want to achieve

Record user experience, transaction frequency, error rates, availability, system correctness. If you don't measure these, you can't know you have a problem. These metrics are generally those worth reporting to management and your team. (Metrics reporting follies are a topic worthy of a separate post, or book.)

2. Measure what you need to know to solve the problems shown by point #1

Let diagnostic need drive the rest of your metrics, as well as your logging. When a metric proves useful, keep it enabled if it's not costly (and if it is, see if you can get it another way for next time). But don't bother producing reports about these metrics.

3. Disable all the metrics and logging that aren't either (a) identifying problems or (b) helping you solve them

You'll be amazed at how much lighter your load is.
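The three steps above amount to simple set arithmetic over metric names. As a minimal sketch, assuming you can list what is currently enabled, what measures your goals, and what your diagnoses have actually used (the function name and all metric names in the usage note are made up):

```python
from typing import Dict, List, Set


def plan_collection(enabled: Set[str], goal_metrics: Set[str],
                    diagnostic_metrics: Set[str]) -> Dict[str, List[str]]:
    """Sort currently enabled metrics into report, keep quietly, or disable."""
    wanted = goal_metrics | diagnostic_metrics
    return {
        # Step 1: goal metrics are collected and reported.
        "report": sorted(enabled & goal_metrics),
        # Step 2: diagnostic metrics stay on, but no reports are produced.
        "keep_unreported": sorted(enabled & (diagnostic_metrics - goal_metrics)),
        # Step 3: everything else is switched off.
        "disable": sorted(enabled - wanted),
        # Wanted metrics that nobody is collecting yet.
        "not_yet_enabled": sorted(wanted - enabled),
    }
```

For example, with "error_rate" as a goal metric, "db.pool.wait_ms" as a diagnostic metric and "gc.count" enabled out of habit, the plan reports the first, keeps the second and disables the third.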

Tom Fleck is Senior Software Engineer at OC Systems.
