Projects collect lots of metrics that they do not need. All on this forum would agree that measurement is critical. But not all metrics are useful, and too many metrics can be confusing and obscure what's important.
Furthermore, measuring takes time and space resources away from doing. As computers get faster, storage gets cheaper, metrics and logging frameworks come built-in, and data analysis and display become more powerful, the temptation grows to collect everything, just in case you need it.
Here are some observations on why we collect too many metrics, and how we can avoid it.
1. If your job is collecting data, collecting more data makes you look more productive
Collecting metrics is a means to an end, not an end in itself. If you don't get paid unless you find more numbers to squeeze from an application, then your organization needs some adjustment.
Depending on your level in the organization, the jobs should be:
- Ask a question that a metric could answer
- Decide what metric answers a question
- Implement the collection of a requested metric
- Answer a question using the collected metric values
The end goal of metrics is either to identify a problem or to fix one.
2. Sometimes you can spot "anomalies" by looking at metrics you might not have thought relevant
This is actually the most compelling argument for collecting a lot of metrics. But this should be done by choice, in a purposeful way, in a non-production but realistically loaded environment, and the result should be analyzed by somebody with the time and qualifications to judge the value of these metrics. Just turning on all the metrics all the time and hoping the bug will jump out at you when you need it is not an engineering approach.
3. It's easier to browse existing metrics than to figure out how to enable a new metric
It shouldn't be, especially if it's one of the many that you would have been collecting already. Good tools and infrastructure should make the mechanics easy, and their use is something your developers and operations people should know: How do I enable/disable specific metrics and adjust their collection frequency and persistence? Whether it's one app-server's JMX metrics or your external network bandwidth, somebody around there should know the points at which metrics are collected, how these are configured, and where the results go. If not, then that's a problem to address.
When the person who knows is explicitly asked to look at the metrics being collected, chances are they'll see some that are not used or useful. Or, they might see metrics or logging that are not enabled, but would have been useful in the past, and that's even better. Either way: a requirement of your application's implementation and documentation should be how to easily control metrics collection.
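As a rough illustration of what "easily control metrics collection" could look like, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `MetricSetting` and `MetricControl` names, the metric names, and the default values are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class MetricSetting:
    """Collection settings for one metric (hypothetical structure)."""
    enabled: bool = False
    interval_seconds: int = 60   # how often the metric is sampled
    persist: bool = False        # whether samples are kept long term


class MetricControl:
    """One documented place to switch metrics on and off (illustrative only)."""

    def __init__(self):
        self._settings = {}

    def register(self, name, **kwargs):
        # Record a metric the application is able to collect.
        self._settings[name] = MetricSetting(**kwargs)

    def enable(self, name, interval_seconds=None, persist=None):
        # Turn a metric on, optionally adjusting frequency and persistence.
        setting = self._settings[name]
        setting.enabled = True
        if interval_seconds is not None:
            setting.interval_seconds = interval_seconds
        if persist is not None:
            setting.persist = persist

    def disable(self, name):
        self._settings[name].enabled = False

    def enabled_metrics(self):
        return sorted(n for n, s in self._settings.items() if s.enabled)


# Hypothetical metric names, chosen only for the example.
control = MetricControl()
control.register("jvm.heap.used")
control.register("net.external.bandwidth", enabled=True, interval_seconds=300)

control.enable("jvm.heap.used", interval_seconds=10)   # needed for a diagnosis
control.disable("net.external.bandwidth")              # nobody was using it

print(control.enabled_metrics())  # ['jvm.heap.used']
```

The point of the sketch is not the class itself but that enabling, disabling, and tuning a metric is a single, documented call that anyone on the team can find.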
4. It's easier to collect all the metrics than to figure out which are the right few
How do you know which few metrics you need? Of course you don't, always, in advance. This is the hardest problem and the biggest reason why we collect too much. There are two main approaches to identifying what to measure:
- negative or problem-focused
- positive or goal-focused
The negative approach might alternatively be called the House, MD approach, where we do differential diagnosis to decide which tests to run on the patient. We build a diagnostic handbook for our application by listing problems, symptoms, metrics and value ranges which confirm the problem exists; and/or metrics and value ranges which exclude that problem.
This process has the added advantage of forcing us to identify potential problems, so our QA department can test for these in advance (see The Antifragile Organization). If testing or production shows additional problems, we add that problem, along with the metrics we used to identify and diagnose it, to our diagnostic handbook, and keep the collection of those useful metrics enabled, if possible.
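A diagnostic handbook of this kind can be kept as plain data. The following is a minimal sketch in Python, not a prescribed format: the `HandbookEntry` structure, the problems, the metric names, and the value ranges are all invented for illustration. Each entry lists a problem, its symptoms, the metric ranges that confirm it, and the ranges that exclude it.

```python
from dataclasses import dataclass, field


@dataclass
class HandbookEntry:
    """One problem in the diagnostic handbook (hypothetical structure)."""
    problem: str
    symptoms: list
    confirm: dict = field(default_factory=dict)  # metric -> (low, high)
    exclude: dict = field(default_factory=dict)  # metric -> (low, high)


def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def assess(entry, readings):
    """Return 'excluded', 'confirmed', or 'undetermined' for one problem."""
    # Any reading inside an excluding range rules the problem out.
    if any(m in readings and in_range(readings[m], b)
           for m, b in entry.exclude.items()):
        return "excluded"
    # Every confirming metric must be present and inside its range.
    if entry.confirm and all(m in readings and in_range(readings[m], b)
                             for m, b in entry.confirm.items()):
        return "confirmed"
    return "undetermined"


def metrics_in_use(handbook):
    """Metrics the handbook relies on, i.e. the ones worth keeping enabled."""
    names = set()
    for entry in handbook:
        names.update(entry.confirm)
        names.update(entry.exclude)
    return sorted(names)


# Invented example entries; real ranges would come from testing and incidents.
handbook = [
    HandbookEntry(
        problem="connection pool exhaustion",
        symptoms=["slow checkout page"],
        confirm={"db.pool.active_pct": (95, 100),
                 "db.pool.wait_ms": (500, float("inf"))},
        exclude={"db.pool.active_pct": (0, 50)},
    ),
    HandbookEntry(
        problem="memory leak",
        symptoms=["slow checkout page", "long GC pauses"],
        confirm={"jvm.heap.used_pct": (90, 100)},
        exclude={"jvm.heap.used_pct": (0, 60)},
    ),
]

readings = {"db.pool.active_pct": 98, "db.pool.wait_ms": 1200,
            "jvm.heap.used_pct": 45}

results = {e.problem: assess(e, readings) for e in handbook}
print(results)
# {'connection pool exhaustion': 'confirmed', 'memory leak': 'excluded'}
print(metrics_in_use(handbook))
# ['db.pool.active_pct', 'db.pool.wait_ms', 'jvm.heap.used_pct']
```

When a new problem turns up in testing or production, it becomes one more entry, and `metrics_in_use` shows which metrics the handbook now depends on.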
The positive approach is the more familiar one: the SLA. Quantify what we want to achieve as metrics and measure that. We then use externally visible goals like the SLA to drive internal metrics, like measuring every operation comprising a transaction. Then measuring the resources used by the operations comprising a transaction. Then measuring the resources that compete with the resources that impact the operations that comprise a transaction ... And this is the trap. Everything in the entire system contributes to your SLA, so it's tempting to measure and report on everything.
However, considering both approaches together suggests a solution:
1. Measure what you want to achieve
Record user experience, transaction frequency, error rates, availability, system correctness. If you don't measure these, you can't know you have a problem. These metrics are generally those worth reporting to management and your team. (Metrics reporting follies are a topic worthy of a separate post, or book.)
2. Measure what you need to know to solve the problems shown by point #1
Let diagnostic need drive the rest of your metrics, as well as your logging. When a metric proves useful, keep it enabled if it's not costly (and if it is, see if you can get it another way for next time). But don't bother producing reports about these metrics.
3. Disable all the metrics and logging that aren't either (a) identifying problems or (b) helping you solve them
You'll be amazed at how much lighter your load is.
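The three steps above boil down to a set difference, sketched here in Python. The function name and all metric names are hypothetical, chosen only to make the example concrete: keep what measures your goals, keep what has helped diagnose a problem, and treat the rest as candidates for disabling.

```python
def metrics_to_disable(collected, goal_metrics, diagnostic_metrics):
    """Return collected metrics that serve neither a goal nor a diagnosis."""
    keep = set(goal_metrics) | set(diagnostic_metrics)
    return sorted(set(collected) - keep)


# Invented inventory of what is currently being collected.
collected = [
    "checkout.error_rate",
    "checkout.response_ms",
    "db.pool.active_pct",
    "jvm.threads.daemon_count",
    "disk.tmp.inode_count",
]
goal_metrics = ["checkout.error_rate", "checkout.response_ms"]   # step 1
diagnostic_metrics = ["db.pool.active_pct"]                      # step 2

print(metrics_to_disable(collected, goal_metrics, diagnostic_metrics))  # step 3
# ['disk.tmp.inode_count', 'jvm.threads.daemon_count']
```

The output is a review list rather than a verdict: someone who knows the system should still confirm that each candidate really is unused before it is switched off.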
Tom Fleck is Senior Software Engineer at OC Systems.