Projects collect lots of metrics that they do not need. Everyone on this forum would agree that measurement is critical. But not all metrics are useful, and too many metrics can be confusing and obscure what's important.
Furthermore, measuring takes time and storage away from doing. As computers get faster, storage gets cheaper, metrics and logging frameworks come built in, and data analysis and display become more powerful, the temptation grows to collect everything, just in case you need it.
Here are some observations on why we collect too many metrics, and how we can avoid it.
1. If your job is collecting data, collecting more data makes you look more productive
Collecting metrics is a means to an end, not an end in itself. If you don't get paid unless you find more numbers to squeeze from an application, then your organization needs some adjustment.
Depending on your level in the organization, your job should be one of these:
- Ask a question that a metric could answer
- Decide what metric answers a question
- Implement the collection of a requested metric
- Answer a question using the collected metric values
The end goal of metrics is either to identify a problem or to fix one.
2. Sometimes you can see "anomalies" looking at other metrics you might not think relevant
This is actually the most compelling argument for collecting a lot of metrics. But this should be done by choice, in a purposeful way, in a non-production but realistically loaded environment, and the result should be analyzed by somebody with the time and qualifications to judge the value of these metrics. Turning on all the metrics all the time and hoping the bug will jump out at you when you need it is not an engineering approach.
3. It's easier to browse existing metrics than to figure out how to enable a new metric
It shouldn't be, especially if it's one of the many that you would have been collecting already. Good tools and infrastructure should make the mechanics easy, and their use is something your developers and operations people should know: How do I enable/disable specific metrics and adjust their collection frequency and persistence? Whether it's one app-server's JMX metrics or your external network bandwidth, somebody around there should know the points at which metrics are collected, how these are configured, and where the results go. If not, then that's a problem to address.
When the person who knows is explicitly asked to look at the metrics being collected, chances are they'll see some that are not used or useful. Or, they might see metrics or logging that are not enabled, but would have been useful in the past, and that's even better. Either way: a requirement of your application's implementation and documentation should be how to easily control metrics collection.
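As a rough illustration of what "easy to control" could look like, here is a minimal sketch of a single configuration object that switches individual metrics on and off and sets their collection frequency and persistence. The class names, field names, and metric names are all hypothetical; a real application would map these settings onto whatever metrics framework it already uses.

```python
from dataclasses import dataclass, field


@dataclass
class MetricSetting:
    """Collection settings for one metric (all fields are illustrative)."""
    enabled: bool = False
    interval_seconds: int = 60   # how often the value is sampled
    retention_days: int = 7      # how long samples are kept


@dataclass
class MetricsConfig:
    """One documented place to control what is collected."""
    settings: dict[str, MetricSetting] = field(default_factory=dict)

    def enable(self, name, interval_seconds=60, retention_days=7):
        self.settings[name] = MetricSetting(True, interval_seconds, retention_days)

    def disable(self, name):
        # Keep the entry so the metric stays discoverable while switched off.
        self.settings.setdefault(name, MetricSetting()).enabled = False

    def enabled_metrics(self):
        return sorted(n for n, s in self.settings.items() if s.enabled)


# Hypothetical metric names, chosen only for the example.
config = MetricsConfig()
config.enable("jvm.heap_used", interval_seconds=30)
config.enable("network.external_bandwidth", interval_seconds=300, retention_days=30)
config.disable("jvm.heap_used")
print(config.enabled_metrics())
```

Keeping disabled metrics in the configuration, rather than deleting them, means the person reviewing it can see both what is collected and what could be collected.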
4. It's easier to collect all the metrics than to figure out which are the right few
How do you know which few metrics you need? Of course, you don't always know in advance. This is the hardest problem and the biggest reason why we collect too much. There are two main approaches to identifying what to measure:
- negative or problem-focused
- positive or goal-focused
The negative approach might alternatively be called the House, MD approach, where we do differential diagnosis to decide which tests to run on the patient. We build a diagnostic handbook for our application by listing each problem, its symptoms, the metrics and value ranges that confirm the problem exists, and the metrics and value ranges that rule it out.
This process has the added advantage of forcing us to identify potential problems, so our QA department can test for these in advance (see The Antifragile Organization). If testing or production reveals a new problem, we add it to our diagnostic handbook, along with the metrics we used to identify and diagnose it, and keep the collection of those useful metrics enabled, if possible.
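A diagnostic handbook of this kind can be sketched as plain data. The example below is an assumption-laden illustration, not a prescribed format: the problems, metric names, and thresholds are invented, and the rule that all confirming ranges must match while any excluding range rules a problem out is one possible reading of "confirm" and "exclude".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """A metric and the value range that is significant for a problem."""
    metric: str
    low: float
    high: float

    def matches(self, readings):
        value = readings.get(self.metric)
        return value is not None and self.low <= value <= self.high


@dataclass(frozen=True)
class Diagnosis:
    problem: str
    symptom: str
    confirms: tuple        # every check must match to confirm the problem
    excludes: tuple = ()   # any matching check rules the problem out


def diagnose(handbook, readings):
    """Return the problems whose confirming checks all match."""
    found = []
    for entry in handbook:
        if any(c.matches(readings) for c in entry.excludes):
            continue
        if entry.confirms and all(c.matches(readings) for c in entry.confirms):
            found.append(entry.problem)
    return found


def metrics_to_collect(handbook):
    """The handbook itself tells us which metrics are worth enabling."""
    return sorted({c.metric for e in handbook for c in e.confirms + e.excludes})


# Hypothetical entries and thresholds, for illustration only.
HANDBOOK = [
    Diagnosis(
        problem="connection pool exhaustion",
        symptom="slow responses under load",
        confirms=(Check("db.pool.active_pct", 95, 100),
                  Check("request.latency_ms_p95", 2000, float("inf"))),
    ),
    Diagnosis(
        problem="memory leak",
        symptom="periodic freezes",
        confirms=(Check("jvm.heap_used_pct", 90, 100),
                  Check("jvm.gc_pause_ms_p95", 500, float("inf"))),
        excludes=(Check("jvm.heap_used_pct", 0, 70),),
    ),
]

readings = {"db.pool.active_pct": 98, "request.latency_ms_p95": 3500,
            "jvm.heap_used_pct": 55, "jvm.gc_pause_ms_p95": 40}
print(diagnose(HANDBOOK, readings))
print(metrics_to_collect(HANDBOOK))
```

A useful side effect of this structure is that the list of metrics to keep enabled falls out of the handbook, rather than being decided separately.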
The positive approach is the more familiar one: the SLA. Quantify what we want to achieve as metrics and measure that. We then use externally visible goals like the SLA to drive internal metrics, like measuring every operation comprising a transaction. Then measuring the resources used by the operations comprising a transaction. Then measuring the resources that compete with the resources that impact the operations that comprise a transaction ... And this is the trap. Everything in the entire system contributes to your SLA, so it's tempting to measure and report on everything.
However, considering both approaches together suggests a solution:
1. Measure what you want to achieve
Record user experience, transaction frequency, error rates, availability, and system correctness. If you don't measure these, you can't know you have a problem. These are generally the metrics worth reporting to management and your team. (Metrics reporting follies are a topic worthy of a separate post, or book.)
2. Measure what you need to know to solve the problems shown by point #1
Let diagnostic need drive the rest of your metrics, as well as your logging. When a metric proves useful, keep it enabled if it's not costly (and if it is, see if you can get it another way for next time). But don't bother producing reports about these metrics.
3. Disable all the metrics and logging that aren't either (a) identifying problems or (b) helping you solve them
You'll be amazed at how much lighter your load is.
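The three steps above can be sketched as a simple audit of what is currently collected. This is only an illustration under stated assumptions: the metric names are invented, and the split into "goal" metrics (step 1) and "diagnostic" metrics (step 2) is something each team would have to define for itself.

```python
def audit_metrics(collected, goal, diagnostic):
    """Sort collected metrics into report, keep-quietly, and disable groups."""
    return {
        # Step 1: goal metrics are kept and reported.
        "report": sorted(collected & goal),
        # Step 2: diagnostic metrics are kept, but not reported on.
        "keep": sorted((collected & diagnostic) - goal),
        # Step 3: everything else is a candidate for disabling.
        "disable": sorted(collected - goal - diagnostic),
        # Needed metrics that nobody is collecting yet.
        "missing": sorted((goal | diagnostic) - collected),
    }


# Hypothetical metric names, for illustration only.
collected = {"availability_pct", "error_rate", "db.pool.active_pct",
             "thread.count", "disk.inode_free", "cache.evictions"}
goal = {"availability_pct", "error_rate", "transactions_per_min"}
diagnostic = {"db.pool.active_pct", "jvm.heap_used_pct"}

plan = audit_metrics(collected, goal, diagnostic)
for group, names in plan.items():
    print(group, names)
```

The "missing" group matters as much as the "disable" group: the audit can reveal metrics that would have been useful but were never enabled.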
Tom Fleck is Senior Software Engineer at OC Systems.