One of the common questions that every IT manager asks on a regular basis is, “Why is my application so slow today when everything was fine yesterday?” Application Performance Management (APM) is the only way to truly answer that question, and it is one of the must-have tools for every IT manager.
With this APM imperative in mind, the following are 10 capabilities every IT manager should look for when choosing an APM solution:
1. Real-time monitoring
Real-time monitoring is a must. When digging into a problem, tracking events in real-time as they occur is by far more effective than doing so via “post-mortem” analysis. There are many APM vendors that claim to provide real-time monitoring but sometimes they really mean “near real-time”, with delays from 30 seconds to five minutes, typically. This restricts your ability to analyze and react to events in real-time. Make sure real-time is truly real-time. Real-time monitoring should provide you with important metrics such as: Who is doing what, how much resources are being taken, and who is affecting who right now?
2. Rich data repository
Sometimes you get lucky and witness a problem in real-time. But in most cases, this doesn’t happen. This is why a good APM solution must be able to collect all transaction activity and performance metrics into a rich, but light-weight repository.
3. “Single anomaly” granularity
Some APM vendors store the statistics they gather but they aggregate it to save disk space or because they just can’t handle too much data in a reasonable amount of time. Analyzing performance incidents based on aggregated data is similar to assessing a book by reading only its rear cover. You get the general idea but you have no ability to understand what really happened. That’s why good APM solutions must give you all of the granular information including individual transactions and their characteristics, resource consumption, traffic order (chain of events) etc.
4. Measuring Quality of Service (QoS) and Service Level Agreements (SLAs)
APM solutions are designed to improve the end user experience. Improving user experience starts by measuring it and identifying QoS and SLA anomalies. Only then can you make informed decisions and take action. You should also have the ability to compare user experience before and after a change is applied to your systems.
5. Performance proactivity – enforcing QoS and SLA
Some APM solutions enable users to analyze performance data and identify root problems retroactively, but do nothing to enable real-time resolution of performance issues. Because these solutions are fundamentally passive by nature, you have no choice but to wait for application performance to nosedive before corrective action can be taken. And in these cases, the wait time from issue identification to resolution can be hours or even days. Avoiding QoS problems can be achieved only if you take proactive steps. Proactive APM solution can turn this: “I got a text message at 2:00AM from our APM tool that indicated that we had a QoS problem so I logged into the system and solved it,” into: “I got a text message at 8:00 AM from our APM tool letting me know that at 1:50 AM a QoS problem was about to occur and it took care of it automatically.” Being proactivite can be achieved in many ways: by activating automatic scripts, managing system resources, and triggering third party tools, etc.
6. Detecting bottlenecks and root cause analysis
If an APM tool only notifies you that you ran out of system resources because of job X, then you don’t really have root cause analysis capabilities. Root cause analysis is when your APM tool tells you that this job usually runs at 8:00 PM but because of problem on a secondary system, it has started 1 hour later and collided with another job that was scheduled to run at the same time. APM tools must do the hard work of correlating many little pieces of data so that you can get to the source of the problem. Otherwise you will find yourself trying to assemble a 1,000 piece puzzle while your CEO knocks on your door every 5 minutes looking for answers.
7. Chain reaction analysis
Analyzing a problem can take many shapes. The conventional way is by digging into the top-10 hit lists. But those top-10 lists always miss something - the chain of events. Who came first, who came after, “it was all fine until this transaction came in”, etc. Analyzing the chain of events before the system crashed is crucial if you wish to avoid this problem in the future. An APM tool should give you the ability to travel back in time and look into the granular metrics second by second as if you were watching a movie in slow motion. This is possible only if the APM tool collects data at a very high level of granularity and does not lose it over time (i.e. it retains the raw collected metrics).
8. Performance comparisons
There are two main performance troubleshooting approaches that an APM tool should support. Performance drill downs to a specific period of time, and performance comparison. If you have a performance problem now, but all was fine yesterday, you must assume that something has changed. Hunting for those changes will lead you to the root cause much quicker than a conventional drill down into the current problem's performance metrics. You should have the ability to answer questions like these in seconds: “Is this new storage system I just implemented faster than the old one we had?” and “why is it working very well in QA but not in production?” If your APM tool collects and stores raw performance metrics, by comparing those metrics you can easily answer all these questions and dramatically shorten your mean time to recovery.
9. Business Intelligence-like dashboard
When an APM tool stores millions of pieces of raw (and aggregated) data, it should also deliver a convenient way to slice and dice this data. Some APM tools will decide for you the best way to process this data by providing a pre-defined set of graph and report templates. A good APM tool will let you decide how you want to slice and dice this data by giving you a flexible and easy to use BI-like dashboard where you can drag and drop dimensions and drill down by double clicking in order to answer questions like, “What user consumed most of my CPU and what is the top program he/she has been using that caused the most impact?”
10.Charge back capability
Bad performance usually starts with bad design or bad coding and very rarely stems from hardware faults. If a developer writes a poor piece of code, the IT division needs to spend more money on hardware or software licenses to deal with it. This is why it’s becoming popular in many organizations to turn this dynamic upside down - here the annual budgets are distributed between the application development divisions that use this money to buy IT services from their IT division. If they write poor code they ultimately need to pay more. This is workable only if the IT department has an APM tool that can measure and enforce resources usage by ‘tenant’. This approach has proven to be effective in helping companies reduce their IT budget quite significantly.
ABOUT Irad Deutsch
Irad Deutsch is a CTO at Veracity group, an international software infrastructure integrator. Irad is also the CTO of MORE IT Resources - MoreVRP, a provider of application and database performance optimization solutions.
The journey of maturing observability practices for users entails navigating peaks and valleys. Users have clearly witnessed the maturation of their monitoring capabilities, embraced DevOps practices, and adopted cloud and cloud-native technologies. Notwithstanding that, we witness the gradual increase of the Mean Time To Recovery (MTTR) for production issues year over year ...
Optimizing existing use of cloud is the top initiative — for the seventh year in a row, reported by 62% of respondents in the Flexera 2023 State of the Cloud Report ...
Gartner highlighted four trends impacting cloud, data center and edge infrastructure in 2023, as infrastructure and operations teams pivot to support new technologies and ways of working during a year of economic uncertainty ...
Developers need a tool that can be portable and vendor agnostic, given the advent of microservices. It may be clear an issue is occurring; what may not be clear is if it's part of a distributed system or the app itself. Enter OpenTelemetry, commonly referred to as OTel, an open-source framework that provides a standardized way of collecting and exporting telemetry data (logs, metrics, and traces) from cloud-native software ...
As SLOs grow in popularity their usage is becoming more mature. For example, 82% of respondents intend to increase their use of SLOs, and 96% have mapped SLOs directly to their business operations or already have a plan to, according to The State of Service Level Objectives 2023 from Nobl9 ...
Observability has matured beyond its early adopter position and is now foundational for modern enterprises to achieve full visibility into today's complex technology environments, according to The State of Observability 2023, a report released by Splunk in collaboration with Enterprise Strategy Group ...
Before network engineers even begin the automation process, they tend to start with preconceived notions that oftentimes, if acted upon, can hinder the process. To prevent that from happening, it's important to identify and dispel a few common misconceptions currently out there and how networking teams can overcome them. So, let's address the three most common network automation myths ...
Many IT organizations apply AI/ML and AIOps technology across domains, correlating insights from the various layers of IT infrastructure and operations. However, Enterprise Management Associates (EMA) has observed significant interest in applying these AI technologies narrowly to network management, according to a new research report, titled AI-Driven Networks: Leveling Up Network Management with AI/ML and AIOps ...
When it comes to system outages, AIOps solutions with the right foundation can help reduce the blame game so the right teams can spend valuable time restoring the impacted services rather than improving their MTTI score (mean time to innocence). In fact, much of today's innovation around ChatGPT-style algorithms can be used to significantly improve the triage process and user experience ...
Gartner identified the top 10 data and analytics (D&A) trends for 2023 that can guide D&A leaders to create new sources of value by anticipating change and transforming extreme uncertainty into new business opportunities ...