One of the common questions that every IT manager asks on a regular basis is, “Why is my application so slow today when everything was fine yesterday?” Application Performance Management (APM) is the only way to truly answer that question, and it is one of the must-have tools for every IT manager.
With this APM imperative in mind, the following are 10 capabilities every IT manager should look for when choosing an APM solution:
1. Real-time monitoring
Real-time monitoring is a must. When digging into a problem, tracking events as they occur is far more effective than “post-mortem” analysis. Many APM vendors claim to provide real-time monitoring, but sometimes they really mean “near real-time”, typically with delays of 30 seconds to five minutes. This restricts your ability to analyze and react to events as they happen. Make sure real-time is truly real-time. Real-time monitoring should answer questions such as: Who is doing what, how many resources are being consumed, and who is affecting whom right now?
2. Rich data repository
Sometimes you get lucky and witness a problem in real-time. But in most cases, this doesn’t happen. This is why a good APM solution must be able to collect all transaction activity and performance metrics into a rich but lightweight repository.
3. “Single anomaly” granularity
Some APM vendors store the statistics they gather but aggregate them, either to save disk space or because they cannot process that much data in a reasonable amount of time. Analyzing performance incidents based on aggregated data is like assessing a book by reading only its back cover. You get the general idea, but you cannot understand what really happened. That’s why a good APM solution must give you all of the granular information, including individual transactions and their characteristics, resource consumption, traffic order (chain of events), etc.
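To illustrate the difference between aggregated and granular data, here is a minimal Python sketch. The response times, the threshold, and the function names are hypothetical and chosen only for illustration; they do not come from any particular APM product.

```python
from statistics import mean

# Hypothetical response times (ms) for the transactions in one interval.
response_times_ms = [120, 115, 130, 118, 4200, 122, 119, 125]


def aggregated_view(samples):
    """What an aggregating tool keeps: one average per interval."""
    return mean(samples)


def anomalies(samples, threshold_ms):
    """What raw data allows: locating each slow transaction by position."""
    return [(i, t) for i, t in enumerate(samples) if t > threshold_ms]


avg = aggregated_view(response_times_ms)
slow = anomalies(response_times_ms, threshold_ms=1000)

print(f"average: {avg} ms")
print(f"slow transactions (index, ms): {slow}")
```

The average says the interval was slow overall, but only the raw samples show that a single transaction was responsible.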
4. Measuring Quality of Service (QoS) and Service Level Agreements (SLAs)
APM solutions are designed to improve the end user experience. Improving user experience starts by measuring it and identifying QoS and SLA anomalies. Only then can you make informed decisions and take action. You should also have the ability to compare user experience before and after a change is applied to your systems.
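As a rough sketch of what measuring an SLA before and after a change could look like, the Python example below computes the share of transactions that met a response-time target. The sample data, the 500 ms target, and the name `sla_compliance` are assumptions made for illustration.

```python
def sla_compliance(response_times_ms, target_ms):
    """Return the fraction of transactions that finished within the target."""
    if not response_times_ms:
        raise ValueError("no samples to evaluate")
    met = sum(1 for t in response_times_ms if t <= target_ms)
    return met / len(response_times_ms)


# Hypothetical measurements taken before and after a system change.
before_change = [180, 220, 950, 210, 1900, 190, 205, 230, 1200, 200]
after_change = [170, 190, 240, 185, 900, 175, 180, 210, 260, 195]
TARGET_MS = 500  # assumed SLA target

before = sla_compliance(before_change, TARGET_MS)
after = sla_compliance(after_change, TARGET_MS)

print(f"SLA compliance before: {before:.0%}, after: {after:.0%}")
```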
5. Performance proactivity – enforcing QoS and SLA
Some APM solutions enable users to analyze performance data and identify root problems retroactively, but do nothing to enable real-time resolution of performance issues. Because these solutions are fundamentally passive, you have no choice but to wait for application performance to nosedive before corrective action can be taken. In these cases, the wait from issue identification to resolution can be hours or even days. QoS problems can be avoided only if you take proactive steps. A proactive APM solution can turn this: “I got a text message at 2:00 AM from our APM tool that indicated that we had a QoS problem so I logged into the system and solved it,” into: “I got a text message at 8:00 AM from our APM tool letting me know that at 1:50 AM a QoS problem was about to occur and it took care of it automatically.” Being proactive can be achieved in many ways: by activating automatic scripts, managing system resources, triggering third-party tools, etc.
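The idea of acting before a QoS problem occurs can be sketched in a few lines of Python. This is a deliberately naive illustration, assuming a straight-line forecast and a caller-supplied remediation callback; the names `projected_value` and `enforce` and the sample figures are hypothetical, and a real product would use far more robust prediction.

```python
def projected_value(samples, steps_ahead):
    """Naive forecast: extend the straight line from first to last sample."""
    if len(samples) < 2:
        return samples[-1]
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * steps_ahead


def enforce(samples, limit, steps_ahead, remediate):
    """Run the remediation callback if the forecast reaches the limit."""
    if projected_value(samples, steps_ahead) >= limit:
        remediate()
        return True
    return False


actions = []  # stands in for running a script or calling a third-party tool
cpu_percent = [40, 48, 55, 63, 70]  # hypothetical recent CPU samples

triggered = enforce(
    cpu_percent,
    limit=90,
    steps_ahead=3,
    remediate=lambda: actions.append("throttle batch job"),
)
print(triggered, actions)
```

Here the current value (70%) is still below the limit, yet the trend projects a breach within three samples, so the corrective action runs before users are affected.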
6. Detecting bottlenecks and root cause analysis
If an APM tool only notifies you that you ran out of system resources because of job X, then you don’t really have root cause analysis capabilities. Root cause analysis is when your APM tool tells you that this job usually runs at 8:00 PM, but because of a problem on a secondary system it started one hour late and collided with another job that was scheduled to run at that later time. APM tools must do the hard work of correlating many little pieces of data so that you can get to the source of the problem. Otherwise, you will find yourself trying to assemble a 1,000-piece puzzle while your CEO knocks on your door every five minutes looking for answers.
7. Chain reaction analysis
Analyzing a problem can take many shapes. The conventional way is to dig into the top-10 hit lists. But those top-10 lists always miss something: the chain of events. Who came first, who came after, “it was all fine until this transaction came in”, etc. Analyzing the chain of events before the system crashed is crucial if you wish to avoid this problem in the future. An APM tool should give you the ability to travel back in time and look into the granular metrics second by second, as if you were watching a movie in slow motion. This is possible only if the APM tool collects data at a very high level of granularity and does not lose it over time (i.e., it retains the raw collected metrics).
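A minimal Python sketch of replaying the chain of events might look like the following. It assumes raw events were retained with a per-second timestamp; the `Event` fields, the transaction names, and the CPU limit are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    second: int        # time of the event, in seconds
    transaction: str   # hypothetical transaction name
    cpu_percent: int   # CPU consumed by this transaction


def events_before(events, crash_second, window_seconds):
    """Return events in the window before the crash, in time order."""
    start = crash_second - window_seconds
    window = [e for e in events if start <= e.second < crash_second]
    return sorted(window, key=lambda e: e.second)


def first_over(timeline, cpu_limit):
    """Return the first event in the timeline that exceeded the CPU limit."""
    for event in timeline:
        if event.cpu_percent > cpu_limit:
            return event
    return None


# Raw events as collected, not necessarily stored in time order.
events = [
    Event(100, "report_export", 35),
    Event(97, "login", 5),
    Event(103, "bulk_import", 85),
    Event(90, "nightly_backup", 20),
    Event(104, "checkout", 10),
]

timeline = events_before(events, crash_second=105, window_seconds=10)
culprit = first_over(timeline, cpu_limit=80)
print([e.transaction for e in timeline], culprit)
```

A top-10 list would only rank these transactions by cost; the ordered timeline shows that everything was fine until `bulk_import` arrived.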
8. Performance comparisons
There are two main performance troubleshooting approaches that an APM tool should support: performance drill-downs into a specific period of time, and performance comparisons. If you have a performance problem now, but all was fine yesterday, you must assume that something has changed. Hunting for those changes will lead you to the root cause much more quickly than a conventional drill-down into the current problem's performance metrics. You should be able to answer questions like these in seconds: “Is this new storage system I just implemented faster than the old one we had?” and “Why is it working very well in QA but not in production?” If your APM tool collects and stores raw performance metrics, comparing those metrics lets you answer such questions easily and dramatically shorten your mean time to recovery.
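The comparison approach can be sketched in Python as a ranking of metrics by how much they changed between two periods. The metric names and values below are invented for illustration, and `compare_periods` is a hypothetical helper, not part of any APM tool.

```python
def compare_periods(baseline, current):
    """Rank metrics present in both periods by relative change, largest first."""
    changes = []
    for name in baseline.keys() & current.keys():
        old, new = baseline[name], current[name]
        if old == 0:
            continue  # relative change is undefined for a zero baseline
        changes.append((name, (new - old) / old))
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical averages for the same hour yesterday and today.
yesterday = {"cpu_percent": 40.0, "disk_read_ms": 5.0, "active_sessions": 200.0}
today = {"cpu_percent": 44.0, "disk_read_ms": 20.0, "active_sessions": 210.0}

ranked = compare_periods(yesterday, today)
for name, change in ranked:
    print(f"{name}: {change:+.0%}")
```

Ranking by relative change points straight at the metric that moved the most, which is usually the best place to start looking for what changed.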
9. Business Intelligence-like dashboard
When an APM tool stores millions of pieces of raw (and aggregated) data, it should also deliver a convenient way to slice and dice this data. Some APM tools decide for you how to process this data by providing a predefined set of graph and report templates. A good APM tool lets you decide how to slice and dice the data by giving you a flexible, easy-to-use BI-like dashboard where you can drag and drop dimensions and drill down by double-clicking in order to answer questions like, “What user consumed most of my CPU and what is the top program he/she has been using that caused the most impact?”
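The question quoted above amounts to grouping by one dimension and then drilling down into another. A small Python sketch under assumed data follows; the record layout, user names, and program names are hypothetical.

```python
from collections import defaultdict


def total_by(records, dimension, measure="cpu_seconds"):
    """Sum a measure per value of the chosen dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record[measure]
    return dict(totals)


def top(totals):
    """Return the key with the largest total."""
    return max(totals, key=totals.get)


# Hypothetical raw usage records.
records = [
    {"user": "alice", "program": "report.exe", "cpu_seconds": 30.0},
    {"user": "bob", "program": "etl.py", "cpu_seconds": 120.0},
    {"user": "bob", "program": "report.exe", "cpu_seconds": 15.0},
    {"user": "alice", "program": "etl.py", "cpu_seconds": 10.0},
]

# Slice by user, then drill down into that user's programs.
top_user = top(total_by(records, "user"))
user_records = [r for r in records if r["user"] == top_user]
top_program = top(total_by(user_records, "program"))
print(top_user, top_program)
```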
10. Charge back capability
Bad performance usually starts with bad design or bad coding and very rarely stems from hardware faults. If a developer writes a poor piece of code, the IT division needs to spend more money on hardware or software licenses to deal with it. This is why it’s becoming popular in many organizations to turn this dynamic upside down: the annual budgets are distributed among the application development divisions, which use this money to buy IT services from their IT division. If they write poor code, they ultimately need to pay more. This is workable only if the IT department has an APM tool that can measure and enforce resource usage by “tenant”. This approach has proven to be effective in helping companies reduce their IT budget quite significantly.
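One simple way to picture charge back is a proportional split of a shared cost by measured usage. The Python sketch below assumes that model; the tenant names, usage units, and total cost are hypothetical, and real charge back schemes may weight resources differently.

```python
def chargeback(usage_by_tenant, total_cost):
    """Split a shared cost among tenants in proportion to measured usage."""
    total_usage = sum(usage_by_tenant.values())
    if total_usage == 0:
        raise ValueError("no usage recorded; nothing to allocate")
    return {
        tenant: total_cost * usage / total_usage
        for tenant, usage in usage_by_tenant.items()
    }


# Hypothetical CPU-hours measured per application team ("tenant").
usage = {"billing_app": 600, "crm_app": 300, "intranet": 100}
bill = chargeback(usage, total_cost=50_000)

for tenant, amount in bill.items():
    print(f"{tenant}: {amount:,.2f}")
```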
ABOUT Irad Deutsch
Irad Deutsch is a CTO at Veracity group, an international software infrastructure integrator. Irad is also the CTO of MORE IT Resources - MoreVRP, a provider of application and database performance optimization solutions.