Chasing a Moving Target: APM in the Cloud - Part 2
Detection, Analysis and Action
February 21, 2013

Albert Mavashev
Nastel Technologies

Share this

In my last blog, I discussed strategies for dealing with the complexities of monitoring performance in the various stacks that make up a cloud implementation. Here, we will look at ways to detect trends, analyze your data, and act on it.

The first requirement for detecting trends in application performance in the cloud is to have good information delivered in a timely manner about each stack as well as the application.
  
We acquire this information via data collectors that harvest all relevant indicators within the same clock tick. For example: response time, GC activity, memory usage, CPU Usage. Doing this within the same clock tick is called serialization. It is of little use to know I have a failed transaction at time X, but only have CPU and memory data from X minus 10 minutes.

Next, we require a history for each metric. This can be maintained in memory for near real-time analysis, but we also need to use slower storage for longer-term views.

Finally, we apply pattern matching to the data. We might scan and match metrics such as “find all applications whose GC is above High Bollinger Band for 2+ samples.” Doing this in memory can enable very fast detection across a large number of indicators.

Here are three steps you can use to detect performance trends

1. Measure the relevant application performance indicators on the business side such as orders filled, failed or missed. And then, measure the ones on the IT side such as  JVM GC activity, memory, I/O rates.

2. Create a base line for each relevant indicator. This could a 1- to60-second sampling for near real-time monitoring. In addition set up a 1-, 10- and 15-minute sample or even daily, weekly or monthly for those longer in duration. You need both.

3. Apply analytics to determine trends and behavior

Keeping it Simple

Applying analytics can be easier than you expect. In fact, the more simple you keep it, the better.

The following three simple analytical techniques can be used in order to detect anomalies:

1. Bollinger Bands – 2 standard deviations off the mean – low and high. The normal is 2 standard deviations from the mean.

2. Percent of Change – This means comparing sample to sample, day to day or week to week, and calculating the percentage of change.

3. Velocity – Essentially this measures how fast indicators are changing. For example, you might be measuring response time and it drops from 10 to 20 seconds over a five-second interval or (20-10)/5 = 2 units/sec. With this technique, we are expecting a certain amount of change; however, when the amount of change is changing at an abnormal rate, we have most likely detected an anomaly.

Now That You Know ... Act On It

After the analysis, the next activity is to take action. This could be alerts, notification or system actions such as restarting processes or even resubmitting orders. Here, we are connecting the dots between IT and the business and alerting the appropriate owners. 

And In Conclusion

Elastic cloud-based applications can’t be monitored effectively using static models, as these models assume constancy. And the one thing constant about these applications is their volatility. In these environments, what was abnormal yesterday might likely be normal today. As a result, what static models indicate may be wrong. 

However, using a methodology comprised of gathering both business and IT metrics, creating automated base lines and applying analytics to them in real time can produce effective results and predict behavior. 

Albert Mavashev is Chief Technology Officer at Nastel Technologies.

Share this

The Latest

September 23, 2022

In order to properly sort through all monitoring noise and identify true problems, their causes, and to prioritize them for response by the IT team, they have created and built a revolutionary new system using a meta-cognitive model ...

September 22, 2022

As we shift further into a digital-first world, where having a reliable online experience becomes more essential, Site Reliability Engineers remain in-demand among organizations of all sizes ... This diverse set of skills and values can be difficult to interview for. In this blog, we'll get you started with some example questions and processes to find your ideal SRE ...

September 21, 2022

US government agencies are bringing more of their employees back into the office and implementing hybrid work schedules, but federal workers are worried that their agencies' IT architectures aren't built to handle the "new normal." They fear that the reactive, manual methods used by the current systems in dealing with user, IT architecture and application problems will degrade the user experience and negatively affect productivity. In fact, according to a recent survey, many federal employees are concerned that they won't work as effectively back in the office as they did at home ...

September 20, 2022

Users today expect a seamless, uninterrupted experience when interacting with their web and mobile apps. Their expectations have continued to grow in tandem with their appetite for new features and consistent updates. Mobile apps have responded by increasing their release cadence by up to 40%, releasing a new full version of their app every 4-5 days, as determined in this year's SmartBear State of Software Quality | Application Stability Index report ...

September 19, 2022

In this second part of the blog series, we look at how adopting AIOps capabilities can drive business value for an organization ...

September 16, 2022

ITOPS and DevOps is in the midst of a surge of innovation. New devices and new systems are appearing at an unprecedented rate. There are many drivers of this phenomenon, from virtualization and containerization of applications and services to the need for improved security and the proliferation of 5G and IOT devices. The interconnectedness and the interdependencies of these technologies also greatly increase systems complexity and therefore increase the sheer volume of things that need to be integrated, monitored, and maintained ...

September 15, 2022

IT talent acquisition challenges are now heavily influencing technology investment decisions, according to new research from Salesforce's MuleSoft. The 2022 IT Leaders Pulse Report reveals that almost three quarters (73%) of senior IT leaders agree that acquiring IT talent has never been harder, and nearly all (98%) respondents say attracting IT talent influences their organization's technology investment choices ...

September 14, 2022

The findings of the 2022 Observability Forecast offer a detailed view of how this practice is shaping engineering and the technologies of the future. Here are 10 key takeaways from the forecast ...

September 13, 2022

Data professionals are spending 40% of their time evaluating or checking data quality and that poor data quality impacts 26% of their companies' revenue, according to The State of Data Quality 2022, a report commissioned by Monte Carlo and conducted by Wakefield Research ...

September 12, 2022
Statistically speaking, every year, slow-loading websites cost a staggering $2.6 billion in losses to their owners. Also, about 53% of the website visitors on mobile are likely to abandon the site if it takes more than 3 seconds to load. This is the reason why performance testing should be conducted rigorously on a website or web application in the SDLC before deployment ...