Alerting Survival Strategies
July 24, 2014

Larry Haig
Intechnica

Share this

(aka – “If that monitoring system wakes me up at 3am one more time … !”)

In considering alerting, the core issue is not whether a given tool will generate alerts, as anything sensible certainly will. Rather, the central problem is what could be termed the actionability of the alerts generated. Failure to flag issues related to poor performance is a clear no-no, but unfortunately over-alerting has the same effect, as these will rapidly be ignored.

Effective alert definition hinges on the determination of “normal” performance. Simplistically, this can be understood by testing across a business cycle (ideally, a minimum of 3-4 weeks). That is fine providing performance is reasonably stable. However, that is often not the case, particularly for applications experiencing large fluctuations in demand at different times of the day, week or year.

In such cases (which are extremely common), the difficulty becomes “at which point of the demand cycle should I base my alert threshold?” Too low, and your system is simply telling you that it’s lunchtime (or the weekend, or whenever greatest demand occurs). Too high, and you will miss issues occurring during periods of lower demand.

There are several approaches to this difficulty, of varying degrees of elegance:

■ Select tooling incorporating a sophisticated baseline algorithm - capable of applying alert thresholds dynamically based on time of day/week/month etc. Surprisingly, many major tools use extremely simplistic baseline models, but some (e.g. App Dynamics APM) certainly have an approach that assists. When selecting tooling, this is definitely an area that repays investigation.

■ Set up independent parallel (active monitoring) tests separated by “maintenance windows”, with different alert thresholds applied depending upon when they are run. This is a messy approach which comes with its own problems.

■ Look for proxies other than pure performance as alert metrics. Using this approach, a “catchall” performance threshold is set for performance that is manifestly poor regardless of when it is generated. This is supplemented by alerting based upon other factors flagging delivery issues – always providing that your monitoring system permits these. Examples include:

- Payload – error pages or partial downloads will have lower byte counts. Redirect failures (e.g. to mobile devices) will have higher than expected page weights.

- Number of objects

- Specific “flag” objects

■ Ensure confirmation before triggering alert. Some tooling will automatically generate confirmatory repeat testing; others enable triggers to be based on a specified number or percentage of total node results.

■ Gotchas – take account of these. Good test design, for example by controlling the bandwidth of end user testing to screen out results based on low connectivity tests, will improve the reliability of both alerts and results generally. As a more recent innovation, the advent of long polling / server push content can be extremely distortive of synthetic external responses, especially if not consistently included. In this case, page load end points need to be defined and incorporated into test scripts to prevent false positive alerts.

RUM based alerting presents its own difficulties. Because it is visitor traffic based, alert triggers based on a certain percentage of outliers may become distorted in very low traffic conditions. For example, a single long delivery time in a 10 minute timeslot where there are only 4 other “normal” visits would represent 20% of total traffic, whereas the same outlier recorded during a peak business period with 200 normal results is less than 1% of the total. RUM tooling that enables alert thresholds to be modified based on traffic are advantageous.

Although it does not address the “normal variation” issue, replacing binary trigger thresholds with dynamic ones (i.e. an alert state exists when the page/transaction slows by more than x% compared to its average over the past) can sometimes be useful.

Some form of trend state messaging (that is, condition worsening/improving) subsequent to initial alerting can serve to mitigate the amount of physical and emotional energy invoked by simple “fire alarm” alerting, particularly in the middle of the night.

An interesting (and long overdue) approach is to work directly on the source of the problem – download raw baseline data to a data warehouse, and apply sophisticated pattern recognition analysis. These algorithms can be developed in-house if time and appropriate skills are available, but unfortunately the mathematics is not necessarily trivial. Some standalone tooling exists and it is expected that more will follow as this approach proves its worth – the baseline management of most APM vendors represents an open goal at present.

Incidentally, such analysis is valuable not only for alerting but also for demand projection and capacity planning.

A few final thoughts on alerts post-generation. The more evolved alert management systems will permit conditional escalation of alerts – that is: alert this primary group first, then inform group B if the condition persists/worsens etc. Systems allowing custom coding around alerts (such as Neustar) are useful here, as are the specific third party alert handling systems available. If using tooling that only permits basic alerting, it is worth considering integration with external alerting, either of the “standalone service” type, or (in larger corporates) integral with central infrastructure management software.

Lastly, delivery mode. Email is the basis for many systems. It is tempting to regard SMS texting as beneficial, particularly in extreme cases. However, as anyone who has been sent a text on New Year’s Eve, only to have it show up 12 hours later knows, such store and forward systems can be false friends.

Larry Haig is Senior Consultant at Intechnica.

Share this

The Latest

October 18, 2021

Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years ...

October 14, 2021

Businesses are embracing artificial intelligence (AI) technologies to improve network performance and security, according to a new State of AIOps Study, conducted by ZK Research and Masergy ...

October 13, 2021

What may have appeared to be a stopgap solution in the spring of 2020 is now clearly our new workplace reality: It's impossible to walk back so many of the developments in workflow we've seen since then. The question is no longer when we'll all get back to the office, but how the companies that are lagging in their technological ability to facilitate remote work can catch up ...

October 12, 2021

The pandemic accelerated organizations' journey to the cloud to enable agile, on-demand, flexible access to resources, helping them align with a digital business's dynamic needs. We heard from many of our customers at the start of lockdown last year, saying they had to shift to a remote work environment, seemingly overnight, and this effort was heavily cloud-reliant. However, blindly forging ahead can backfire ...

October 07, 2021

SmartBear recently released the results of its 2021 State of Software Quality | Testing survey. I doubt you'll be surprised to hear that a "lack of time" was reported as the number one challenge to doing more testing, especially as release frequencies continue to increase. However, it was disheartening to see that a lack of time was also the number one response when we asked people to identify the biggest blocker to professional development ...

October 06, 2021

The role of the CIO is evolving with an increased focus on unlocking customer connections through service innovation, according to the 2021 Global CIO Survey. The study reveals the shift in the role of the CIO with the majority of CIO respondents stating innovation, operational efficiency, and customer experience as their top priorities ...

October 05, 2021

The perception of IT support has dramatically improved thanks to the successful response of service desks to the pandemic, lockdowns and working from home, according to new research from the Service Desk Institute (SDI), sponsored by Sunrise Software ...

October 04, 2021

Is your company trying to use artificial intelligence (AI) for business purposes like sales and marketing, finance or customer experience? If not, why not? If so, has it struggled to start AI projects and get them to work effectively? ...

September 30, 2021

As remote work persists, and organizations take advantage of hire-from-anywhere models — in addition to facing other challenges like extreme weather events — companies across industries are continuing to re-evaluate the effectiveness of their tech stack. Today's increasingly distributed workforce has put a much greater emphasis on network availability across more endpoints as well as increased the bandwidth required for voice and video. For many, this has posed the question of whether to switch to a new network monitoring system ...

September 29, 2021

When a website or app fails or falters, the standard operating procedure is to assemble a sizable team to quickly "divide and conquer" to find a solution. The details of the problem can usually be found somewhere among millions of log events and metrics, leading to slow and painstaking searches that can take hours and often involve handoffs between experts in different areas of the software. The immediate goal in these situations is not to be comprehensive, but rather to troubleshoot until you find a solution that remedies the symptom, even if the underlying root cause is not addressed ...