Bringing Sanity Back to Performance Monitoring
August 10, 2017

Mehdi Daoudi


Performance monitoring tools have traditionally worked by keeping a constant pulse on internal computer systems and networks for sluggish or failing components, proactively alerting IT administrators to outages, slowdowns and other troubles. Several years ago, this approach was sufficient, enabling IT teams to make direct correlations between problematic datacenter elements and application and site performance (speed, availability) degradations.

As user performance demands have skyrocketed in recent years, organizations have expanded their infrastructures. Ironically, many have found that these build-outs – designed to deliver extremely fast, reliable experiences for users around the world – are actually making this task much harder. The volume of performance monitoring information and alerts creates a confusing cacophony, like being at a party and trying to listen to ten conversations at once.

This kind of environment is a breeding ground for alert fatigue – issues being raised but ignored due to burnout. It's no wonder that user calls continue to be the top way organizations find out about IT-related performance issues, according to EMA. In our view, this is complete insanity in the 21st century. A new approach to managing IT alerts and issues raised through performance monitoring is needed, encompassing the following:

Canvass the Entire Landscape of Performance-Impacting Variables

As noted, IT teams used to get away with monitoring just their internal datacenter elements, but this is no longer the case. IT infrastructures for delivering digital services have quickly evolved into complex apparatuses that include not just on-premises systems and networks but also external third-party infrastructures and services (CDNs, DNS providers, API providers). If any third-party element slows down, it can degrade performance for all dependent websites and applications.

No company, no matter how big or small, is immune to this type of infection – and it requires all external third parties to be included in the monitoring process. This month's Amazon Prime Day provided a case in point. While Amazon did a great job overall, at one point in the day the site search function slowed to 14 seconds – meaning it took site visitors 30 to 40 percent longer than normal to complete a search. This was likely the result of a failing external third-party search function – even though Amazon could support the crushing traffic load, the third-party service wasn't as adept.

Apply Advanced Analytics

At this point you're likely saying, "So you're telling me I need to be monitoring more elements – I thought we were trying to reduce the noise?" You are right – these messages may seem contradictory – but the reality is, organizations cannot afford not to monitor every element impacting the user experience. This is a fact of life as performance monitoring transitions to what Gartner calls digital experience monitoring, where the quality (speed, availability, reachability and reliability) of the user experience is the ultimate metric and takes center stage. If it impacts what your users experience, it must be included in the monitoring strategy – period.

More expansive infrastructures generate mountains of monitoring telemetry, and that data is useless if it is devoid of actionable insights. The key is combining this data with advanced analytics that enable organizations to precisely and accurately identify the root cause, whether it's inside or beyond the firewall. This capability is critical, particularly in DevOps environments where timeframes for implementing needed modifications are dramatically collapsed.

Identify and Prioritize True Hot Spots

It is human nature to conclude that any symptom must have an underlying cause – but that's not always the case, and random events can happen. Just because you sneeze doesn't necessarily mean you have a cold. The same concept applies to enterprise IT: a random, isolated application or site slowdown can occur and it's not necessarily a cause for concern, unless and until a clear pattern emerges – the slowdowns become more frequent or longer in duration, for example.

Given the sheer volume of alerts and potential issues, it's not surprising that many IT teams have gradually become desensitized. Machine learning and artificial intelligence (AI) can cut the number of alerts by distinguishing between isolated anomalies and genuine trends or patterns. Ultimately this can help limit alerts and issue escalations to only those instances where they're really warranted.
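To make the anomaly-versus-pattern distinction concrete, here is a minimal sketch of the idea using a simple "M-of-N" rule rather than full machine learning. The class name, the 2-second threshold, and the window sizes are all illustrative assumptions, not a reference to any particular product:

```python
from collections import deque

class PatternAwareAlerter:
    """Escalate only when slow samples form a pattern, not for one-off blips.

    A response time above `threshold_ms` counts as a breach; we escalate
    only when at least `min_breaches` of the last `window` samples breach.
    """

    def __init__(self, threshold_ms=2000, window=10, min_breaches=6):
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        self.recent = deque(maxlen=window)  # rolling window of breach flags

    def observe(self, response_ms):
        self.recent.append(response_ms > self.threshold_ms)
        return sum(self.recent) >= self.min_breaches  # True means escalate

# One isolated 9-second spike among fast responses: no escalation fires.
alerter = PatternAwareAlerter()
samples = [300, 280, 9000, 310, 290, 305, 295, 300, 310, 290]
print(any(alerter.observe(ms) for ms in samples))  # False
```

Even this toy rule suppresses the single-spike alert that trains teams to ignore their dashboards, while still firing once slowness becomes sustained; real anomaly-detection models apply the same principle with learned baselines instead of fixed thresholds.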

Put AI to Use – But Know Its Limits

In addition to identifying which trends are truly worthy of concern, AI can deliver valuable predictive insights – for example, if performance for a particular server and its resident application keeps degrading, which geographic customer segments will be impacted? How will the business suffer?

AI can help, but we don't believe issue escalation and resolution will ever be a completely hands-off process. A machine can't "learn" to communicate earnestly with customers, nor can it "learn" when the business impact may be tolerable or not, which dictates the appropriate response (i.e., do on-call staffers really need to be called in the middle of the night?). If it's a clear pattern, and the revenue impact is big, the answer is yes. Otherwise, it may just be something that needs to be watched, and can wait until the morning.
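The escalation reasoning above can be sketched as a small policy function. Everything here is a hypothetical illustration – the function name, the action labels, and the dollar threshold are assumptions, not part of any real tool:

```python
def escalation_action(is_clear_pattern: bool,
                      est_revenue_impact_per_hour: float,
                      page_threshold: float = 10_000.0) -> str:
    """Decide how to respond to a detected performance issue.

    Mirrors the reasoning in the text: page on-call staff only for a
    clear pattern with large revenue impact; a clear pattern with small
    impact can wait until morning; an isolated blip just gets watched.
    """
    if is_clear_pattern and est_revenue_impact_per_hour >= page_threshold:
        return "page-on-call"        # wake someone up now
    if is_clear_pattern:
        return "ticket-for-morning"  # real, but tolerable until business hours
    return "watch"                   # isolated anomaly: keep monitoring

print(escalation_action(True, 50_000.0))   # page-on-call
print(escalation_action(False, 50_000.0))  # watch
```

The point of the sketch is that the inputs – is this a clear pattern, and is the business impact tolerable – come from human judgment and business context, which is exactly the part a machine can't fully learn.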

Today, with so many elements to monitor and so much data being generated, performance monitoring initiatives can quickly devolve from a helpful, purposeful mechanism into a vortex of confusion and chaos. As performance monitoring becomes, by necessity, more comprehensive, it requires a more decisive, refined and sophisticated approach to managing alerts and escalating issues. Otherwise, we are in danger of performance monitoring tools controlling us, instead of guiding and serving us, which is their true purpose.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint