Bringing Sanity Back to Performance Monitoring
August 10, 2017

Mehdi Daoudi

Performance monitoring tools have traditionally worked by keeping a constant pulse on internal computer systems and networks for sluggish or failing components, proactively alerting IT administrators to outages, slowdowns and other troubles. Several years ago, this approach was sufficient, enabling IT teams to make direct correlations between problematic datacenter elements and application and site performance (speed, availability) degradations.

As user performance demands have skyrocketed in recent years, organizations have expanded their infrastructures. Ironically, many have found that these build-outs – designed to deliver extremely fast, reliable experiences for users around the world – are actually making this task much harder. The volume of performance monitoring information and alerts creates a confusing cacophony, like being at a party and trying to listen to ten conversations at once.

This kind of environment is ripe for alert fatigue – issues being raised but ignored due to burnout. It's no wonder that user calls continue to be the top way organizations find out about IT-related performance issues, according to EMA. In our view, this is complete insanity in the 21st century. A new approach to managing IT alerts and issues raised through performance monitoring is needed, encompassing the following:

Canvas the Entire Landscape of Performance-Impacting Variables

As noted, IT teams used to be able to get away with monitoring just their internal datacenter elements, but this is no longer the case. IT infrastructures for delivering digital services have quickly evolved into complex apparatuses that include not just on-premises systems and networks but also external third-party infrastructures and services (like CDNs, DNS providers and API providers). If any third-party element slows down, it can degrade performance for all dependent websites and applications.

No company, no matter how big or small, is immune to this type of infection – and it requires all external third parties to be included in the monitoring process. This month's Amazon Prime Day provided a case in point. While Amazon did a great job overall, at one point in the day the site search function slowed to 14 seconds – meaning it took site visitors 30 to 40 percent longer than normal to complete a search. This was likely the result of a failing external third-party search function – even though Amazon could support the crushing traffic load, the third-party service wasn't as adept.

Apply Advanced Analytics

At this point you're likely saying, "So you're telling me I need to be monitoring more elements – I thought we were trying to reduce the noise?" You are right – these messages may seem contradictory – but the reality is, organizations cannot afford not to monitor every element that impacts the user experience. This is a fact of life as performance monitoring transitions to what Gartner calls digital experience monitoring, where the quality (speed, availability, reachability and reliability) of the user experience is the ultimate metric and takes center stage. If it impacts what your users experience, it must be included in the monitoring strategy – period.

More expansive infrastructures, and the mountains of monitoring telemetry data they generate, are useless if they don't yield actionable insights. The key is combining this data with advanced analytics that enable organizations to accurately identify the root cause of a problem, whether it's inside or beyond the firewall. This capability is critical, particularly in DevOps environments where timeframes for implementing needed modifications are dramatically collapsed.

Identify and Prioritize True Hot Spots

It is human nature to conclude that any symptom must have an underlying cause – but that's not always the case, and random events can happen. Just because you sneeze doesn't necessarily mean you have a cold. The same concept applies to enterprise IT: a random, isolated application or site slowdown can occur, and it's not necessarily a cause for concern unless a clear pattern emerges – the slowdowns become more frequent or longer in duration, for example.

Given the sheer volume of alerts and potential issues, it's not surprising that many IT teams have gradually become desensitized. Machine learning and artificial intelligence (AI) can cut the number of alerts by distinguishing between isolated anomalies and trends or patterns. Ultimately this can help limit alerts and issue escalations to those instances where they're really warranted.
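
To make the distinction between an isolated anomaly and a pattern concrete, here is a minimal Python sketch. It is a plain sliding-window rule, not machine learning, and it is not how any particular monitoring product works. The class name `SlowdownFilter`, the 2-second threshold, the window size and the breach count are all hypothetical values chosen for illustration.

```python
from collections import deque


class SlowdownFilter:
    """Escalate only when slow samples recur within a sliding window.

    All defaults are illustrative assumptions, not recommended values.
    """

    def __init__(self, threshold_ms=2000.0, window=10, min_breaches=3):
        self.threshold_ms = threshold_ms    # assumed cutoff for a "slow" response
        self.min_breaches = min_breaches    # slow samples needed to call it a pattern
        self.recent = deque(maxlen=window)  # True where a recent sample was slow

    def observe(self, response_ms):
        """Record one response time and return 'ok', 'watch' or 'escalate'."""
        self.recent.append(response_ms > self.threshold_ms)
        if sum(self.recent) >= self.min_breaches:
            return "escalate"               # repeated slowdowns: a pattern
        if self.recent[-1]:
            return "watch"                  # a single slow sample: the "sneeze"
        return "ok"


# Usage: one isolated spike is only watched; a cluster of slow samples escalates.
f = SlowdownFilter(threshold_ms=2000, window=5, min_breaches=3)
samples = [500, 3000, 600, 700, 2500, 2600, 2700]
print([f.observe(ms) for ms in samples])
```

A real system would likely weigh duration and frequency more carefully, but even a rule this simple keeps a lone spike from paging anyone.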

Put AI to Use – But Know Its Limits

In addition to identifying which trends are truly worthy of concern, AI can deliver valuable predictive insights – for example, if performance for this particular server and resident application keeps degrading, which geographic customer segments will be impacted? How will business suffer?

AI can help, but we don't believe issue escalation and resolution will ever be a completely hands-off process. A machine can't "learn" to communicate earnestly with customers, nor can it "learn" whether the business impact is tolerable, which dictates the appropriate response (for example, do on-call staffers really need to be called in the middle of the night?). If it's a clear pattern, and the revenue impact is big, the answer is yes. Otherwise, it may just be something that needs to be watched, and can wait until the morning.
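
The escalation rule described above can be written down as a short Python sketch. The function name, the revenue estimate input and the `page_threshold` default are hypothetical; what counts as a "big" revenue impact is exactly the kind of judgment the business, not the code, has to supply.

```python
def escalation_action(is_clear_pattern, est_revenue_loss_per_hour,
                      page_threshold=10_000.0):
    """Decide whether to wake the on-call staff or wait until morning.

    page_threshold is an assumed, business-defined figure (currency per hour).
    """
    if is_clear_pattern and est_revenue_loss_per_hour >= page_threshold:
        return "page_on_call"       # clear pattern and big revenue impact
    return "review_in_morning"      # keep watching; no night-time call


# Usage: only the combination of a pattern and a large impact pages someone.
print(escalation_action(True, 25_000.0))
print(escalation_action(True, 500.0))
print(escalation_action(False, 25_000.0))
```

Keeping the threshold as an explicit parameter leaves the tolerance decision in human hands, which is the point of the paragraph above.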

Today, with so many elements to monitor and so much data being generated, performance monitoring initiatives can quickly devolve from a helpful, purposeful mechanism into a vortex of confusion and chaos. As performance monitoring becomes, by necessity, more comprehensive, it requires a more decisive, refined and sophisticated approach to managing alerts and escalating issues. Otherwise, we are in danger of performance monitoring tools controlling us, instead of guiding and serving us – their true intended purpose.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint