Bringing Sanity Back to Performance Monitoring
August 10, 2017

Mehdi Daoudi


Performance monitoring tools have traditionally worked by keeping a constant pulse on internal systems and networks, watching for sluggish or failing components and proactively alerting IT administrators to outages, slowdowns and other trouble. Several years ago this approach was sufficient: IT teams could draw direct correlations between problematic datacenter elements and degradations in application and site performance (speed, availability).

As user performance demands have skyrocketed in recent years, organizations have expanded their infrastructures. Ironically, many have found that these build-outs – designed to deliver extremely fast, reliable experiences for users around the world – are actually making this task much harder. The volume of performance monitoring information and alerts creates a confusing cacophony, like being at a party and trying to listen to ten conversations at once.

This kind of environment is ripe for alert fatigue – issues being raised but ignored due to burnout. It's no wonder that user calls continue to be the top way organizations find out about IT-related performance issues, according to EMA. In our view, this is complete insanity in the 21st century. A new approach to managing IT alerts and issues raised through performance monitoring is needed, encompassing the following:

Canvass the Entire Landscape of Performance-Impacting Variables

As noted, it used to be that IT teams could get away with monitoring just their internal datacenter elements, but this is no longer the case. IT infrastructures for delivering digital services have quickly evolved into complex apparatuses including not just on-premises systems and networks but also external third-party infrastructure and services (CDNs, DNS providers, API providers). If any third-party element slows down, it can degrade performance for all dependent websites and applications.
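
As a simple illustration of folding third parties into the monitoring process, the sketch below flags any external dependency whose measured latency exceeds a per-service budget. The service names and budget values here are invented for the example, not taken from any real monitoring product.

```python
# Hypothetical sketch: flag third-party dependencies whose measured
# latency exceeds a per-service budget. Names and budgets are
# illustrative assumptions, not real thresholds.

BUDGETS_MS = {"cdn": 100, "dns": 50, "search_api": 800}

def degraded_dependencies(measurements_ms, budgets_ms=BUDGETS_MS):
    """Return the third-party services whose latency is over budget."""
    return sorted(
        name for name, latency in measurements_ms.items()
        if latency > budgets_ms.get(name, float("inf"))
    )

# A 14-second search response, as in the Prime Day example, would be flagged:
print(degraded_dependencies({"cdn": 80, "dns": 30, "search_api": 14000}))
# → ['search_api']
```

In practice the measurements would come from synthetic checks run against each provider, but the budget-comparison step is the part that turns raw telemetry into a short list of suspects.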

No company, big or small, is immune to this type of degradation – which is why all external third parties must be included in the monitoring process. This year's Amazon Prime Day provided a case in point. While Amazon did a great job overall, at one point in the day the site search function slowed to 14 seconds – meaning it took site visitors 30 to 40 percent longer than normal to complete a search. This was likely the result of a failing external third-party search function: even though Amazon could support the crushing traffic load, the third-party service wasn't as adept.

Apply Advanced Analytics

At this point you're likely saying, "So you're telling me I need to be monitoring more elements – I thought we were trying to reduce the noise?" You are right – these messages may seem contradictory – but the reality is, organizations cannot afford to leave unmonitored any element that impacts the user experience. This is a fact of life as performance monitoring transitions to what Gartner calls digital experience monitoring, where the quality (speed, availability, reachability and reliability) of the user experience is the ultimate metric and takes center stage. If it impacts what your users experience, it must be included in the monitoring strategy – period.

More expansive infrastructures, and the mountains of monitoring telemetry data they generate, are useless if they are devoid of actionable insights. The key is combining this data with advanced analytics that enable organizations to precisely and accurately identify the root cause, whether it lies inside or beyond the firewall. This capability is critical, particularly in DevOps environments where timeframes for implementing needed modifications are dramatically compressed.
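
One basic form such analytics can take is correlating a user-facing metric against candidate component metrics to shortlist root-cause suspects. The sketch below uses a plain Pearson correlation as a first cut; the metric names and sample values are hypothetical, and real products use far more sophisticated methods.

```python
# Illustrative sketch (not any vendor's algorithm): rank candidate
# components by how strongly their metric tracks a user-facing
# latency series, as a first pass at root-cause analysis.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 for a flat series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)

def rank_suspects(user_latency, component_metrics):
    """Sort components by |correlation| with the user-facing metric."""
    return sorted(
        component_metrics,
        key=lambda name: abs(pearson(user_latency, component_metrics[name])),
        reverse=True,
    )

user = [120, 130, 400, 410, 125, 420]              # page load time, ms
metrics = {
    "app_server_cpu": [41, 41, 41, 41, 41, 41],    # flat: unrelated
    "dns_resolve_ms": [20, 22, 180, 190, 21, 200], # tracks the spikes
}
print(rank_suspects(user, metrics))
# → ['dns_resolve_ms', 'app_server_cpu']
```

Correlation alone cannot prove causation, of course, but narrowing dozens of components down to one or two strong candidates is exactly the kind of noise reduction the analytics layer should provide.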

Identify and Prioritize True Hot Spots

It is human nature to conclude that any symptom must have an underlying cause – but that's not always the case, and random events do happen. Just because you sneeze doesn't necessarily mean you have a cold. The same concept applies to enterprise IT: a random, isolated application or site slowdown can occur and isn't necessarily a cause for concern, unless and until a clear pattern emerges – the slowdowns become more frequent or longer in duration, for example.

Given the sheer volume of alerts and potential issues, it's not surprising that many IT teams have gradually become desensitized. Machine learning and artificial intelligence (AI) can reduce the number of alerts by distinguishing between isolated anomalies and genuine trends or patterns. Ultimately this helps keep alerts and issue escalations limited to those instances where they're really warranted.
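
A minimal version of this idea, with no machine learning at all, is a k-of-n rule: alert only when a threshold is breached in several of the last few samples, so an isolated spike stays silent while a pattern fires. The threshold and window sizes below are illustrative assumptions.

```python
# Hedged sketch: suppress one-off spikes by alerting only when a
# latency threshold is breached in at least min_breaches of the
# last `window` samples. All parameter values are illustrative.
from collections import deque

def sustained_breach(samples_ms, threshold_ms=1000, window=5, min_breaches=3):
    """True once a clear pattern emerges: k-of-n recent samples are slow."""
    recent = deque(maxlen=window)
    for sample in samples_ms:
        recent.append(sample > threshold_ms)
        if sum(recent) >= min_breaches:
            return True
    return False

print(sustained_breach([200, 1500, 300, 250, 220]))          # one spike → False
print(sustained_breach([200, 1500, 1600, 300, 1700, 1800]))  # pattern  → True
```

An ML-based detector replaces the fixed threshold and window with learned baselines, but the goal is the same: separate the sneeze from the cold.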

Put AI to Use – But Know Its Limits

In addition to identifying which trends are truly worthy of concern, AI can deliver valuable predictive insights – for example, if performance for a particular server and its resident application keeps degrading, which geographic customer segments will be impacted? How will the business suffer?

AI can help, but we don't believe issue escalation and resolution will ever be a completely hands-off process. A machine can't "learn" to communicate earnestly with customers, nor can it "learn" when the business impact may be tolerable or not, which dictates the appropriate response (i.e., do on-call staffers really need to be called in the middle of the night?). If it's a clear pattern, and the revenue impact is big, the answer is yes. Otherwise, it may just be something that needs to be watched, and can wait until the morning.
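
That heuristic – page only for a clear pattern with a big revenue impact, otherwise watch and wait – can be stated directly in code. The sketch below is just one possible policy; the revenue threshold and action names are invented for the example.

```python
# Illustrative escalation policy mirroring the heuristic above.
# The $10,000/hour paging threshold is an invented assumption.

def escalation(is_clear_pattern, est_revenue_loss_per_hour, page_threshold=10_000):
    """Decide whether an incident warrants waking the on-call staff."""
    if is_clear_pattern and est_revenue_loss_per_hour >= page_threshold:
        return "page-on-call"
    if is_clear_pattern:
        return "watch-and-review-in-morning"
    return "log-only"

print(escalation(True, 50_000))   # clear pattern, big impact  → page-on-call
print(escalation(True, 500))      # clear pattern, small impact → watch-and-review-in-morning
print(escalation(False, 50_000))  # isolated anomaly            → log-only
```

The hard part, of course, is the inputs: detecting the pattern and estimating the business impact are where the analytics and the human judgment live.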

Today, with so many elements to monitor and so much data being generated, performance monitoring initiatives can quickly devolve from a helpful, purposeful mechanism to a vortex of confusion and chaos. As performance monitoring becomes, by necessity, more comprehensive, it requires a more decisive, refined and sophisticated approach to managing alerts and escalating issues. Otherwise, we are in danger of performance monitoring tools controlling us, instead of guiding and serving us – their true intended purpose.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint