Bringing Sanity Back to Performance Monitoring
August 10, 2017

Mehdi Daoudi

Performance monitoring tools have traditionally worked by keeping a constant pulse on internal computer systems and networks for sluggish or failing components, proactively alerting IT administrators to outages, slowdowns and other troubles. Several years ago, this approach was sufficient, enabling IT teams to make direct correlations between problematic datacenter elements and application and site performance (speed, availability) degradations.

As user performance demands have skyrocketed in recent years, organizations have expanded their infrastructures. Ironically, many have found that these build-outs – designed to deliver extremely fast, reliable experiences for users around the world – are actually making this task much harder. The volume of performance monitoring information and alerts creates a confusing cacophony, like being at a party and trying to listen to ten conversations at once.

This kind of environment is ripe for alert fatigue – issues being raised but ignored due to burnout. It's no wonder that user calls continue to be the top way organizations find out about IT-related performance issues, according to EMA. In our view, this is complete insanity in the 21st century. A new approach to managing the alerts and issues raised through performance monitoring is needed, encompassing the following:

Canvas the Entire Landscape of Performance-Impacting Variables

As noted, it used to be that IT teams could get away with monitoring just their internal datacenter elements, but this is no longer the case. IT infrastructures for delivering digital services have quickly evolved into complex apparatuses that include not just on-premises systems and networks but also external third-party infrastructures and services (like CDNs, DNS providers and API providers). If any third-party element slows down, it can degrade performance for all dependent websites and applications.

No company, no matter how big or small, is immune to this type of infection – and it requires all external third parties to be included in the monitoring process. This month's Amazon Prime Day provided a case in point. While Amazon did a great job overall, at one point in the day the site search function slowed to 14 seconds – meaning it took site visitors 30 to 40 percent longer than normal to complete a search. This was likely the result of a failing external third-party search function – even though Amazon could support the crushing traffic load, the third-party service wasn't as adept.

Apply Advanced Analytics

At this point you're likely saying, "So you're telling me I need to be monitoring more elements – I thought we were trying to reduce the noise?" You are right – these messages may seem contradictory – but the reality is that organizations cannot afford not to monitor every element impacting the user experience. This is a fact of life as performance monitoring transitions to what Gartner calls digital experience monitoring, where the quality (speed, availability, reachability and reliability) of the user experience is the ultimate metric and takes center stage. If it impacts what your users experience, it must be included in the monitoring strategy – period.

More expansive infrastructures, and the mountains of monitoring telemetry data they generate, are useless if they yield no actionable insights. The key is combining this data with advanced analytics that enable organizations to precisely and accurately identify the root cause of a problem, whether it's inside or beyond the firewall. This capability is critical, particularly in DevOps environments where timeframes for implementing needed modifications are dramatically collapsed.

Identify and Prioritize True Hot Spots

It is human nature to conclude that any symptom must have an underlying cause – but that's not always the case, and random events can happen. Just because you sneeze doesn't necessarily mean you have a cold. The same concept applies to enterprise IT: a random, isolated application or site slowdown can occur, and it's not necessarily a cause for concern unless a clear pattern emerges – the slowdowns become more frequent or longer in duration, for example.

Given the sheer volume of alerts and potential issues, it's not surprising that many IT teams have gradually become desensitized. Machine learning and artificial intelligence (AI) can reduce the number of alerts by distinguishing between isolated anomalies and trends or patterns. Ultimately this can help limit alerts and issue escalations to those instances where they're really warranted.
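To make the distinction between an isolated anomaly and a pattern concrete, here is a minimal sketch in Python. It is not a machine learning model, just a simple sliding-window rule standing in for one, and the function name, threshold, window size and sample data are all assumptions chosen for illustration: an alert fires only when several of the most recent response-time samples are slow, so a single "sneeze" is ignored.

```python
from collections import deque


def sustained_slowdowns(samples_ms, threshold_ms=2000, window=5, min_breaches=3):
    """Return the indexes of samples at which an alert would fire.

    An alert fires only when at least `min_breaches` of the last `window`
    samples exceed `threshold_ms`, so one isolated slow sample is ignored.
    All default values are illustrative assumptions, not recommendations.
    """
    recent = deque(maxlen=window)  # rolling record of slow/not-slow flags
    alerts = []
    for i, value in enumerate(samples_ms):
        recent.append(value > threshold_ms)
        if sum(recent) >= min_breaches:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds.
isolated = [800, 900, 5000, 850, 900, 870, 910]   # one random spike
pattern = [800, 2500, 900, 2600, 2700, 3100, 900]  # slowdowns recurring

print(sustained_slowdowns(isolated))
print(sustained_slowdowns(pattern))
```

With these sample values, the single spike in `isolated` never triggers an alert, while `pattern` starts alerting once the third slow sample arrives inside the window.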

Put AI to Use – But Know Its Limits

In addition to identifying the true trends worthy of concern, AI can deliver valuable predictive insights – for example, if performance for this particular server and resident application keeps degrading, which geographic customer segments will be impacted? How will business suffer?

AI can help, but we don't believe issue escalation and resolution will ever be a completely hands-off process. A machine can't "learn" to communicate earnestly with customers, nor can it "learn" whether the business impact is tolerable, which dictates the appropriate response (for example, do on-call staffers really need to be called in the middle of the night?). If it's a clear pattern, and the revenue impact is big, the answer is yes. Otherwise, it may just be something that needs to be watched, and can wait until the morning.
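The escalation rule described above can be sketched in a few lines of Python. The function name, the revenue estimate and the paging threshold are hypothetical; in practice, what counts as a "big" revenue impact is a business judgment that people, not the tooling, would have to set.

```python
def escalation_action(is_clear_pattern, est_revenue_loss_per_hour,
                      page_threshold=10_000.0):
    """Decide whether to wake on-call staff or wait until morning.

    `page_threshold` is an assumed, human-chosen figure for what the
    business considers an intolerable hourly revenue impact.
    """
    if is_clear_pattern and est_revenue_loss_per_hour >= page_threshold:
        return "page_on_call"
    return "watch_until_morning"


# A clear pattern with a large impact pages someone.
print(escalation_action(True, 25_000.0))
# An isolated slowdown is only watched, whatever its estimated cost.
print(escalation_action(False, 25_000.0))
# A clear pattern with a small impact can also wait.
print(escalation_action(True, 500.0))
```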

Today, with so many elements to monitor and so much data being generated, performance monitoring initiatives can quickly devolve from a helpful, purposeful mechanism to a vortex of confusion and chaos. As performance monitoring becomes, by necessity, more comprehensive, it requires a more decisive, refined and sophisticated approach to managing alerts and escalating issues. Otherwise, we are in danger of performance monitoring tools controlling us, instead of guiding and serving us – their true intended purpose.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint