Bringing Sanity Back to Performance Monitoring

Mehdi Daoudi

Performance monitoring tools have traditionally worked by keeping a constant watch on internal computer systems and networks for sluggish or failing components, proactively alerting IT administrators to outages, slowdowns and other trouble. Several years ago this approach was sufficient, enabling IT teams to draw direct correlations between problematic datacenter elements and degradations in application and site performance (speed, availability).

As user performance demands have skyrocketed in recent years, organizations have expanded their infrastructures. Ironically, many have found that these build-outs – designed to deliver extremely fast, reliable experiences for users around the world – are actually making performance monitoring itself much harder. The volume of monitoring information and alerts creates a confusing cacophony, like being at a party and trying to listen to ten conversations at once.

This kind of environment is a prime breeding ground for alert fatigue – alerts being raised but ignored because responders are burned out. It's no wonder that user calls continue to be the top way organizations find out about IT-related performance issues, according to EMA. In our view, that is complete insanity in the 21st century. A new approach to managing the IT alerts and issues raised through performance monitoring is needed, one that encompasses the following:

Canvass the Entire Landscape of Performance-Impacting Variables

As noted, IT teams used to be able to get away with monitoring just their internal datacenter elements, but this is no longer the case. IT infrastructures for delivering digital services have quickly evolved into complex apparatuses that include not just on-premises systems and networks but also external third-party infrastructures and services (CDNs, DNS providers, API providers and the like). If any third-party element slows down, it can degrade performance for every dependent website and application.

No company, no matter how big or small, is immune to this type of infection – which is why all external third parties must be included in the monitoring process. This month's Amazon Prime Day provided a case in point. While Amazon did a great job overall, at one point in the day the site search function slowed to 14 seconds – meaning it took site visitors 30 to 40 percent longer than normal to complete a search. This was likely the result of a struggling external third-party search function – even though Amazon itself could support the crushing traffic load, the third-party service wasn't as adept.
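
To make this concrete, here is a minimal sketch, in Python, of the kind of synthetic check that times a few external dependencies alongside your own endpoint. The URLs, component names and the two-second threshold are illustrative assumptions, not a reference to any particular monitoring product:

# Minimal sketch of a synthetic check for third-party dependencies.
# All endpoint URLs and the latency threshold are illustrative placeholders.
import time
import urllib.request

ENDPOINTS = {
    "own-site": "https://www.example.com/",                # your own front end
    "cdn-asset": "https://cdn.example-cdn.net/app.js",     # an asset served by a CDN
    "search-api": "https://api.example-search.com/ping",   # a third-party API
}

SLOW_THRESHOLD_S = 2.0  # anything slower than this gets flagged

def check(name, url, timeout=10):
    """Time a single request and report availability plus latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except Exception:
        ok = False
    elapsed = time.monotonic() - start
    return {"name": name, "ok": ok, "seconds": round(elapsed, 3)}

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        result = check(name, url)
        if not result["ok"] or result["seconds"] > SLOW_THRESHOLD_S:
            print("ALERT:", result)  # a real setup would page or record this
        else:
            print("ok:", result)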

Apply Advanced Analytics

At this point you're likely saying, "So you're telling me I need to monitor more elements – I thought we were trying to reduce the noise?" You are right – the two messages may seem contradictory – but the reality is that organizations cannot afford not to monitor every element that impacts the user experience. This is a fact of life as performance monitoring transitions to what Gartner calls digital experience monitoring, where the quality (speed, availability, reachability and reliability) of the user experience is the ultimate metric and takes center stage. If it impacts what your users experience, it must be included in the monitoring strategy – period.

More expansive infrastructures, and the mountains of monitoring telemetry they generate, are useless if they are devoid of useful, actionable insights. The key is combining this data with advanced analytics that enable organizations to identify the root cause precisely and accurately, whether it lies inside or beyond the firewall. This capability is critical, particularly in DevOps environments where the timeframes for implementing needed changes have been dramatically compressed.
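
As a toy illustration of what combining this data with analytics can mean at its simplest (not a description of any vendor's analytics engine), the sketch below correlates a page-load time series with the latency series of a few hypothetical components to rank likely suspects; the component names and numbers are made up:

# Sketch of a very simple root-cause ranking: correlate page-load times with
# the latency of each candidate component, internal or third-party, and surface
# the strongest suspects. Requires Python 3.10+ for statistics.correlation.
from statistics import correlation

page_load_s = [1.1, 1.2, 1.1, 2.8, 3.1, 2.9, 1.2, 1.1]  # observed page-load times

component_latency_s = {
    "app-server": [0.20, 0.21, 0.19, 0.22, 0.20, 0.21, 0.20, 0.19],
    "cdn":        [0.05, 0.05, 0.06, 0.05, 0.06, 0.05, 0.05, 0.05],
    "search-api": [0.30, 0.32, 0.31, 1.90, 2.10, 2.00, 0.33, 0.31],  # third party
}

suspects = sorted(
    ((correlation(page_load_s, latencies), name)
     for name, latencies in component_latency_s.items()),
    reverse=True,
)

for score, name in suspects:
    print(f"{name}: correlation with page load = {score:.2f}")
# In this made-up data, "search-api" correlates most strongly with the slowdown,
# pointing beyond the firewall.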

Identify and Prioritize True Hot Spots

It is human nature to conclude that every symptom must have an underlying cause – but that's not always the case; random events happen. Just because you sneeze doesn't necessarily mean you have a cold. The same concept applies to enterprise IT: a random, isolated application or site slowdown can occur without being a cause for concern, unless and until a clear pattern emerges – the slowdowns become more frequent or longer in duration, for example.

Given the sheer volume of alerts and potential issues, it's not surprising that many IT teams have gradually become desensitized. Machine learning and artificial intelligence (AI) can reduce the number of alerts by distinguishing between isolated anomalies and genuine trends or patterns. Ultimately this helps keep alerts and issue escalations limited to the instances where they're really warranted.
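
One minimal way to encode that distinction (a sketch only; the baseline window, sigma threshold and streak length are assumptions, not any specific product's algorithm) is to escalate only when anomalous samples form a streak against a rolling baseline:

# Illustrative sketch: escalate only when a slowdown forms a pattern, not on a
# single anomalous sample. Window size, sigma and streak length are assumptions.
from collections import deque
from statistics import mean, stdev

class PatternDetector:
    def __init__(self, baseline_size=50, sigma=3.0, streak_needed=5):
        self.baseline = deque(maxlen=baseline_size)  # rolling history of recent samples
        self.sigma = sigma                           # how far from normal counts as anomalous
        self.streak_needed = streak_needed           # anomalies in a row before escalating
        self.streak = 0

    def observe(self, response_time):
        """Return 'ok', 'anomaly' (just watch), or 'escalate' (wake someone up)."""
        verdict = "ok"
        if len(self.baseline) >= 10:  # wait for a usable baseline
            mu, sd = mean(self.baseline), stdev(self.baseline)
            if response_time > mu + self.sigma * max(sd, 1e-9):
                self.streak += 1
                verdict = "escalate" if self.streak >= self.streak_needed else "anomaly"
            else:
                self.streak = 0
        self.baseline.append(response_time)
        return verdict

With these assumed defaults, one stray spike is merely flagged as an anomaly to watch, while only a run of five consecutive slow samples triggers an escalation.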

Put AI to Use – But Know Its Limits

In addition to identifying which trends are truly worthy of concern, AI can deliver valuable predictive insights – for example, if performance for a particular server and its resident application keeps degrading, which geographic customer segments will be impacted? How will the business suffer?

AI can help, but we don't believe issue escalation and resolution will ever be a completely hands-off process. A machine can't "learn" to communicate earnestly with customers, nor can it "learn" whether the business impact is tolerable, which dictates the appropriate response (e.g., do on-call staffers really need to be woken in the middle of the night?). If there's a clear pattern and the revenue impact is big, the answer is yes. Otherwise, it may just be something that needs to be watched, and can wait until morning.
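
If you wanted to capture that judgment in policy code, a sketch might look like the following; the thresholds and the notion of an estimated revenue impact are assumptions for illustration, and a human still owns the final call:

# Sketch of the escalation judgment described above. The dollar threshold and
# the estimated revenue impact are illustrative assumptions.
def escalation_decision(is_clear_pattern: bool,
                        estimated_revenue_impact_per_hour: float,
                        business_hours: bool) -> str:
    BIG_IMPACT = 10_000.0  # illustrative dollars-per-hour threshold

    if is_clear_pattern and estimated_revenue_impact_per_hour >= BIG_IMPACT:
        return "page on-call now"  # clear pattern plus big impact: wake someone up
    if is_clear_pattern:
        return "open a ticket, review in the morning"
    if business_hours:
        return "watch: keep it on the monitoring dashboard"
    return "log only"  # isolated blip off-hours: no action needed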

Today, with so many elements to monitor and so much data being generated, performance monitoring initiatives can quickly devolve from a helpful, purposeful mechanism into a vortex of confusion and chaos. As performance monitoring becomes, by necessity, more comprehensive, it requires a more decisive, refined and sophisticated approach to managing alerts and escalating issues. Otherwise, we are in danger of performance monitoring tools controlling us, instead of guiding and serving us – their true intended purpose.

