How to Detect (and Resolve) IT Ops/APM Issues Before Your Users Do
September 19, 2014

Kevin Conklin
Ipswitch


Among the most embarrassing situations for application support teams is first hearing about a critical performance issue from their users. With technology getting increasingly complex and IT environments changing almost overnight, the reality is that even the most experienced support teams are bound to miss a major problem with a critical application or service. One of the contributing factors is their continued reliance on traditional monitoring approaches.

Traditional tools limit us to monitoring for a combination of key performance indicator thresholds and failure modes that have already been experienced. So when it comes to finding new problems, the best case is alerts that describe the symptom (slow response times, failed transactions, etc.). A very experienced IT professional will have seen many behaviors and can consequently employ monitoring based on best practices and past experience. But even the most experienced IT professional will have a hard time designing rules and thresholds that can catch new, unknown problems without generating a stream of noisy false alerts. Anomaly detection goes beyond the limits of traditional approaches because it sees and learns from everything in the data provided, whether a behavior has occurred before or not.
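
To make that limitation concrete, here is a minimal sketch of the kind of static rule traditional tools depend on. The metric name and the 500 ms threshold are invented for illustration, not taken from any particular product:

```python
# A static, hand-tuned rule of the kind traditional monitoring relies on:
# alert when response time crosses a fixed threshold. The number encodes
# one past experience; it knows nothing about novel failure modes.

RESPONSE_TIME_THRESHOLD_MS = 500  # chosen from past incidents, not learned

def check_threshold(response_time_ms: float) -> bool:
    """Return True when the static rule would raise an alert."""
    return response_time_ms > RESPONSE_TIME_THRESHOLD_MS

# Fires during a routine peak-load blip (a noisy false alert) ...
print(check_threshold(510.0))   # True
# ... yet stays silent while a slow leak degrades latency toward the edge.
print(check_threshold(450.0))   # False
```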

Anomaly detection works by identifying unusual behaviors in data generated by an application or service delivery environment. The technology uses machine learning-driven predictive analytics to establish baselines in the data and automatically learn what normal behavior is. It then identifies deviations in behavior that are unusually severe or may be causal to other anomalies – a clear indication that something is wrong. And the best part? The technology works in real time as well as in troubleshooting mode, so it proactively monitors your IT environment. With this approach, real problems can be identified and acted upon faster than before.
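
As a rough illustration of the baseline-and-deviation idea, here is a toy detector that models "normal" as a rolling window of recent values and flags large departures from it. The window size and deviation cutoff are illustrative assumptions; commercial products use far richer statistical models:

```python
import statistics
from collections import deque

class BaselineDetector:
    """Toy baseline learner: 'normal' is a rolling mean/std over a recent
    window; points that deviate far from it are flagged as anomalies."""

    def __init__(self, window: int = 60, cutoff: float = 4.0):
        self.history = deque(maxlen=window)  # recent "normal" observations
        self.cutoff = cutoff                 # deviations beyond this are anomalous

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 30:  # wait until a minimal baseline has formed
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / std > self.cutoff
        if not anomalous:
            self.history.append(value)  # keep learning what normal looks like
        return anomalous

detector = BaselineDetector()
for latency_ms in [100, 102, 98, 101, 99] * 10 + [480]:
    if detector.observe(latency_ms):
        print(f"anomaly: {latency_ms} ms deviates sharply from the learned baseline")
```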

More advanced anomaly detection technologies can run multiple analyses in parallel and analyze multiple data sources simultaneously, identifying related anomalies across the system. Thus, when a chain of events is causal to a performance issue, the alerts contain all the related anomalies. This helps support teams zero in on the cause of the problem immediately.
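
One simple way to picture this correlation step: once each data source has produced its own anomalies, bucket those that occur close together in time so a single alert carries the whole chain. A minimal sketch, with invented source names and timestamps:

```python
from collections import defaultdict

# Hypothetical anomaly records from three data sources:
# (timestamp_sec, source, finding)
anomalies = [
    (1000, "database",   "query latency deviation"),
    (1003, "app_server", "transaction failure-rate spike"),
    (1005, "web_tier",   "response-time deviation"),
    (4000, "database",   "connection-count dip"),
]

def group_related(records, window_sec=30):
    """Bucket anomalies that occur close together in time across sources,
    so one alert can carry the whole chain instead of three separate pages."""
    buckets = defaultdict(list)
    for ts, source, finding in sorted(records):
        buckets[ts // window_sec].append((ts, source, finding))
    return list(buckets.values())

for group in group_related(anomalies):
    sources = sorted({src for _, src, _ in group})
    print(f"one alert spanning {sources}:")
    for ts, source, finding in group:
        print(f"  t={ts}s {source}: {finding}")
```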

Traditional approaches are also known to generate huge volumes of false alerts. Anomaly detection, on the other hand, uses advanced statistical analyses to minimize false alerts. Those few alerts that are generated provide more data, which results in faster troubleshooting.

Anomaly detection looks for significant variations from the norm and ranks severity by probability. Machine learning helps the system learn the difference between commonly occurring errors and routine spikes and drops in metrics, on the one hand, and true anomalies that are more accurate indicators of a problem, on the other. This can mean the difference between tens of thousands of alerts each day, most of which are false, and a dozen or so a week that are actually worth pursuing.
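
One way to picture probability-based ranking: score each deviation by how improbable it would be if behavior were still normal, then sort. The sketch below assumes a Gaussian baseline; the metric names and numbers are invented for illustration:

```python
import math

def tail_probability(value: float, mean: float, std: float) -> float:
    """Two-sided tail probability under a Gaussian baseline: how unlikely
    is a deviation at least this large if behavior were still normal?"""
    z = abs(value - mean) / std
    return math.erfc(z / math.sqrt(2))  # smaller = less probable = more severe

# (observed value, learned baseline mean, learned baseline std) - illustrative
observations = {
    "login_latency_ms": (950.0, 200.0, 40.0),   # a true anomaly
    "error_rate_pct":   (0.9,   0.5,   0.3),    # a routine fluctuation
    "cpu_load":         (0.62,  0.55,  0.05),   # a mild, common spike
}

# Rank most-improbable (most severe) first; routine noise sinks to the bottom.
for name, (value, mean, std) in sorted(observations.items(),
                                       key=lambda kv: tail_probability(*kv[1])):
    print(f"{name}: p ~ {tail_probability(value, mean, std):.2e}")
```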

Anomaly detection can identify the early signs of developing problems in massive volumes of data before they grow into serious ones. By slashing troubleshooting time and cutting the noise from false alarms, it empowers IT teams to attack and resolve issues before they reach critical proportions.

If users do become aware of a problem, the IT team can respond "we're on it" instead of saying "thanks for letting us know."

Kevin Conklin is VP of Product Marketing at Ipswitch
