APM and Observability: Cutting Through the Confusion — Part 10

Pete Goldin
APMdigest

One more important point for the experts to consider is the impact of Artificial Intelligence (AI) on APM and Observability.

Start with: APM and Observability - Cutting Through the Confusion - Part 9

AI plays a transformative role in both APM and observability by turning raw data into actionable insights, enabling faster, more accurate detection and resolution of issues, according to Nigel Hickey, Senior Technical Marketing Manager at NetBrain.

Today, the most efficient and effective observability platforms leverage AI and ML to automate IT root cause analysis, sift through vast numbers of log files, and interpret data with higher speed and accuracy than any human IT professional or team could do independently, says Douglas James, Vice President, Solutions & Ecosystem at ScienceLogic.

"Application Performance Monitoring is no longer a siloed step — it's now integrated into broader observability workflows," explains Bahubali Shetti, Senior Director, Product Marketing, Elastic. "Instead of manually sifting through traces, logs, and metrics, AI-powered tools like an Observability AI Assistant can quickly identify high latency, failed transactions, or Kubernetes scaling issues, all with the right context."

Shetti continues, "What makes these assistants especially powerful is their use of Retrieval-Augmented Generation (RAG), which combines large language models with your organization's data, such as GitHub issues, runbooks, and documentation, to deliver smart, context-aware responses. These assistants connect the dots across all signals (logs, metrics, traces) and all sources (application, Kubernetes, cloud, etc.), helping users focus more on improving systems, not just troubleshooting them."

The following are some of the many capabilities AI can provide for both APM and Observability, according to the experts:

Automating Manual Tasks

In APM, AI accelerates diagnostics work and helps teams optimize application performance through more automation and less manual sweat. 
Bryan Cole
Director of Customer Engineering, Tricentis

The biggest promise of AI is to reduce or eliminate toil by automating tasks that aren't genuinely creative, but have traditionally required humans for various reasons. Unfortunately, APM and Observability are rife with these kinds of tasks: spotting anomalies, configuring alerts, scanning changes for relevant issues, assessing the impact of incidents, validating deploys. All of these are things that humans routinely do, but are easy to forget or do incorrectly, which will cause or prolong an incident. Leveraging AI, an intelligent platform can automate much of that burden.
Nic Benders
Chief Technical Strategist, New Relic

Decision-Making Guidance

AI helps to explain findings, making them easier to understand for the human who needs to act upon the observability data. An example would be: "Explain the best steps to mitigate the system outage that was identified by the observability platform!" With Agentic AI this use case goes a step further, where one could ask: "Open up a Pull Request with the suggested remediation steps and assign it to the team that owns the problematic system component!"
Andreas Grabner
Fellow DevRel and CNCF Ambassador, Dynatrace

Conversational interfaces are emerging, allowing practitioners to essentially "chat" with their systems about performance and health. The pace of improvement is incredibly fast, and I'm genuinely excited about the capabilities that will blossom in the next one to two years.
Juraci Paixão Kröhling
Software Engineer, OllyGarden

Resource Issue Identification

AI can identify inefficiencies, such as slow query patterns or resource bottlenecks, and monitor overall system health to spot anomalies.
Ajay Khanna
CMO, Yugabyte

AI can automate alerting, optimize performance based on data and forecast potential performance issues such as resource exhaustion based on historical trends.
Varma Kunaparaju
SVP and GM for Cloud Platform and OpsRamp Software, HPE
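
To make "forecast resource exhaustion based on historical trends" concrete, here is a minimal sketch. It assumes hourly disk-usage samples expressed as percentages and uses a plain least-squares line as a stand-in for whatever model a real platform would apply. The function name, the sample data, and the 100% capacity figure are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_exhaustion(
    usage_pct: Sequence[float], capacity_pct: float = 100.0
) -> Optional[float]:
    """Fit a straight line to hourly usage samples and extrapolate to capacity.

    Returns the number of hours after the last sample at which the fitted
    line reaches capacity_pct, or None if usage is flat or falling.
    """
    n = len(usage_pct)
    if n < 2:
        return None
    xs = range(n)  # sample index doubles as "hours since first sample"
    mean_x = sum(xs) / n
    mean_y = sum(usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no upward trend, so nothing to forecast
    intercept = mean_y - slope * mean_x
    hour_at_capacity = (capacity_pct - intercept) / slope
    return max(0.0, hour_at_capacity - (n - 1))


# Hypothetical samples: disk usage growing by 2 points per hour.
disk_usage = [70.0, 72.0, 74.0, 76.0, 78.0]
print(hours_until_exhaustion(disk_usage))
```

A straight line is the simplest possible trend model; seasonal or bursty workloads would need something richer, which is where the ML-based forecasting described above comes in.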

Anomaly Detection

With anomaly detection, AI analyzes metrics, logs, and traces to identify unusual patterns, such as sudden spikes in error rates or latency, faster than manual thresholds.
Varma Kunaparaju
SVP and GM for Cloud Platform and OpsRamp Software, HPE

AI can automatically flag unusual patterns in metrics or traces that humans might miss, especially in complex distributed systems.
Rakesh Gupta
Head of Product Management, Observe
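
As a rough illustration of how a learned baseline differs from a manual threshold, the sketch below flags a sample when it sits more than a few standard deviations away from the mean of the preceding window. This is a simple statistical stand-in, not any vendor's algorithm; the window size, the z-score cutoff, and the latency values are assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(
    values: Sequence[float], window: int = 5, z_threshold: float = 3.0
) -> List[int]:
    """Return indices whose value deviates strongly from the rolling baseline."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]  # the samples just before index i
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # perfectly flat baseline: z-score is undefined, skip
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(find_anomalies(latency_ms))
```

Because the baseline is recomputed from recent data, the same code adapts to a service whose normal latency is 20 ms or 2,000 ms, whereas a fixed threshold would have to be tuned by hand for each one.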

Alert Noise Reduction

As telemetry grows, AI will be essential in automating the separation of signal from noise.
Gurjeet Arora
CEO and Co-Founder, Observo AI

Alert noise reduction: Instead of getting 50 alerts when something breaks, AI can group related symptoms and surface the most likely root cause indicators.
Rakesh Gupta
Head of Product Management, Observe

AI-powered tools can help understand the telemetry data profile and separate signal from noise. 
Ajay Khanna
CMO, Yugabyte
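
The idea of collapsing many alerts into one incident can be sketched with a simple time-gap rule: alerts that fire within a short interval of one another are treated as symptoms of the same problem, and the earliest one is surfaced first. Real platforms use far richer signals (topology, historical co-occurrence, learned models); the `Alert` type, the 60-second gap, and the sample alerts below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


def group_alerts(alerts: List[Alert], gap_seconds: float = 60.0) -> List[List[Alert]]:
    """Group alerts so that consecutive alerts in a group are at most gap_seconds apart."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= gap_seconds:
            groups[-1].append(alert)  # close in time to the previous alert
        else:
            groups.append([alert])  # start a new incident
    return groups


alerts = [
    Alert(0, "db", "connection pool exhausted"),
    Alert(5, "api", "latency high"),
    Alert(12, "web", "5xx rate high"),
    Alert(600, "batch", "job delayed"),
]
incidents = group_alerts(alerts)
for incident in incidents:
    first = incident[0]
    print(f"{len(incident)} alert(s); earliest: {first.service}: {first.message}")
```

Here four alerts become two incidents, and the database alert is shown first for the larger one. Treating "earliest" as "most likely cause" is only a heuristic, which is exactly the gap the AI-based approaches quoted above aim to close.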

Streamlined Troubleshooting

In the APM space, AI helps automate tasks like anomaly detection, spotting performance degradation patterns, and linking incidents to specific code changes or deployments — all of which significantly speed up troubleshooting.
Arun Balachandran
Senior Product Marketing Manager, ManageEngine APM Solutions
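
Linking an incident to a deployment can, in its simplest form, mean finding the most recent deploy that landed shortly before the incident began. The sketch below does only that; the deploy identifiers, timestamps, and one-hour lookback are hypothetical, and a production tool would also weigh which services the deploy touched.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Deploy = Tuple[float, str]  # (timestamp in seconds, deploy identifier)


def likely_deploy(
    deploys: List[Deploy], incident_time: float, lookback_seconds: float = 3600.0
) -> Optional[str]:
    """Return the latest deploy at or before incident_time, if within the lookback."""
    ordered = sorted(deploys)
    times = [t for t, _ in ordered]
    idx = bisect_right(times, incident_time) - 1  # last deploy not after the incident
    if idx < 0:
        return None
    deploy_time, deploy_id = ordered[idx]
    if incident_time - deploy_time > lookback_seconds:
        return None  # too old to be a plausible suspect
    return deploy_id


deploys = [(1000.0, "checkout-v41"), (5000.0, "search-v7"), (9000.0, "checkout-v42")]
print(likely_deploy(deploys, 9300.0))
```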

Faster MTTR

AI facilitates intelligent automation, transforming insights into actionable steps and significantly reducing mean time to resolution (MTTR). It's about harnessing AI to not only understand, but to act swiftly and decisively. 
Gab Menachem
VP ITOM, ServiceNow

Observability and APM are the best use cases of Agentic AI. LLM technology and agentic workflows can pass through massive amounts of metrics, events, logs, and traces to improve the signal-to-noise ratio, accelerating MTTR/D and therefore resolution, minimizing human triage time and increasing application uptime.
Bill Lobig
VP of Observability, IBM Automation

Event Correlation

In observability, AI correlates events across logs, metrics, and traces to highlight causality. 
Hugo Kaczmarek
Director of Product, APM Suite, Datadog

Root Cause Analysis

We're seeing AI assist with constructing queries, generating dashboards, interpreting raw telemetry signals, and pointing towards the likely direction of a problem's root cause.
Juraci Paixão Kröhling
Software Engineer, OllyGarden

Root cause suggestions: When an incident occurs, AI can correlate across different data sources and suggest probable causes based on historical patterns.
Rakesh Gupta
Head of Product Management, Observe

By using AI and ML to gain complete visibility and automated root cause analysis, observability solutions improve customer experiences, enhance employee productivity, and optimize digital infrastructure at profound levels.
Douglas James
VP, Solutions & Ecosystem, ScienceLogic

AI, often understood today as LLMs, can assist by enabling natural language querying and summarizing telemetry data for faster exploration. However, LLMs fall short when it comes to accurately identifying root causes, as they lack an understanding of system causality. This is where causal reasoning becomes essential. By modeling how components influence one another, causal analysis can pinpoint the actual source of incidents, not just symptoms. It provides precise, explainable insights that go beyond what LLMs can infer from surface-level patterns.
Severin Neumann
Head of Community & Developer Relations, Causely

Prioritizing Likely Problems

AI excels at text but is still evolving for data-rich environments. It should be used to guide and narrow down the search for issues rather than fully automating diagnoses or replacing human expertise. I view AI's role as being strongest when it helps prioritize likely problems, allowing humans to focus their efforts.
Jeff Cobb
Global Head of Product & Design, Chronosphere

Predicting Potential Problems

AI-powered observability processes vast volumes of telemetry data in real-time, automatically detecting anomalies, pinpointing root causes, and anticipating issues before they occur. It allows teams to shift from reactive troubleshooting to proactive, preventative operations — saving time, reducing alert fatigue, and improving reliability across complex environments. 
Andreas Grabner
Fellow DevRel and CNCF Ambassador, Dynatrace

In APM, AI is increasingly used to detect unusual application behavior, user drop-off patterns, or performance degradations before they impact SLAs. 
Gurjeet Arora
CEO and Co-Founder, Observo AI

AI's role is growing fast here. It's great for spotting patterns you might miss, flagging anomalies in real time and even predicting potential failures before they cause real issues. In APM, that means catching performance slowdowns early. Observability means making sense of a flood of data (logs, traces, metrics) and connecting the dots quickly.
Tanner Burson
Engineering Leader, Prismatic

In APM, AI helps baseline normal application behavior, detects anomalies in real time, and accelerates root cause analysis by correlating signals across the application stack, predicting potential failures before they impact end users.
Nigel Hickey
Senior Technical Marketing Manager, NetBrain

Autoremediation

We're seeing a rise in AI-driven observability tools that not only recommend fixes but can proactively trigger automated remediation, helping teams resolve problems faster and build more resilient systems.
Arun Balachandran
Senior Product Marketing Manager, ManageEngine APM Solutions

Incident Documentation

GenAI can provide support for documentation of issues and, when included within the organization's documentation, provide better responses for future issues using retrieval-augmented generation (RAG). The next step would be Agentic AI, through which incidents could be automatically resolved and documented.
Harald Burose
Director, Product Management, Research & Development – Engineering, OpenText

Visibility into Business Impact

AI helps bridge the technical nature of telemetry and observability data to people outside engineering. AI allows users to get real-time answers in their context, tied to business impact. 
Ariel Assaraf
CEO, Coralogix

Observability-Driven Development

AI supports observability-driven development, providing automated feedback to catch performance issues early, shifting observability from reactive troubleshooting to proactive optimization.
Ajay Khanna
CMO, Yugabyte

Cost Reduction

For observability, AI can filter out low-value data to reduce storage and licensing costs.
Gurjeet Arora
CEO and Co-Founder, Observo AI

Conclusion: Telemetry Is Key

If you're working with sampled traces and aggregated metrics, AI can't provide the full picture. The real opportunity comes from having comprehensive, unified telemetry data that enables correlation across your entire technology stack.
Rakesh Gupta
Head of Product Management, Observe

AI can accelerate incident response by surfacing anomalies, correlating patterns, and even suggesting root causes. But for AI to be meaningful, it needs structured, rich telemetry, not black-box outputs. This is where OpenTelemetry shines. By standardizing the way metrics, logs, and traces are collected and annotated, it provides high-quality input for AI systems to reason over.
Brian Douglas
Head of Ecosystem, Cloud Native Computing Foundation (CNCF)

Go to: APM and Observability - Cutting Through the Confusion - Part 11, presenting predictions about the future of APM and Observability.

Pete Goldin is Editor and Publisher of APMdigest
