Skip to main content

How AIOps Solutions with the Right Foundation Can Help Reduce the IT Blame Game

Haroon Ahmed

In the news last year, a core network change at a large service provider resulted in a major outage that not only impacted end consumers and businesses, but also critical services like 911 and Interac. At home, phone and internet service were down for the workday as millions were without service.

Modern distributed and ephemeral systems have connected us better than ever before, and the latest ChatGPT phenomenon has opened the possibility for new and mind-blowing innovations. However, at the same time, our dependency on this connected world, along with its nonstop innovations, challenges our ethos with important questions and concerns around privacy, ethics, and security, and challenges our IT teams with outages of often unknown origin.

When it comes to system outages, AIOps solutions with the right foundation can help reduce the blame game so the right teams can spend valuable time restoring the impacted services rather than improving their MTTI score (mean time to innocence). In fact, much of today's innovation around ChatGPT-style algorithms can be used to significantly improve the triage process and user experience.

In the monitoring space, impact analysis for services spanning application to network or cloud to mainframe has known gaps that, if solved, can have a big impact on service availability. Today, these gaps require human intervention and result in never-ending bridge calls and the blame game where each siloed team responsible for applications, infrastructure, network, and mainframe are in a race to improve their MTTI score. This, unfortunately, also has a direct impact on customer experience and brand quality.

The challenge faced by teams is a layering issue, or "Layeritis." For each layer, different kinds of monitoring solutions are used. Each solution, in turn, has its own team and applies different techniques like code injection, polling, or network taps. This wide spectrum of monitoring techniques eventually generates key artifacts like metrics, alarms, events, and topology that are unique and useful in the given solution but operate in silos and do not provide an end-to-end impact flow.

Tool spam leads to a noise reduction challenge, which many AIOps tools solve today with algorithmic event noise reduction using proven clustering algorithms. However, in practice, this has not been proven to get to root cause. The hard problem is root cause isolation across the layers, which requires a connected topology (knowledge graph) that spans the multiple layers and can deterministically reconcile devices/configuration items (CIs) across the different layers.


Challenge of root cause isolation across the siloed application, infrastructure, network, and mainframe layers.

Seven Steps to Cure Layeritis

The solution to an AIOps Layeritis challenge requires planning and multiple iterations to get right. Once steps 1-3 are in a good state, steps 4-7 are left to AI/machine learning (ML) algorithms to decipher the signal from the noise and provide actionable insights. The seven steps are as follows:

1. Data ingestion from monitoring tools representing the different layers to a common data lake that includes metrics, events, topology, and logs

2. Automatic reconciliation across the different layers to establish end-to-end connectivity.

■ Since end user experience is tied to service health score, include key browser performance or voice quality metrics.

■ Application topology to underlying virtual and physical infrastructure for cloud, containers, and private data centers (e.g., APM tools may connect to the virtual host, but will not provide visibility to the underlying physical infrastructure used to run the virtual hosts).

■ Infrastructure connectivity to the underlying virtual and physical network devices like switches, routers, firewalls, and load balancers.

■ Virtual and physical infrastructure connectivity to the mainframe services like DB2, MQ, IMS, and CICS.

3. Dynamic service modeling to draw boundaries and build business services based on reconciled layers.

4. Clustering algorithm for noise reduction of events from metrics, logs, and alarms within a service boundary.

5. Page ranking and network centricity algorithms for root cause isolation using the connected topology and historical knowledge graph.

6. Large Language Model (LLM)/Generative AI (GPT) algorithm to build human-readable problem summaries. This helps less technical help desk resources quickly understand the issue.

7. Knowledge graph updated with the causal series of events (aka a fingerprint). Fingerprints are compared with historical occurrences to help make informed decisions on root cause, determine the next best action, or take proactive action on issues that could become major incidents.

For algorithms to give positive results with a high level of confidence, good data ingestion is required. Garbage data will always give bad results. For data, organizations rely on proven monitoring tools for the different layers to provide artifacts like topology, metrics, events, and logs. Additionally, with metrics and logs, it's possible to create meaningful events based on anomaly detection and advanced log processing.

Below are three use cases that focus on common issues today's IT teams face, which can be resolved using AIOps in a single consolidated view to identify the root cause and automate the next steps:

Use Case 1: Application issue where infrastructure and network are not impacted. Here, AIOps will only identify the impacted application software components.

Use Case 2: Network issue where infrastructure and application are impacted, but not at fault.

Use Case 3: Mainframe database issue where connected application running on distributed infrastructure is impacted.

In each use case above, AIOps removes the need for time-intensive investigation and guesswork so your team can see and respond to issues — even before they affect the business — and focus on higher-value projects. 

Overall, AIOps solutions can provide visibility and generate proactive insights across the entire application structure, from end user to cloud to data center to mainframe.

Hot Topics

The Latest

Enterprises are under pressure to scale AI quickly. Yet despite considerable investment, adoption continues to stall. One of the most overlooked reasons is vendor sprawl ... In reality, no organization deliberately sets out to create sprawling vendor ecosystems. More often, complexity accumulates over time through well-intentioned initiatives, such as enterprise-wide digital transformation efforts, point solutions, or decentralized sourcing strategies ...

Nearly every conversation about AI eventually circles back to compute. GPUs dominate the headlines while cloud platforms compete for workloads and model benchmarks drive investment decisions. But underneath that noise, a quieter infrastructure challenge is taking shape. The real bottleneck in enterprise AI is not processing power, it is the ability to store, manage and retrieve the relentless volumes of data that AI systems generate, consume and multiply ...

The 2026 Observability Survey from Grafana Labs paints a vivid picture of an industry maturing fast, where AI is welcomed with careful conditions, SaaS economics are reshaping spending decisions, complexity remains a defining challenge, and open standards continue to underpin it all ...

The observability industry has an evolving relationship with AI. We're not skeptics, but it's clear that trust in AI must be earned ... In Grafana Labs' annual Observability Survey, 92% said they see real value in AI surfacing anomalies before they cause downtime. Another 91% endorsed AI for forecasting and root cause analysis. So while the demand is there, customers need it to be trustworthy, as the survey also found that the practitioners most enthusiastic about AI are also the most insistent on explainability ...

In the modern enterprise, the conversation around AI has moved past skepticism toward a stage of active adoption. According to our 2026 State of IT Trends Report: The Human Side of Autonomous AI, nearly 90% of IT professionals view AI as a net positive, and this optimism is well-founded. We are seeing agentic AI move beyond simple automation to actively streamlining complex data insights and eliminating the manual toil that has long hindered innovation. However, as we integrate these autonomous agents into our ecosystems, the fundamental DNA of the IT role is evolving ...

AI workloads require an enormous amount of computing power ... What's also becoming abundantly clear is just how quickly AI's computing needs are leading to enterprise systems failure. According to Cockroach Labs' State of AI Infrastructure 2026 report, enterprise systems are much closer to failure than their organizations realize. The report ... suggests AI scale could cause widespread failures in as little as one year — making it a clear risk for business performance and reliability.

The quietest week your engineering team has ever had might also be its best. No alarms going off. No escalations. No frantic Teams or Slack threads at 2 a.m. Everything humming along exactly as it should. And somewhere in a leadership meeting, someone looks at the metrics dashboard, sees a flat line of incidents and says: "Seems like things are pretty calm over there. Do we really need all those people?" ... I've spent many years in engineering, and this pattern keeps repeating ...

The gap is widening between what teams spend on observability tools and the value they receive amid surging data volumes and budget pressures, according to The Breaking Point for Observability Leaders, a report from Imply ...

Seamless shopping is a basic demand of today's boundaryless consumer — one with little patience for friction, limited tolerance for disconnected experiences and minimal hesitation in switching brands. Customers expect intuitive, highly personalized experiences and the ability to move effortlessly across physical and digital channels within the same journey. Failure to deliver can cost dearly ...

If your best engineers spend their days sorting tickets and resetting access, you are wasting talent. New global data shows that employees in the IT sector rank among the least motivated across industries. They're under a lot of pressure from many angles. Pressure to upskill and uncertainty around what agentic AI means for job security is creating anxiety. Meanwhile, these roles often function like an on-call job and require many repetitive tasks ...

How AIOps Solutions with the Right Foundation Can Help Reduce the IT Blame Game

Haroon Ahmed

In the news last year, a core network change at a large service provider resulted in a major outage that not only impacted end consumers and businesses, but also critical services like 911 and Interac. At home, phone and internet service were down for the workday as millions were without service.

Modern distributed and ephemeral systems have connected us better than ever before, and the latest ChatGPT phenomenon has opened the possibility for new and mind-blowing innovations. However, at the same time, our dependency on this connected world, along with its nonstop innovations, challenges our ethos with important questions and concerns around privacy, ethics, and security, and challenges our IT teams with outages of often unknown origin.

When it comes to system outages, AIOps solutions with the right foundation can help reduce the blame game so the right teams can spend valuable time restoring the impacted services rather than improving their MTTI score (mean time to innocence). In fact, much of today's innovation around ChatGPT-style algorithms can be used to significantly improve the triage process and user experience.

In the monitoring space, impact analysis for services spanning application to network or cloud to mainframe has known gaps that, if solved, can have a big impact on service availability. Today, these gaps require human intervention and result in never-ending bridge calls and the blame game where each siloed team responsible for applications, infrastructure, network, and mainframe are in a race to improve their MTTI score. This, unfortunately, also has a direct impact on customer experience and brand quality.

The challenge faced by teams is a layering issue, or "Layeritis." For each layer, different kinds of monitoring solutions are used. Each solution, in turn, has its own team and applies different techniques like code injection, polling, or network taps. This wide spectrum of monitoring techniques eventually generates key artifacts like metrics, alarms, events, and topology that are unique and useful in the given solution but operate in silos and do not provide an end-to-end impact flow.

Tool spam leads to a noise reduction challenge, which many AIOps tools solve today with algorithmic event noise reduction using proven clustering algorithms. However, in practice, this has not been proven to get to root cause. The hard problem is root cause isolation across the layers, which requires a connected topology (knowledge graph) that spans the multiple layers and can deterministically reconcile devices/configuration items (CIs) across the different layers.


Challenge of root cause isolation across the siloed application, infrastructure, network, and mainframe layers.

Seven Steps to Cure Layeritis

The solution to an AIOps Layeritis challenge requires planning and multiple iterations to get right. Once steps 1-3 are in a good state, steps 4-7 are left to AI/machine learning (ML) algorithms to decipher the signal from the noise and provide actionable insights. The seven steps are as follows:

1. Data ingestion from monitoring tools representing the different layers to a common data lake that includes metrics, events, topology, and logs

2. Automatic reconciliation across the different layers to establish end-to-end connectivity.

■ Since end user experience is tied to service health score, include key browser performance or voice quality metrics.

■ Application topology to underlying virtual and physical infrastructure for cloud, containers, and private data centers (e.g., APM tools may connect to the virtual host, but will not provide visibility to the underlying physical infrastructure used to run the virtual hosts).

■ Infrastructure connectivity to the underlying virtual and physical network devices like switches, routers, firewalls, and load balancers.

■ Virtual and physical infrastructure connectivity to the mainframe services like DB2, MQ, IMS, and CICS.

3. Dynamic service modeling to draw boundaries and build business services based on reconciled layers.

4. Clustering algorithm for noise reduction of events from metrics, logs, and alarms within a service boundary.

5. Page ranking and network centricity algorithms for root cause isolation using the connected topology and historical knowledge graph.

6. Large Language Model (LLM)/Generative AI (GPT) algorithm to build human-readable problem summaries. This helps less technical help desk resources quickly understand the issue.

7. Knowledge graph updated with the causal series of events (aka a fingerprint). Fingerprints are compared with historical occurrences to help make informed decisions on root cause, determine the next best action, or take proactive action on issues that could become major incidents.

For algorithms to give positive results with a high level of confidence, good data ingestion is required. Garbage data will always give bad results. For data, organizations rely on proven monitoring tools for the different layers to provide artifacts like topology, metrics, events, and logs. Additionally, with metrics and logs, it's possible to create meaningful events based on anomaly detection and advanced log processing.

Below are three use cases that focus on common issues today's IT teams face, which can be resolved using AIOps in a single consolidated view to identify the root cause and automate the next steps:

Use Case 1: Application issue where infrastructure and network are not impacted. Here, AIOps will only identify the impacted application software components.

Use Case 2: Network issue where infrastructure and application are impacted, but not at fault.

Use Case 3: Mainframe database issue where connected application running on distributed infrastructure is impacted.

In each use case above, AIOps removes the need for time-intensive investigation and guesswork so your team can see and respond to issues — even before they affect the business — and focus on higher-value projects. 

Overall, AIOps solutions can provide visibility and generate proactive insights across the entire application structure, from end user to cloud to data center to mainframe.

Hot Topics

The Latest

Enterprises are under pressure to scale AI quickly. Yet despite considerable investment, adoption continues to stall. One of the most overlooked reasons is vendor sprawl ... In reality, no organization deliberately sets out to create sprawling vendor ecosystems. More often, complexity accumulates over time through well-intentioned initiatives, such as enterprise-wide digital transformation efforts, point solutions, or decentralized sourcing strategies ...

Nearly every conversation about AI eventually circles back to compute. GPUs dominate the headlines while cloud platforms compete for workloads and model benchmarks drive investment decisions. But underneath that noise, a quieter infrastructure challenge is taking shape. The real bottleneck in enterprise AI is not processing power, it is the ability to store, manage and retrieve the relentless volumes of data that AI systems generate, consume and multiply ...

The 2026 Observability Survey from Grafana Labs paints a vivid picture of an industry maturing fast, where AI is welcomed with careful conditions, SaaS economics are reshaping spending decisions, complexity remains a defining challenge, and open standards continue to underpin it all ...

The observability industry has an evolving relationship with AI. We're not skeptics, but it's clear that trust in AI must be earned ... In Grafana Labs' annual Observability Survey, 92% said they see real value in AI surfacing anomalies before they cause downtime. Another 91% endorsed AI for forecasting and root cause analysis. So while the demand is there, customers need it to be trustworthy, as the survey also found that the practitioners most enthusiastic about AI are also the most insistent on explainability ...

In the modern enterprise, the conversation around AI has moved past skepticism toward a stage of active adoption. According to our 2026 State of IT Trends Report: The Human Side of Autonomous AI, nearly 90% of IT professionals view AI as a net positive, and this optimism is well-founded. We are seeing agentic AI move beyond simple automation to actively streamlining complex data insights and eliminating the manual toil that has long hindered innovation. However, as we integrate these autonomous agents into our ecosystems, the fundamental DNA of the IT role is evolving ...

AI workloads require an enormous amount of computing power ... What's also becoming abundantly clear is just how quickly AI's computing needs are leading to enterprise systems failure. According to Cockroach Labs' State of AI Infrastructure 2026 report, enterprise systems are much closer to failure than their organizations realize. The report ... suggests AI scale could cause widespread failures in as little as one year — making it a clear risk for business performance and reliability.

The quietest week your engineering team has ever had might also be its best. No alarms going off. No escalations. No frantic Teams or Slack threads at 2 a.m. Everything humming along exactly as it should. And somewhere in a leadership meeting, someone looks at the metrics dashboard, sees a flat line of incidents and says: "Seems like things are pretty calm over there. Do we really need all those people?" ... I've spent many years in engineering, and this pattern keeps repeating ...

The gap is widening between what teams spend on observability tools and the value they receive amid surging data volumes and budget pressures, according to The Breaking Point for Observability Leaders, a report from Imply ...

Seamless shopping is a basic demand of today's boundaryless consumer — one with little patience for friction, limited tolerance for disconnected experiences and minimal hesitation in switching brands. Customers expect intuitive, highly personalized experiences and the ability to move effortlessly across physical and digital channels within the same journey. Failure to deliver can cost dearly ...

If your best engineers spend their days sorting tickets and resetting access, you are wasting talent. New global data shows that employees in the IT sector rank among the least motivated across industries. They're under a lot of pressure from many angles. Pressure to upskill and uncertainty around what agentic AI means for job security is creating anxiety. Meanwhile, these roles often function like an on-call job and require many repetitive tasks ...