How AIOps Solutions with the Right Foundation Can Help Reduce the IT Blame Game
May 17, 2023

Haroon Ahmed
BMC Software

Last year, a core network change at a large service provider caused a major outage that affected not only consumers and businesses but also critical services like 911 and Interac. At home, phone and internet service were down for the entire workday, and millions of people were left without service.

Modern distributed and ephemeral systems have connected us better than ever before, and the latest ChatGPT phenomenon has opened the possibility for new and mind-blowing innovations. However, at the same time, our dependency on this connected world, along with its nonstop innovations, challenges our ethos with important questions and concerns around privacy, ethics, and security, and challenges our IT teams with outages of often unknown origin.

When it comes to system outages, AIOps solutions with the right foundation can help reduce the blame game so the right teams can spend valuable time restoring the impacted services rather than improving their mean time to innocence (MTTI) score. In fact, much of today's innovation around ChatGPT-style algorithms can be used to significantly improve the triage process and user experience.

In the monitoring space, impact analysis for services spanning application to network or cloud to mainframe has known gaps that, if closed, can have a big impact on service availability. Today, these gaps require human intervention and result in never-ending bridge calls and a blame game in which each siloed team responsible for applications, infrastructure, network, or mainframe is in a race to improve its MTTI score. This, unfortunately, also has a direct impact on customer experience and brand quality.

The challenge faced by teams is a layering issue, or "Layeritis." For each layer, different kinds of monitoring solutions are used. Each solution, in turn, has its own team and applies different techniques like code injection, polling, or network taps. This wide spectrum of monitoring techniques eventually generates key artifacts like metrics, alarms, events, and topology that are unique and useful in the given solution but operate in silos and do not provide an end-to-end impact flow.

Tool spam leads to a noise reduction challenge, which many AIOps tools solve today with algorithmic event noise reduction using proven clustering algorithms. However, in practice, this has not been proven to get to root cause. The hard problem is root cause isolation across the layers, which requires a connected topology (knowledge graph) that spans the multiple layers and can deterministically reconcile devices/configuration items (CIs) across the different layers.

Challenge of root cause isolation across the siloed application, infrastructure, network, and mainframe layers.
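
As a rough illustration of what reconciling CIs across layers can involve, the sketch below merges records from different monitoring tools that describe the same device. The record fields (`tool`, `layer`, `name`, `ip`), the matching rules, and the sample hostnames are all assumptions made for this example; real reconciliation usually relies on richer identifiers and is not described in detail here.

```python
from collections import defaultdict


def normalize(name):
    """Reduce a hostname to a comparable key: lowercase, domain suffix dropped."""
    return name.strip().lower().split(".")[0]


def ci_key(record, ip_to_key):
    """Pick a stable key for a record: known IP first, then hostname, then raw IP."""
    ip = record.get("ip")
    if ip in ip_to_key:
        return ip_to_key[ip]
    if record.get("name"):
        return normalize(record["name"])
    return ip


def reconcile(records):
    """Merge per-tool CI records that appear to refer to the same device."""
    # First pass: learn which hostname each IP belongs to.
    ip_to_key = {}
    for r in records:
        if r.get("name") and r.get("ip"):
            ip_to_key.setdefault(r["ip"], normalize(r["name"]))

    # Second pass: group every record under its reconciled key.
    merged = defaultdict(lambda: {"layers": set(), "tools": set()})
    for r in records:
        key = ci_key(r, ip_to_key)
        if key is None:
            continue  # nothing to match on, so leave the record unreconciled
        merged[key]["layers"].add(r["layer"])
        merged[key]["tools"].add(r["tool"])
    return dict(merged)


# Hypothetical records from four siloed tools.
RECORDS = [
    {"tool": "apm", "layer": "application", "name": "Web01.corp.example.com", "ip": "10.0.0.5"},
    {"tool": "infra", "layer": "infrastructure", "name": "web01"},
    {"tool": "netmon", "layer": "network", "ip": "10.0.0.5"},
    {"tool": "mainframe", "layer": "mainframe", "name": "zos-db2a"},
]

for key, info in sorted(reconcile(RECORDS).items()):
    print(key, sorted(info["layers"]))
```

In this sketch, the three records that describe `web01` collapse into one CI that spans the application, infrastructure, and network layers, which is the kind of cross-layer link a connected topology depends on.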

Seven Steps to Cure Layeritis

The solution to an AIOps Layeritis challenge requires planning and multiple iterations to get right. Once steps 1-3 are in a good state, steps 4-7 are left to AI/machine learning (ML) algorithms to decipher the signal from the noise and provide actionable insights. The seven steps are as follows:

1. Data ingestion from monitoring tools representing the different layers to a common data lake that includes metrics, events, topology, and logs.

2. Automatic reconciliation across the different layers to establish end-to-end connectivity.

■ Since end user experience is tied to service health score, include key browser performance or voice quality metrics.

■ Application topology to underlying virtual and physical infrastructure for cloud, containers, and private data centers (e.g., APM tools may connect to the virtual host, but will not provide visibility to the underlying physical infrastructure used to run the virtual hosts).

■ Infrastructure connectivity to the underlying virtual and physical network devices like switches, routers, firewalls, and load balancers.

■ Virtual and physical infrastructure connectivity to the mainframe services like DB2, MQ, IMS, and CICS.

3. Dynamic service modeling to draw boundaries and build business services based on reconciled layers.

4. Clustering algorithm for noise reduction of events from metrics, logs, and alarms within a service boundary.

5. Page ranking and network centrality algorithms for root cause isolation using the connected topology and historical knowledge graph.

6. Large language model (LLM)/generative AI algorithms (such as GPT-style models) to build human-readable problem summaries. This helps less technical help desk staff quickly understand the issue.

7. Knowledge graph updated with the causal series of events (aka a fingerprint). Fingerprints are compared with historical occurrences to help make informed decisions on root cause, determine the next best action, or take proactive action on issues that could become major incidents.
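
To make steps 4 and 5 more concrete, here is a minimal sketch under stated assumptions. It groups events by time gap (a simple stand-in for a real clustering algorithm) and then runs a textbook PageRank by power iteration over the "depends on" edges between the alarming CIs, so that rank accumulates on the components everything else relies on. The event fields, CI names, edge list, and parameter values are hypothetical, and this is not presented as the method any particular AIOps product uses.

```python
def cluster_events(events, max_gap=60):
    """Group events whose timestamps fall within max_gap seconds of the previous one."""
    clusters = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if clusters and ev["ts"] - clusters[-1][-1]["ts"] <= max_gap:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    return clusters


def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over the subgraph induced by `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    out = {v: [d for s, d in edges if s == v and d in nodes] for v in nodes}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for d in out[v]:
                new[d] += damping * rank[v] / len(out[v])
        rank = new
    return rank


def likely_root_cause(cluster, depends_on):
    """Return the alarming CI with the highest rank in the dependency graph."""
    alarming = {ev["ci"] for ev in cluster}
    rank = pagerank(depends_on, alarming)
    return max(rank, key=rank.get)


# Hypothetical reconciled topology: (dependent, dependency) pairs.
DEPENDS_ON = [
    ("checkout-app", "vm-12"),
    ("cart-app", "vm-14"),
    ("vm-12", "switch-3"),
    ("vm-14", "switch-3"),
]

# Hypothetical events; timestamps are seconds from an arbitrary start.
EVENTS = [
    {"ts": 0, "ci": "switch-3", "msg": "port flapping"},
    {"ts": 5, "ci": "vm-12", "msg": "packet loss"},
    {"ts": 8, "ci": "vm-14", "msg": "packet loss"},
    {"ts": 12, "ci": "checkout-app", "msg": "request timeouts"},
    {"ts": 15, "ci": "cart-app", "msg": "request timeouts"},
    {"ts": 900, "ci": "batch-app", "msg": "job finished late"},
]

clusters = cluster_events(EVENTS)
largest = max(clusters, key=len)
print(likely_root_cause(largest, DEPENDS_ON))  # switch-3
```

Because both applications and both virtual hosts ultimately depend on `switch-3`, it collects the most rank and surfaces as the candidate root cause, while the unrelated late event lands in its own cluster.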

For algorithms to give positive results with a high level of confidence, good data ingestion is required. Garbage data will always give bad results. For data, organizations rely on proven monitoring tools for the different layers to provide artifacts like topology, metrics, events, and logs. Additionally, with metrics and logs, it's possible to create meaningful events based on anomaly detection and advanced log processing.
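
One simple way to picture turning metrics into events through anomaly detection is a rolling z-score: each sample is compared with the mean and standard deviation of the samples just before it. The sketch below assumes a window of 10 samples and a threshold of 3 standard deviations, and uses made-up latency values; production systems typically use more robust baselines that account for seasonality.

```python
from statistics import mean, stdev


def anomaly_events(samples, window=10, threshold=3.0):
    """Emit an event for each sample that deviates strongly from its recent baseline."""
    events = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: this simple sketch skips the sample
        z = (samples[i] - mu) / sigma
        if abs(z) >= threshold:
            events.append({"index": i, "value": samples[i], "z": round(z, 2)})
    return events


# Hypothetical response-time samples in milliseconds, with one spike.
LATENCY_MS = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 100, 250, 102]

for event in anomaly_events(LATENCY_MS):
    print(event["index"], event["value"])
```

Only the spike to 250 ms is flagged here. Note that the sample right after the spike is compared against a baseline that now contains the spike, which is one reason real implementations often exclude flagged points from the baseline.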

Below are three use cases that focus on common issues today's IT teams face, which can be resolved using AIOps in a single consolidated view to identify the root cause and automate the next steps:

Use Case 1: Application issue where infrastructure and network are not impacted. Here, AIOps will only identify the impacted application software components.

Use Case 2: Network issue where infrastructure and application are impacted, but not at fault.

Use Case 3: Mainframe database issue where connected application running on distributed infrastructure is impacted.

In each use case above, AIOps removes the need for time-intensive investigation and guesswork, so your team can see and respond to issues before they affect the business and can focus on higher-value projects.

Overall, AIOps solutions can provide visibility and generate proactive insights across the entire application structure, from end user to cloud to data center to mainframe.

Haroon is VP of Research and Development at BMC Software