IT organizations seeking an AIOps investment should look for the solution that fits their particular needs. It is not a linear choice, warns Dennis Drogseth, VP at Enterprise Management Associates (EMA).
"In our upcoming AIOps Radar, we will be tracking more than twenty analytics options ranging from case-based reasoning and predictive algorithms to natural language search and generative AI," Drogseth explains. "We will also be looking at visualization and data sharing, as well as levels of integrated automation, as well as roles supported. Based on past radars, this varies a great deal across vendors."
"Before investing, you should define your needs and priorities both within and across silos," Drogseth adds.
Start with: Discovering AIOps - Part 1
In Part 2 of this blog series, the experts provide a range of "must-have" capabilities to consider when selecting an AIOps solution, including:
Connectivity with Multiple Data Sources
A modern approach to AIOps should address the volume, velocity, and variety of data in complex multi-cloud environments with advanced AI techniques that provide precise answers and intelligent automation, says Bob Wambach, VP of Product Marketing at Dynatrace.
"True AIOps should have versatility in data collection from multiple sources," Drogseth advises. "Versatility in data collection is a key factor in AIOps as it evolves, looking at technical and business data in many cases, as well as a wide variety of applications, cloud-based and more traditional infrastructure, and interdependencies across the entire business/application fabric."
Carlos Casanova, Principal Analyst at Forrester Research, elaborates, "The need for a broad array of connectivity options both natively and via third-party is the foundation. OpenTelemetry is widely adopted by vendors which helps eliminate many proprietary connective technologies for the enterprise to manage."
Bharani Kumar Kulasekaran, Product Manager at ManageEngine, agrees, "First and foremost, an AIOps platform must be able to seamlessly integrate with a wide range of data sources, including monitoring tools, logs, and other IT systems."
Payal Kindiger, Senior Director of Product Marketing at Riverbed, adds that AIOps should be able to collect information such as telemetry data and ticketing data from sources including networks, applications, hypervisors, containers, and clouds.
Thomas LaRock, Principal Developer Evangelist at Selector, also notes that AIOps should be able to ingest a variety of datasets without the need for additional administrative overhead and configuration.
Built-in monitoring/native instrumentation ranked as the most important feature of an AIOps solution, cited by nearly 55% of respondents in a study from OpsRamp, The State of AIOps 2023.
Rich Set of Analytics Options
"AIOps must draw creatively from an increasingly richer set of advanced analytic options — EMA has identified more than 20 — from more traditional rule-based analytics, to predictive and prescriptive analytics, to case-based reasoning and fuzzy logic, to recent influxes of generative AI and even ChatGPT," says Dennis Drogseth from EMA.
Big Data Processing
"Ingesting and processing massive amounts of data in real-time before identifying patterns/trends in the context of business operations is crucial for determining what actions should be taken," says Carlos Casanova from Forrester.
Gagan Singh, VP of Product Marketing, Observability, at Elastic, adds, "For AIOps to be effective it needs to access comprehensive and accurate data, otherwise, there will be blind spots preventing accurate correlation, pattern identification, predicting potential issues, and root cause analysis. Hence, it's important to have a system capable of ingesting and processing large amounts of data in real time."
"AI doesn't function without large datasets, so data privacy has to be part of any AI conversation. Personally identifiable information used in conjunction with AIOps tools needs to be protected and obscured, whether that's by limiting the PII you collect and store or implementing security controls like two-factor authentication. At the end of the day, responsible AI usage is something that only happens when you pay close attention to how AI is used at all levels of our organization. It has to be an executive-level priority," counsels Brian Emerson, VP & GM, IT Operations Management at ServiceNow.
"Correlation is very much the cornerstone of AIOPS. It seeks to automate the linking of related alerts so that they can be remediated as a whole — reducing the time it takes to understand and then begin working an IT issue," adds Charles Burnham, Director, AIOps Engineering at LogicMonitor.
IT Environment Mapping
AIOps should create a complete view of how everything is connected, like drawing a map of the network, applications, and other parts, says Kindiger from Riverbed.
"AIOps tools can automatically identify abnormal patterns and deviations in system behavior. When an anomaly is detected, IT teams receive alerts and notifications, enabling them to investigate and address potential issues before they escalate," says Scott Likens, Global AI and Innovation Technology Leader at PwC.
"Find a solution that finds anomalies that you're not looking for and issues that haven't occurred in the past. The best AIOps solutions will find things you didn't even know to look for," adds Emerson from ServiceNow.
"Enterprises today should seek a modern AIOps platform that combines observability with causal AI capabilities to gather precise, continuous, and actionable insights in real-time. Causality is an imperative when evaluating AIOps providers because only causal AIOps technology enables fully automated cloud operations across the entire enterprise development lifecycle," explains Wambach from Dynatrace.
It is important to look for an AIOps platform that provides sophisticated machine learning algorithms capable of identifying complex patterns and anomalies in data. A good AIOps platform should be able to use this data to forecast future issues and provide insights on how to prevent them, according to Kulasekaran from ManageEngine.
"AIOps can predict and forecast potential incidents or performance degradation based on historical data and patterns. This enables IT teams to take proactive measures to prevent problems and optimize resource allocation," Scott Likens from PwC adds.
Root Cause Analysis
AIOps can help identify the underlying causes of incidents by analyzing data across various sources. This allows IT teams to quickly pinpoint the origin of problems and implement effective solutions, says Likens from PwC.
"Root cause analysis is a crucial feature, because it provides IT admins with better context into issues that occurred previously," says Kulasekaran from ManageEngine.
AIOps platforms can provide recommendations for optimizing IT infrastructure and services. These insights help IT teams make informed decisions to improve efficiency and resource utilization, says Scott Likens from PwC.
"The presentation of actionable insights by an AIOps tool varies," Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at Enterprise Management Associates (EMA), points out. "My favorite example is a tool that explains what it has discovered and offers recommendations on how to resolve an issue. For example: Based on analysis, end users in X location are failing to connect to your Wi-Fi network because a local DHCP server is unavailable. We recommend rebooting the DHCP server."
"AIOps can prioritize incidents based on their potential impact and urgency. This helps IT teams allocate resources effectively and focus on the most critical issues," says Likens from PwC.
Burnham from LogicMonitor adds that AIOps starts with better quality alerting based on intelligent thresholds that remove false positives. It then acts as a reducer — connecting related alerts into a single narrative, as well as a prioritizer — helping users to determine which alerts are the most important for their organization.
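Prioritizing by impact and urgency can be sketched as a simple scoring function. The 1-to-5 scales, the field names, and the choice to multiply the two scores are assumptions for this example; each organization would weight these factors in its own way.

```python
def prioritize(incidents):
    """Sort incidents by impact * urgency, highest score first."""
    return sorted(
        incidents,
        key=lambda incident: incident["impact"] * incident["urgency"],
        reverse=True,
    )


incidents = [
    {"id": "INC-1", "impact": 2, "urgency": 5},
    {"id": "INC-2", "impact": 5, "urgency": 4},
    {"id": "INC-3", "impact": 1, "urgency": 1},
]
print([incident["id"] for incident in prioritize(incidents)])
# prints: ['INC-2', 'INC-1', 'INC-3']
```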
Shamus McGillicuddy from EMA recommends selecting an AIOps solution with intelligent escalation: guidance on which specialist you should forward a ticket to.
"Self-service is critical to getting AIOps sticky. If we look at all the other parts of what we are doing with CI/CD, infrastructure as code, monitoring definitions as code and moving at speed, treating AIOps like ITSM where only a few of the high priests have access to it or you have to submit a request for someone to add how you want events enriched, what orchestration should be executed, what diagnostics or remediation should be run, etc., it breaks the model. These should just be Terraformed as part of your CI/CD pipeline," says Heath Newburn, Distinguished Field Engineer at PagerDuty.
"The system should allow for terraform configurations so that when the application is deployed, how that application can be managed from an AIOps perspective is treated the same way as the underlying infrastructure as code."
"AIOps requires significant levels of advanced automation for faster and more effective deployment and data collection on the one hand, while also promoting more constructive data sharing and communication, as well as enabling supervised and increasingly unsupervised actions," says Dennis Drogseth from EMA.
Carlos Casanova from Forrester adds, "Intertwined through all the areas is an underpinning of data-driven, dynamic threshold based automation that can help expedite one simple step or autonomously remediate an issue from detection to resolution before users or operators even knew anything was happening."
Read the Forrester Report, AIOps Reference Architecture: Defined, for more information.
Monika Bhave, Product Manager at Digitate, says, "A feature that stands out is the ability to prescribe fixes and resolve problems automatically. Too many AIOps solutions just focus on delivering visibility and insights: they tell IT when and where an issue happens. AIOps should take that a step further and provide actual repairs. Except for some of the more significant events that require human intervention, AIOps technology should automatically make these repairs on its own."
"MTTR is not mean time to root cause: the problem still has to be fixed," adds Newburn from PagerDuty. "The ideal solution should be focused on driving automated responses. Of course the ideal is automated remediation, but automated diagnostics can deliver a better understanding of the problem for both auto-remediation as well as human remediation."
APMdigest will dive deeper into auto-remediation later in this blog series.
AIOps should not require additional training or deep understanding of underlying models, according to Newburn from PagerDuty.
"Users should be able to understand why a model makes a particular choice, recommendation or action. Black box AI/ML approaches make user acceptance a challenge," notes Kindiger from Riverbed.
Camden Swita, Senior Product Manager at New Relic, says AIOps should present a user experience (UX) that allows you to understand the "why" behind a deviation. If the algorithm detected a deviation, are you shown possible or probable causes? Can you see deployments? Outages? Other incidents?
Users need to be able to understand how and why an AIOps algorithm reaches an output or decision so that it can't, for example, insert bias or perpetuate bad practices based on faulty data, warns Emerson from ServiceNow.
"If you simply trust the algorithm is right, you open yourself up to both errors and distrust," Emerson continues. "It's important to remember that bad, biased data leads to bad, biased decisions. Explainability is paramount."
"That's why AIOps technology must employ explainable AI," Bhave from Digitate concludes. "Explainability means that the tech can clearly communicate to business users why it made a specific decision. Explainability ultimately builds trust in users by making sure they understand why it's behaving the way it does."
"Anyone who consumes the advanced algorithms that come with AIOps needs to be aware that they aren't always going to be correct. These tools need to learn and then be refined, and the more context and tuning they have, the more intelligent they'll be," Burnham from LogicMonitor asserts. "The ability to tune these systems is vital. AIOps automates a company's IT operations processes, so it is crucial that users can influence how the system makes its decisions in order to achieve the desired level of control."
"AIOps solutions must always provide hooks for users to intervene. Users should be able to see the insights generated by the machines and be able to communicate to the machine if they agree or disagree with the insights," adds Bhave from Digitate.
According to Swita from New Relic, there are ways to improve an AIOps system's success rate. If you layer in additional context, such as a map of dependencies in a software system or topology, you help the models account for relationships between the parts, which makes it easier to judge whether one thing likely caused another.
You can further improve the accuracy of an AIOps implementation by continually retraining or adjusting the ML models so that their understanding of what's "normal" reflects more recent data. All of this eventually pays off as you avoid false positives and keep alert proliferation down.
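The point about keeping "normal" tied to recent data can be illustrated with a rolling baseline: only the latest readings are kept, so the baseline shifts as the system's behavior changes. The class name, window size, and threshold below are assumptions for this sketch, which stands in for the ML retraining described above rather than reproducing it.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags values that deviate from a baseline built from recent data."""

    def __init__(self, window=50, threshold=3.0):
        self.values = deque(maxlen=window)  # oldest readings fall off
        self.threshold = threshold

    def observe(self, value):
        """Return True if value deviates from the recent baseline,
        then add it to the baseline."""
        is_anomaly = False
        if len(self.values) >= 2:
            mu, sigma = mean(self.values), stdev(self.values)
            is_anomaly = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.values.append(value)
        return is_anomaly


baseline = RollingBaseline(window=5)
readings = [10, 11, 9, 10, 10, 30, 30]
print([baseline.observe(r) for r in readings])
# prints: [False, False, False, False, False, True, False]
```

The first jump to 30 is flagged, but once it has entered the window the second one is not, which is the trade-off of letting the baseline adapt: fewer repeat alerts, at the cost of absorbing a sustained shift quickly.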
Kulasekaran from ManageEngine says it is vital to look for a flexible and scalable AIOps platform that allows customization to match specific business needs and IT environments. This ensures the AIOps solution can handle the scale of an organization's infrastructure and data.
AIOps should also offer customizable alert conditions that can tell you what happens when a deviation is detected, in case you want to do something that the vendor's "out of the box" detection features don't support, adds Swita from New Relic.
Support for Multiple Roles
"All technology has a human dimension, and AIOps is no exception, and so we look for support across more traditional siloed roles, to a growing cross-siloed approach to collaboration, to support for executive IT leadership in key decision making, to reaching out to business stakeholders to better understand how IT services are shaping business outcomes," says Dennis Drogseth from EMA.
Fast Time to Value
"Time to value is a key area for AIOps growth and brand distinction, with a growing number of vendors offering both SaaS and on-premise options, while seeking to streamline observability pipelines and time to analytic relevance, and also providing more out-of-the-box support for use-case priorities and role-based awareness," Drogseth reveals.
Go to: Discovering AIOps - Part 3, discussing the users of AIOps.