Discovering AIOps - Part 6: Challenges
October 18, 2023

Pete Goldin
APMdigest

Although AIOps offers many advantages, outlined in Part 4 and Part 5 of this blog series, the experts say it also poses challenges for IT teams.

Start with: Discovering AIOps - Part 1

Start with: Discovering AIOps - Part 2: Must-Have Capabilities

Start with: Discovering AIOps - Part 3: The Users

Start with: Discovering AIOps - Part 4: Advantages

Start with: Discovering AIOps - Part 5: More Advantages

Understanding AIOps

"According to our recent survey, 66% of organizations still struggle with understanding the concept of AIOps and its value in modern IT environments," says Bharani Kumar Kulasekaran, Product Manager at ManageEngine.

"The biggest challenge companies need to face head-on is education, and knowing what they're dealing with versus what they aren't," says Charles Burnham, Director, AIOps Engineering at LogicMonitor.

Skills Gap

"AIOps might require specialized skills, such as machine learning and data analysis, which may not be readily available in the market," says Gagan Singh, VP of Product Marketing, Observability, at Elastic.

"On the skills side, IT organizations don't feel equipped to evaluate this stuff. They also feel like they don't know how to use it properly. AIOPs transforms workflows, and some people who are used to highly manual tasks struggled to adjust," adds Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at Enterprise Management Associates (EMA).

"AIOps involves the integration of machine learning algorithms and big data, which can be tricky to understand and implement without a certain level of technical expertise. To mitigate this, it is important for organizations to start with phased implementations, prioritize critical deployment areas, and ensure strong vendor support," advises Bharani Kumar Kulasekaran, Product Manager at ManageEngine.

"Adopting a tool that streamlines signal ingest and model training without requiring a dedicated data science team, accelerates the value of AIOps to the organization," adds Singh, VP of Product Marketing, Observability, at Elastic.

Cultural Shift

"There's a cultural shift that needs to happen when going in this direction so, like all other culture changes, it's a challenge," warns Carlos Casanova, Principal Analyst at Forrester Research.

Integration

"One of the challenges facing buyers seeking to move to AIOps is how and where to integrate their existing legacy monitoring and other tools — as some platforms prioritize displacement, while others prioritize integration, and some seek a more balanced approach to both," counsels Dennis Drogseth, VP at EMA.

Singh from Elastic also points out that disparate data sources can create integration challenges, making obtaining a unified view of operations difficult and reducing the effectiveness of AIOps. Implementing an integrated platform that can connect multiple data sources and provide a unified view can help mitigate that problem.

Security

"Security concerns remain a challenge in AIOps priorities, just as they are currently across all of high technology — both in terms of ensuring that the AIOps solution is itself secure, and in terms of leveraging advanced technology for more predictive, prescriptive, and context-aware insights and actions to promote improved security across the application/infrastructure," says Dennis Drogseth from EMA.

His EMA colleague Shamus McGillicuddy adds, "People are worried that AI poses a security risk because many solutions are cloud-based. They vacuum so much data into the cloud. They are also worried that AI mistakes could open up security vulnerabilities."

AIOps involves the use of sensitive data, which can create security concerns if not managed properly, says Singh from Elastic, who recommends choosing an AIOps solution that can provide the ability to obfuscate any sensitive data.
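
As a rough illustration of what obfuscating sensitive data can mean in practice, the sketch below masks email addresses and IPv4 addresses in a log line before it leaves the environment. It is a minimal sketch, not any vendor's feature: the patterns, the salt, and the function names are assumptions made for this example, and a real policy would cover whatever an organization itself classifies as sensitive.

```python
import hashlib
import re

# Hypothetical patterns chosen for this example only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _token(kind: str, value: str, salt: str) -> str:
    """Swap a value for a short salted hash so equal values still correlate."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:10]
    return f"<{kind}:{digest}>"


def obfuscate(line: str, salt: str = "example-salt") -> str:
    """Mask email addresses and IPv4 addresses in a single log line."""
    line = EMAIL_RE.sub(lambda m: _token("email", m.group(0), salt), line)
    line = IPV4_RE.sub(lambda m: _token("ip", m.group(0), salt), line)
    return line


if __name__ == "__main__":
    print(obfuscate("login failed for alice@example.com from 10.1.2.3"))
```

Hashing rather than deleting the values keeps repeated occurrences of the same user or host linkable for correlation while hiding the raw value. The salt shown here is a placeholder and would need to be kept secret.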

Big Data

"Another challenge organizations face is with processing the volume and variety of data required for an AIOps platform," says Kulasekaran from ManageEngine. "One way to overcome this challenge is to establish robust data management practices, break down data silos, and choose AIOps platforms with strong data integration capabilities."

"It all goes back to ease of use and cost effectiveness," Asaf Yigal, CTO of Logz.io explains. "We feel that using AI to help organizations actively reduce the level of data being sent into the platform — based on what is known to be important or more than likely not to be important — is a great example of using a practical AIOps approach. So rather than saying, here's a mountain of data, go find the needle in the haystack, taking the approach of filtering the data before it gets into the platform. This ensures only the most important insights are elevated, and you're only paying for the data you need. We think that's just a far more pragmatic approach."

Data Quality

Poor data quality can lead to inaccurate predictions and unreliable insights, making it difficult to make informed decisions, says Singh from Elastic. A possible solution is to invest in data standardization and validation to ensure high-quality data feeds your AIOps capabilities.
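
To make data standardization and validation a little more concrete, here is a minimal sketch that checks a single metric record and normalizes it into one shape. The field names, the rule that timestamps without a time zone are treated as UTC, and the lower-casing of host and metric names are assumptions for this example rather than any product's schema.

```python
import math
from datetime import datetime, timezone
from typing import Any

# Hypothetical schema for one telemetry record.
REQUIRED_FIELDS = ("timestamp", "host", "metric", "value")


def normalize_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a cleaned copy of raw, or raise ValueError if it fails validation."""
    missing = [f for f in REQUIRED_FIELDS if f not in raw or raw[f] in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Standardize timestamps to timezone-aware UTC in ISO 8601 form.
    ts = datetime.fromisoformat(str(raw["timestamp"]))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC

    value = float(raw["value"])  # raises ValueError for non-numeric input
    if not math.isfinite(value):
        raise ValueError(f"non-finite value: {raw['value']!r}")

    return {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "host": str(raw["host"]).strip().lower(),
        "metric": str(raw["metric"]).strip().lower(),
        "value": value,
    }
```

Rejecting a bad record at the edge, rather than letting it flow into a model, is one simple way to act on the garbage-in, garbage-out concern.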

"AIOps usually sits on top of disparate monitoring tools, and these tools vary in capability and the data quality they provide, so the biggest challenge we see is a difficulty in making AIOps solutions effective because of the lack of high quality data" explains Spiros Xanthos, SVP and General Manager of Observability at Splunk. "Users who adopt fully integrated observability solutions usually have a much easier time implementing AIOps on top, as the quality of data is much much higher."

Michael Gerstenhaber, VP of Product Management at Datadog, adds, "One of the biggest challenges in trying to adopt AIOps is the lack of observability. AIOps hinges on continuously evaluating a large volume of high-quality telemetry from every part of the stack — not just alerting events. Only by having this data can an AIOps practice derive insights about distributed systems, root causes of degradation, or forecast future performance."

Building Anomaly Detection

"One challenge is that in building anomaly detection, much of the current AIOps tooling uses unsupervised machine learning on data to create baselines, clustering, and other aggregations. The frequency of anomalies creates toil for engineering teams, so anomaly detection doesn't end up working very well. So, moving from this pure contextual-based intelligence to something that is actively informed by the experiences of human users is a big deal, and finding the tools that can do that for you effectively is still a relatively new consideration," says Yigal from Logz.io.

Training Models

Vendors spend a lot of time training their AI solutions on general IT concepts before rolling them out to customers. Each solution then needs some time to learn the individual environment, according to Shamus McGillicuddy from EMA.

"Unfortunately, more often than not, companies think that tools like this will work straight out of the box, but it's important that they are being fine tuned and given feedback before they're able to capture the right information," says Burnham from LogicMonitor.

"At least today, the work still needs to be done by humans," notes Camden Swita, Senior Product Manager at New Relic. "Most of the cost of training your own model will almost certainly be in the preparation/pre-processing of data that isn't already in a well-organized database and in providing an in-training or newly-trained model with input to adjust its weights (RLHF, etc.). Both of these tasks are still typically done by humans and take much time to do right."

"That said, I think we will see the costs associated with training, or at least tuning, models on your own data start to go down," Swita continues. "And soon it may even be possible to use ML to execute more of the pre-training and post-training work."

So how long does it take to train models for AIOps?

Shamus McGillicuddy from EMA says, "Based on my conversations, this training time on an individual network can take anywhere from a week to a couple months, depending on the vendor and the size of the enterprise."

"In past research we have seen time-to-value range from a few days to many months for training analytic models to fit specific IT and business environments. However, in recent data the most prevalent answer was one-to-two weeks for addressing an environment with 5,000 managed entities," adds Dennis Drogseth from EMA.

"This can vary considerably for the same reasons as any other model training exercise. It's ultimately about the maturity of the organization with regard to their data quality and process effectiveness. The concept of garbage-in, garbage-out still holds true," Carlos Casanova from Forrester points out.

"The other aspect is whether the organization is willing to invest enough of the right resources to help train it," Casanova continues. "The straight-up technical data from the discovery tools should be far more reliable than manually entered and updated data, so the discovery data should enable an organization to get started fairly quickly with some limited insights that can't be 'discovered' automatically."

Yigal from Logz.io concludes, "The more time the model has to train at the hands of the user, get smarter context, and build an understanding of the environment, the greater the value and precision. The balance is trying to provide real value from the start as the system is being informed by the real experts."

Trust

"Many IT pros don't trust this stuff. It's understandable," says Shamus McGillicuddy from EMA.

"Similar to AI, trust in AIOps can be a concern," Kulasekaran from ManageEngine acknowledges. "IT professionals might be hesitant to rely fully on AI-driven insights for critical decision-making."

Kulasekaran adds that building trust involves transparently explaining how AIOps algorithms work, validating their accuracy, and showcasing successful use cases. Organizations will trust AIOps more once they start seeing actual results after implementation.

It also helps to start small and take a strategic approach to adopting AIOps into IT operations. To help build trust over time, organizations should avoid siloed data management processes and push for data-driven results.

Yigal from Logz.io adds, "It's important to be wary and ask vendors questions about what data they trained on, how they built their models, the role of the human in the process (AI as advisor), and how they'll support customers in real-life scenarios."

APMdigest delves a little deeper into the AI trust issue in Part 7 of this blog series.

Meeting Expectations

"The biggest challenge is having the chosen AIOps platform meet the myriad of expectations across the business units," says Thomas LaRock, Principal Developer Evangelist at Selector. "Many AIOps platforms offer little more than some rules-based recommendations, baselining, and are focused on only one part of the entire application stack. This leads to confusion and frustration after the platform is deployed, as different teams start wondering why they do not see the benefit of AIOps. To combat this, you should always do a thorough POC before making any purchase decision. Every department should be allowed to provide input and ask questions about the platform, how it works, and how it will provide direct value to their team."

Brian Emerson, VP & GM, IT Operations Management at ServiceNow, says, "In the latest research from TechTarget's Enterprise Strategy Group, they found that 55% of organizations with observability practices use AIOps. The same study also revealed that just 40% of AIOps tool users report that AIOps has simplified operations to the point where they have freed up resources and expanded opportunities."

"The biggest challenge for organizations is expecting too much. Organizations are still seeking silver bullets in generalized predictive models and it's a fool's errand," Heath Newburn, Distinguished Field Engineer at PagerDuty admonishes.

Gaining Actionable Insights

"Actionable Insights are the crux of the problem for AIOps," explains Newburn from PagerDuty. "Even if the solution could deliver 100% certainty of the root cause, in many cases it is a random error message. MSSQL Error 2006. Now what? What does this mean? Is it customer impacting? Is an SLA about to be breached? Where is the runbook? What diagnostics are needed?"

Phillip Carter, Principal Product Manager at Honeycomb, responds, "AIOps arguably doesn't provide actionable insights. Sure, there are examples of teams reducing false positives and using anomaly detection to identify something worth investigating. Still, teams have been able to reduce false positives and identify uniquely interesting patterns in data long before AIOps, and typically do this today without AIOps features."

"AIOps can help get to the underlying cause, and many solutions say they deliver actionable intelligence, but where is the action?" Newburn from PagerDuty continues. "So the potential certainly exists, especially in domain-agnostic solutions that can integrate across multiple tools and sources, but it requires a comprehensive approach. Creating this broader picture that would allow non-experts to rapidly gain situational awareness to address the issue, and/or have it auto-remediated so no intervention is required is the true promise of AIOps."

Cost

"AIOps adoption itself is the biggest challenge for teams," says Carter from Honeycomb. "Organizations may be constrained by their budget and cannot implement due to the feature's cost."

"The big question is, of course, 'How does this save me time and money?'" Yigal from Logz.io points out. "If the system is something that actually introduces more complexity or cost, practically or in terms of how difficult it is to use, then obviously that is not a huge help, even if there is some promise of new or interesting outcomes."

Dennis Drogseth from EMA explains that IT buyers must face the costs and overhead of AIOps, not only in purchase and deployment but also in ongoing process and organizational evolution. If deployed wisely, however, AIOps can ultimately save on IT costs dramatically once human as well as technological factors are understood.

Go to: Discovering AIOps - Part 7: The Current State of AIOps

Pete Goldin is Editor and Publisher of APMdigest