Over the last couple of weeks, APMdigest posted a series of blogs about AIOps that included my commentary. In this blog, I present the case against AIOps.
In theory, the ideas behind AIOps features are sound, but the machine learning (ML) systems involved aren't sophisticated enough to be effective or trustworthy.
AIOps is relatively mature today, at least in its current form. The ML models companies use for AIOps tasks work as well as they can, and the features that wrap them are fairly stable and mature. That being said, maturity might be a bit orthogonal to usefulness.
Despite being based on mature tech, AIOps features aren't widely used because they rarely help with the problems people have in practice. It's as if you were struggling to cook a meal and the main challenge was adding all the ingredients at the right time, but someone offered you a better way to chop the vegetables. Does chopping vegetables more efficiently help? Maybe, but it doesn't solve the difficulty of timing your ingredients.
In addition, AIOps adoption is a big challenge for teams. Organizations may be constrained by budget and unable to adopt these features because of their cost. AIOps often comes bundled with several other features, all with a steep learning curve, and very few work as a turnkey solution. It's yet another thing for busy teams to learn, and it is not likely to be high on their priority list.
AIOps Does Not Provide Actionable Insights
AIOps arguably doesn't provide actionable insights. Sure, there are examples of teams reducing false positives and using anomaly detection to identify something worth investigating. Still, teams have been able to reduce false positives and identify uniquely interesting patterns in data long before AIOps, and typically do this today without AIOps features.
For example, you don't need ML models to tell you that a particular measure crosses a threshold. Furthermore, these models work only with past behavior as context. They can't predict future behavior, especially for services with irregular traffic patterns. And it's services with irregular traffic patterns that actually present the most problems (and thus time spent debugging) in the first place.
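To make the threshold point concrete, here is a minimal sketch of a static threshold check in Python. Everything in it is an illustrative assumption rather than any vendor's API: the `Sample` type, the `breaches` function, the 500 ms threshold, and the "three samples in a row" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: int  # hypothetical: position or seconds since some start time
    value: float    # hypothetical: e.g. p99 latency in milliseconds


def breaches(samples, threshold, min_consecutive=3):
    """Return the timestamps at which the value has been above the
    threshold for exactly `min_consecutive` samples in a row."""
    fired = []
    run = 0  # length of the current streak of samples above the threshold
    for s in samples:
        run = run + 1 if s.value > threshold else 0
        # Fire once per streak, at the moment the streak gets long enough.
        if run == min_consecutive:
            fired.append(s.timestamp)
    return fired


# Made-up latency readings, one per time step.
samples = [Sample(t, v) for t, v in enumerate([120, 510, 530, 560, 580, 200, 505])]
print(breaches(samples, threshold=500))
```

Requiring a few consecutive breaches is one simple, non-ML way to avoid firing on a single spike. Picking the threshold and the streak length is still a judgment call for the people who know the service.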
One use case that can be helpful in understanding this problem is analyzing a giant bucket of data that hasn't been organized. When organizations treat operations data as a dumping ground, using an ML model to perform pattern analysis and separate usable from unusable data can be helpful. However, it's only treating a symptom and not the root cause.
And when there are issues that AIOps features can't help identify, you're back to an extremely long time spent figuring out what's wrong in a system.
Facing Your Organizational Issues
The advantages of AIOps are insignificant because AIOps features primarily exist to patch organizational and technical failures. The long-term solution is to invest in your organization and empower your teams to pick quality tools, not be sold the flashy promises of a quick AI fix.
I wouldn't suggest that users go looking for an AIOps-specific provider; they should instead lean on their team's expertise. For these use cases, humans are far better at making critical judgment calls than the ML models on the market today. Deciding what's worth looking at and alerting on is the best possible use of human time.
Most of the problems that AIOps purports to solve are organizational issues. Fix your organizational and technical issues by giving your teams the agency to fix things in the first place.
If you have problems with noise in your data, look at how you generate telemetry and prioritize working to improve it. Lead a culture shift by enforcing the principle that good telemetry is a concern for application developers, not just ops teams.
If your alerts are out of order, have your team look at what they're alerting on and make the necessary adjustments. If you have noisy alerts, talk to the people who are getting alerted to find out why things are too noisy. Take on-call engineers very seriously, poll them regularly, and ensure they're not burning out. Some vendors will try to sell you on ML models that will magically solve alert fatigue, but be cautious: there is no magic, and your problems won't get solved by ML models.
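As a rough illustration of reviewing what you alert on, the sketch below tallies how often each alert rule fired and how often the person who was alerted said it needed action. The input format, the `noisy_rules` function, and both cutoffs are hypothetical assumptions. The "was it actionable" flag is meant to come from asking on-call engineers, not from a model.

```python
from collections import Counter


def noisy_rules(alerts, min_fires=5, max_actionable_ratio=0.2):
    """Return rule names that fired at least `min_fires` times and were
    marked actionable in at most `max_actionable_ratio` of those fires.

    `alerts` is an iterable of (rule_name, was_actionable) pairs, where
    was_actionable is the on-call engineer's own assessment.
    """
    fires = Counter()
    actionable = Counter()
    for rule, was_actionable in alerts:
        fires[rule] += 1
        if was_actionable:
            actionable[rule] += 1
    return sorted(
        rule
        for rule, n in fires.items()
        if n >= min_fires and actionable[rule] / n <= max_actionable_ratio
    )


# Made-up alert history for illustration only.
history = (
    [("disk_full", True)] * 2
    + [("cpu_high", False)] * 9
    + [("cpu_high", True)]
    + [("latency_p99", True)] * 6
)
print(noisy_rules(history))
```

A list like this is only a starting point for the conversation with the team. Whether a flagged rule gets tuned, rerouted, or deleted is still a human decision.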
If your organization doesn't have development teams prioritizing good telemetry, incentivize them to care about it.
LLMs for Observability
Can you tell I'm not particularly bullish on AIOps? I am incredibly bullish on LLMs for observability, though. LLMs do a great job of taking natural language inputs and producing things like queries on data, analyzing data relevant to a query, and generating material that helps teach people how to use a product. We'll uncover more use cases, but right now LLMs are best at reducing toil and lowering the bar to learning how to analyze your production data in the first place.
While I'm not too hopeful about the future of AIOps, I am optimistic about how AI will continue to integrate into operations. LLMs present novel ways for us to interact with systems that were previously impossible. For example, observability vendors are releasing AI features that lower the barrier for developers to access and make the most out of their observability tools. Innovations like this will continue to enhance developer workflows and transform the way we work for the better.