Over the last couple of weeks, APMdigest has posted a series of blogs about AIOps that include my commentary. In this blog, I present the case against AIOps.
In theory, the ideas behind AIOps features are sound, but the machine learning (ML) systems involved aren't sophisticated enough to be effective or trustworthy.
AIOps is relatively mature today, at least in its current form. The ML models companies use for AIOps tasks work about as well as they can, and the features that wrap them are fairly stable. That said, maturity might be a bit orthogonal to usefulness.
Despite being based on mature tech, AIOps features aren't widely used because they don't often seem to help with the problems people have in practice. It's as if you were struggling to cook a meal where the main challenge was adding the ingredients at the right time, and someone offered you a better way to chop the vegetables. Does chopping vegetables more efficiently help? Maybe, but it doesn't solve the difficulty of timing your ingredients.
In addition, AIOps adoption is a big challenge for teams. Organizations may be constrained by budget and unable to implement AIOps because of its cost. AIOps often comes bundled with several other features, all with a high learning curve, and very few work as turnkey solutions. It's yet another thing for busy teams to learn, and it's not likely to be high on their priority list.
AIOps Does Not Provide Actionable Insights
AIOps arguably doesn't provide actionable insights. Sure, there are examples of teams reducing false positives and using anomaly detection to identify something worth investigating. Still, teams were able to reduce false positives and spot interesting patterns in their data long before AIOps existed, and most still do so today without AIOps features.
For example, you don't need an ML model to tell you that a particular metric has crossed a threshold. Furthermore, these models work only with past behavior as context. They can't predict future behavior, especially for services with irregular traffic patterns. And services with irregular traffic patterns are exactly the ones that cause the most problems (and thus debugging time) in the first place.
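To make that concrete, here is a minimal sketch of the kind of static threshold check that catches a large share of real problems without any ML. The metric name is invented, and fetch_latest_value() and send_alert() are hypothetical stand-ins for whatever your metrics store and paging tool actually provide.

```python
# Minimal sketch: a static threshold alert needs no ML.
# fetch_latest_value() and send_alert() are hypothetical stand-ins
# for your metrics backend and your paging/notification tool.

THRESHOLD_P99_LATENCY_MS = 500  # chosen by the team, not inferred by a model


def fetch_latest_value(metric_name: str) -> float:
    """Hypothetical: read the most recent datapoint from your metrics store."""
    raise NotImplementedError


def send_alert(message: str) -> None:
    """Hypothetical: page the on-call engineer or post to a channel."""
    raise NotImplementedError


def check_latency_threshold() -> None:
    latest = fetch_latest_value("checkout.p99_latency_ms")
    if latest > THRESHOLD_P99_LATENCY_MS:
        send_alert(
            f"checkout p99 latency is {latest:.0f}ms "
            f"(threshold {THRESHOLD_P99_LATENCY_MS}ms)"
        )
```

Run something like this on a schedule with the monitoring tools you already have and the case is covered without a model in sight.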
One use case that helps illustrate this problem is analyzing a giant bucket of data that hasn't been organized. When organizations treat operations data as a dumping ground, using an ML model to perform pattern analysis and separate usable from unusable data can help. But that only treats a symptom, not the root cause.
And when an issue comes up that AIOps features can't help identify, you're back to spending an extremely long time figuring out what's wrong with your system.
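For illustration, the "pattern analysis on a dumping ground" described above often boils down to something like clustering free-text log lines. This rough sketch uses scikit-learn on toy data (my choice for illustration, not necessarily what any vendor uses); it groups similar messages, but it can't tell you why any group matters, which is exactly the symptom-versus-root-cause problem.

```python
# Rough sketch: clustering unstructured log lines to surface patterns.
# The toy log lines are invented for the example.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

log_lines = [
    "connection timeout to payments-db after 30s",
    "connection timeout to payments-db after 30s",
    "user 4821 logged in",
    "user 9130 logged in",
    "disk usage at 91% on node-7",
]

# Turn each line into a TF-IDF vector, then group similar lines together.
features = TfidfVectorizer().fit_transform(log_lines)
clusters = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(features)

for cluster, line in zip(clusters, log_lines):
    print(cluster, line)
```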
Facing Your Organizational Issues
The advantages of AIOps are insignificant because AIOps features primarily exist to patch organizational and technical failures. The long-term solution is to invest in your organization and empower your teams to pick quality tools, rather than being sold the flashy promise of a quick AI fix.
I wouldn't suggest going looking for an AIOps-specific provider; instead, lean on your team's expertise. For these use cases, humans are far better at making critical judgment calls than the ML models on the market today. Deciding what's worth looking at and alerting on is one of the best possible uses of human time.
Most of the problems that AIOps purports to solve are organizational issues. Fix your organizational and technical issues by giving your teams the agency to fix things in the first place.
If you have problems with noise in your data, look at how you generate telemetry and prioritize working to improve it. Lead a culture shift by enforcing the principle that good telemetry is a concern for application developers, not just ops teams.
If your alerts aren't working, have your team look at what they're alerting on and make the necessary adjustments. If you have noisy alerts, talk to the people who are getting alerted to find out why things are too noisy. Take your on-call engineers seriously, check in with them regularly, and make sure they're not burning out. Some vendors will try to sell you ML models that will magically solve alert fatigue, but be cautious: there is no magic, and your problems won't get solved by ML models.
If your organization doesn't have development teams prioritizing good telemetry, incentivize them to care about it.
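As one illustration of what treating good telemetry as a developer concern can look like, here is a small sketch using the OpenTelemetry Python API. The span name, attribute names, and process_order() are made up for the example, and SDK/exporter configuration is omitted; the point is that the developer who writes the code also attaches the context that makes the data useful later.

```python
# Sketch: the application developer attaches meaningful attributes at the
# source, so the data is queryable later without ML-based cleanup.
# Span and attribute names are illustrative, not a standard.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")


def process_order(order):
    with tracer.start_as_current_span("checkout.process_order") as span:
        span.set_attribute("order.id", order["id"])
        span.set_attribute("order.item_count", len(order["items"]))
        span.set_attribute("payment.provider", order["payment_provider"])
        # ... business logic goes here ...
```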
LLMs for Observability
Can you tell I'm not particularly bullish on AIOps? I am incredibly bullish on LLMs for observability, though. LLMs do a great job of taking natural language input and producing queries over your data, analyzing the data relevant to a question, and generating material that helps teach people how to use a product. We'll uncover more use cases, but right now LLMs are best at reducing toil and lowering the bar to learning how to analyze your production data in the first place.
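As a concrete, heavily simplified example of the natural-language-to-query use case, here is a sketch using the OpenAI Python client. The model name, system prompt, and the choice of PromQL as the target query language are assumptions for illustration; a real observability feature would validate the generated query before running it.

```python
# Simplified sketch: natural language in, a query over your data out.
# Model, prompt, and query dialect are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Translate the user's question into a single PromQL query. "
    "Return only the query, with no explanation."
)


def question_to_query(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()


print(question_to_query("What was the p99 latency of the checkout service last hour?"))
```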
While I'm not too hopeful about the future of AIOps, I am optimistic about how AI will continue to integrate into operations. LLMs give us novel ways to interact with systems, ways that weren't possible before. For example, observability vendors are releasing AI features that lower the barrier for developers to access and get the most out of their observability tools. Innovations like this will continue to enhance developer workflows and change the way we work for the better.