Artificial Intelligence (AI) has gone from a theory considered science fiction to reality. It surrounds us: whether we are talking to our digital assistants, finding the fastest way home or taking photos of our loved ones, AI has become a part of our everyday lives.
Enterprise applications powered by artificial intelligence are on the rise and are truly differentiating the early adopters. IT Operations teams in particular see great benefits from artificial intelligence and machine learning capabilities, as these new trends scale operational excellence beyond human capabilities.
Leveraging the Power of AI and Machine Learning
Cloud-based service-oriented architectures bring flexibility and unprecedented scalability, but they also present a real challenge when it comes to monitoring such environments. The old era of monitoring relied on thresholds and dashboards: if a server or a service is down, a red light comes on. But modern applications, with built-in resilience, can have tens of services showing red yet still provide a superb end-user experience. The standard dashboards we've become accustomed to no longer show us the entire picture or, most importantly, the impact on end-user experience.
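To make the gap between red lights and real impact concrete, here is a minimal sketch. The service names and health snapshot are hypothetical, and it assumes (as a deliberate simplification) that users are affected only when every redundant instance of a service is down.

```python
from typing import Dict, List

# Hypothetical snapshot: each service runs several redundant instances.
# True means the instance is healthy, False means it is down.
INSTANCES: Dict[str, List[bool]] = {
    "checkout": [True, False, False],
    "catalog": [True, True, False],
    "payments": [True, False, True],
}


def red_lights(instances: Dict[str, List[bool]]) -> int:
    """Classic dashboard view: count every instance that is down."""
    return sum(
        1
        for replicas in instances.values()
        for healthy in replicas
        if not healthy
    )


def unavailable_services(instances: Dict[str, List[bool]]) -> List[str]:
    """Simplified impact view: a service is unavailable to users
    only when none of its instances is healthy."""
    return sorted(
        name for name, replicas in instances.items() if not any(replicas)
    )


print("Red lights on the dashboard:", red_lights(INSTANCES))
print("Services unavailable to users:", unavailable_services(INSTANCES))
```

In this snapshot the dashboard shows four red lights, yet no service is actually unavailable, which is the situation described above.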
When faced with the challenges of monitoring modern applications, most app owners tend to see AI as the Holy Grail that reveals the truth. With the variety of tools already available today (deep neural network libraries, pre-trained recognition models, and online machine learning frameworks backed by well-known names), implementing AI may seem fairly straightforward.
The obvious starting point is to put all the available data into one heap and create a unified data lake. The next step is to run the available tools, which excel at identifying cats, dogs and your murmuring after waking up every morning, on that data lake. Voilà, the magic happens, and your dashboard lights go red only when a customer is impacted, based on all the unpleasant moments logged in the past.
Machine learning methods share a common approach. What makes the difference is the underlying data we would like the method to learn from; that is the key to success. It is not just about the quantity of data, but also the quality. Just like at school: the better the lectures are, the easier the exam is to pass. To be successful, the training set must include a variety of the situations that can happen, along with their proper classification, meaning what's good and what's not.
Of course, the training set will never list all of the possible situations that can occur. Thus, we expect that our model will learn a sort of abstraction to be able to properly classify these new situations. It's this ability to generalize to new data that's the real challenge.
Running general-purpose machine learning tools on typical IT Ops data, such as traces, logs and metrics, can quickly deliver anomaly detection precision over 70% with an error rate (false positives and negatives) under 15%, after a few tweaks. These are pretty impressive results, unachievable with standard dynamic thresholding, for example. Still, this is far from production-ready, as IT Ops routinely works with confidence levels around 98%.
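For readers who want to see how such figures are derived, here is a minimal sketch. The counts are made up for illustration (they are not measurements), and it assumes the error rate means false positives plus false negatives divided by all evaluated intervals.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Outcome counts for an anomaly detector over evaluated intervals."""
    true_positives: int   # real anomalies that were flagged
    false_positives: int  # normal intervals flagged as anomalies
    true_negatives: int   # normal intervals correctly left alone
    false_negatives: int  # real anomalies that were missed


def precision(c: DetectionCounts) -> float:
    """Share of raised alerts that were real anomalies."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def error_rate(c: DetectionCounts) -> float:
    """Share of all intervals that were misclassified (assumed definition)."""
    total = (c.true_positives + c.false_positives
             + c.true_negatives + c.false_negatives)
    return (c.false_positives + c.false_negatives) / total if total else 0.0


# Hypothetical counts chosen only to illustrate the arithmetic.
example = DetectionCounts(true_positives=150, false_positives=50,
                          true_negatives=740, false_negatives=60)

print(f"precision:  {precision(example):.0%}")
print(f"error rate: {error_rate(example):.0%}")
```

With these illustrative counts, one alert in four is still a false alarm, which shows why results in this range fall short of what IT Ops teams expect.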
So how can we get the remaining 25-30% of accuracy? The parallel with students in school still works: if all our students (models) fail to deliver exam results over 70%, regardless of how long they are trained, the true problem is that they do not understand what they are learning. The same holds true for AI and machine learning tools.
We need to help the models understand hidden relations in the data, relations that might not be needed to recognize pictures or understand speech. In IT Ops data, these additional relations are typically tied to the underlying microservices organization, invocation chaining or network structures. However, the correlation that machine learning methods are very good at identifying is not the causation that brings real predictive capability.
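As a rough illustration of how structural knowledge can help, consider the sketch below. The call graph, the service names and the heuristic itself are assumptions made for this example; this is not the method used in our products. The idea is that when several services look anomalous at the same time, the invocation chain can point to the one that has no anomalous dependency of its own.

```python
from typing import Dict, List, Set

# Hypothetical invocation chain: each service maps to the services it calls.
CALLS: Dict[str, List[str]] = {
    "web-frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}


def likely_root_causes(calls: Dict[str, List[str]],
                       anomalous: Set[str]) -> List[str]:
    """Return the anomalous services that call no other anomalous service.

    Services whose callees are also anomalous are treated as symptoms:
    their anomaly is assumed to be inherited from further down the chain.
    """
    return sorted(
        service
        for service in anomalous
        if not any(callee in anomalous for callee in calls.get(service, []))
    )


# All three services show correlated anomalies at the same time.
anomalous_now = {"web-frontend", "checkout", "payments"}
print(likely_root_causes(CALLS, anomalous_now))
```

A purely statistical detector would report three correlated anomalies here. Adding the invocation chain narrows them down to "payments" as the candidate cause, with the other two treated as consequences.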
We at CA have recognized this opportunity and have a dedicated team invested in challenging the boundaries of existing methods. Some problems require truly out-of-the-box thinking and involve mathematical research. This research has now become a part of our patent-pending technologies in CA Digital Operational Intelligence, integrated within CA Application Performance Management.
To learn more about our CA APM features backed by machine learning in CA Digital Operational Intelligence, be sure to join our upcoming AIOps Virtual Summit on June 20.