Observability: The Next Frontier for AIOps
September 24, 2020

Will Cappelli
Moogsoft

Enterprise ITOM and ITSM teams have been welcoming of AIOps, believing that it has the potential to deliver great value to them as their IT environments become more distributed, hybrid and complex. Not so with DevOps teams.

It's safe to say they've kept AIOps at arm's length, because they don't think it's relevant or useful for what they do. Instead, to manage the software code they develop and deploy, they've focused on observability.

In concrete terms, this means that for your typical DevOps pros, if the app delivered to their production environment is observable, that's all they need. They're skeptical of what, if anything, AIOps can contribute in this scenario.

This blog will explain why AIOps can help DevOps teams manage their environments with unprecedented accuracy and velocity, and outline the benefits of combining AIOps with observability.


AIOps: Room to Grow its Adoption and Functionality

In truth, there isn't one universally effective set of metrics that works for every team to measure the value that AIOps delivers. This is an issue not just for AIOps but for many ITOM and ITSM technologies as well. In fact, many enterprise IT teams who invested in AIOps in recent years are now carefully watching their deployments to assess their value before deciding whether or not to expand on them.

Still, there's a lot of room for AIOps adoption to grow, because there are many enterprises that haven't adopted it at all. That's why many vendors are trying to position themselves as AIOps players, to be part of a growing market. For this reason, the AIOps market has now gotten crowded.

So how can AIOps as a practice innovate and evolve at this point? Which innovations can deliver capabilities that set an AIOps offering apart from the existing pack? The way to do this is to tailor, expand and apply AI functionality to observability data. Such a solution would appeal strongly to the DevOps community and dissolve its historical reluctance and skepticism toward AIOps.

But What is Observability?

However, there's an issue. When you press DevOps pros a little and ask them what observability is, you get three very different answers. The first is that observability is nothing more than traditional monitoring applied to a DevOps environment and toolset. This is flat-out wrong.

Another meaning you'll hear given to observability is its traditional one: That it's a property of the system being monitored. In other words, observability isn't about the technology doing the monitoring or the observing, but rather it's the self-descriptive data a system generates.

According to this definition, people monitoring these systems can obtain an accurate picture of the changes occurring in them and of their causal relationships. However, it's clear that this view of observability, while related to the first one, is a dead end. It's just a stream of raw data and nothing else.

A third definition is that, compared with traditional monitoring, observability is a fundamentally different way of looking at and getting data from the environment being managed. And it needs to be, because the DevOps world is one of continuous integration, continuous delivery and continuous change — a world that's highly componentized and dynamic.

The way traditional monitoring tools take data from an environment, filter it, and generate events isn't appropriate for DevOps. You need to observe changes that happen so quickly that trying to fit the data into any kind of pre-arranged structure just falls short. You won't be able to see what's going on in the environment.

Instead, DevOps teams need to access the raw data generated by their toolset and environment, and perform analytics directly on it. That raw data is made up of metrics, traces, logs and events. So observability is indeed a revolution, a drastic shift away from all the pre-built filters and the pre-packaged models of traditional monitoring systems.

This definition is the one that offers the most potential for technological innovation and for delivering value through AIOps, because DevOps teams do need help to make sense of this raw data stream and act on it.

AI analysis and automation applied to observability can deliver this assistance to DevOps teams. Such an approach would take the raw data from the DevOps environment and give DevOps practitioners an understanding of the systems that they're developing and delivering.

With these insights, DevOps teams can more effectively decide on actions to fix problems, or to improve performance.

So what's involved in combining AIOps and observability?

Metrics, traces, logs and events must first be collected and analyzed. Metrics capture the temporal dimension of what's happening through their time-series data. Traces map a path through a topology, so they provide a spatial dimension: a trace is a chain of execution across different system components, usually microservices. Logs and events provide an unstructured record of what occurred.
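
To make the four data types concrete, here is a minimal Python sketch of how they might be represented. The class names, fields and the `trace_path` helper are illustrative assumptions, not the schema of any particular tool, and the helper assumes a simple linear trace in which each span has at most one child.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetricPoint:
    """One time-series sample: the temporal dimension (hypothetical schema)."""
    service: str
    name: str
    timestamp: float
    value: float


@dataclass
class Span:
    """One hop in a trace: the spatial dimension (hypothetical schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the entry point of the trace
    service: str


@dataclass
class LogEvent:
    """One unstructured log line or event (hypothetical schema)."""
    service: str
    timestamp: float
    message: str


def trace_path(spans: List[Span]) -> List[str]:
    """Follow parent links to recover the chain of services a request crossed.

    Assumes a linear trace: each span has at most one child.
    """
    child_of: Dict[Optional[str], Span] = {s.parent_id: s for s in spans}
    path: List[str] = []
    current = child_of.get(None)  # start at the span with no parent
    while current is not None:
        path.append(current.service)
        current = child_of.get(current.span_id)
    return path


# Spans arrive in arbitrary order; the parent links restore the path.
spans = [
    Span("t1", "c", "b", "payments-db"),
    Span("t1", "a", None, "checkout"),
    Span("t1", "b", "a", "payments"),
]
print(trace_path(spans))  # ['checkout', 'payments', 'payments-db']
```

The recovered path is what gives a trace its spatial meaning: it says which components sit next to each other for a given request.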

With AIOps analysis, metrics reveal anomalies, traces show topology-based microservice relationships, and unstructured logs and events provide the foundation for triggering a significant alert.
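
As a rough illustration of how metrics can reveal anomalies, the sketch below flags samples that deviate sharply from a rolling baseline using a z-score. This is one simple statistical technique chosen for the example, not the method of any specific AIOps product; the function name, window size and threshold are assumptions.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(values: List[float], window: int = 5,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is far from the preceding `window` samples.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the samples immediately before it.
    """
    anomalies: List[int] = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 400, 101, 100]
print(find_anomalies(latency_ms))  # [7]
```

Normal jitter stays below the threshold, while the spike to 400 ms stands out against its recent baseline.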

Machine learning algorithms would then come into play to indicate an uncommon occurrence, pinpoint unusual metrics, traces, logs and events, and correlate them using temporal, spatial and textual criteria. The next step in the process would be the identification of a probable root cause of the problem, based on the history of previously resolved incidents. Then, ideally, automated remedial actions would be carried out.
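
The correlation and root-cause steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Alert` shape, the 60-second window, word overlap (Jaccard similarity) as the textual criterion, service adjacency as the spatial criterion, and a lookup against past incidents as a stand-in for a learned root-cause model.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds
    service: str
    text: str


def jaccard(a: str, b: str) -> float:
    """Textual similarity: shared words divided by all distinct words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def related(a: Alert, b: Alert, edges: Set[Tuple[str, str]],
            max_gap: float = 60.0, min_text: float = 0.2) -> bool:
    """Two alerts are related if close in time AND (adjacent or similar)."""
    temporal = abs(a.timestamp - b.timestamp) <= max_gap
    spatial = (a.service == b.service
               or (a.service, b.service) in edges
               or (b.service, a.service) in edges)
    textual = jaccard(a.text, b.text) >= min_text
    return temporal and (spatial or textual)


def correlate(alerts: List[Alert],
              edges: Set[Tuple[str, str]]) -> List[List[Alert]]:
    """Greedily group alerts into clusters of related alerts."""
    clusters: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda x: x.timestamp):
        for cluster in clusters:
            if any(related(alert, member, edges) for member in cluster):
                cluster.append(alert)
                break
        else:
            clusters.append([alert])
    return clusters


def probable_cause(cluster: List[Alert],
                   history: List[Tuple[str, str]]) -> Optional[str]:
    """Pick the cause of the most textually similar resolved incident."""
    text = " ".join(alert.text for alert in cluster)
    best = max(history, key=lambda item: jaccard(text, item[0]), default=None)
    return best[1] if best else None


# Hypothetical topology (from traces), alerts, and incident history.
edges = {("checkout", "payments"), ("payments", "payments-db")}
alerts = [
    Alert(0, "checkout", "timeout calling payments"),
    Alert(20, "payments", "database connection pool exhausted"),
    Alert(30, "payments-db", "connection pool exhausted"),
    Alert(5000, "search", "index rebuild slow"),
]
history = [
    ("connection pool exhausted on database", "connection leak in payments"),
    ("disk full on node", "log rotation disabled"),
]
clusters = correlate(alerts, edges)
print([len(c) for c in clusters])            # [3, 1]
print(probable_cause(clusters[0], history))  # connection leak in payments
```

The three related alerts collapse into one incident, and the isolated search alert stays separate. A production system would replace the greedy grouping and word-overlap lookup with trained models, but the inputs and outputs have the same shape.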

Clearly, this combination of AIOps and observability would offer tremendous value to DevOps teams, as it would automate the detection, diagnosis and remediation of problems with the speed and accuracy required in their CI/CD environments. This would represent a breakthrough for AIOps: earning the appreciation of reluctant DevOps teams by giving them deep insights into observability data and unparalleled visibility into their environments.

Will Cappelli is Field CTO at Moogsoft