I've been writing this series of articles to show how achieving continuous AIOps is analogous to developing a level 5 autonomous vehicle. So far, we've looked at detecting system anomalies and removing false alarms (level 1), plus automating root-cause analysis across modern technology stacks (level 2).
Attaining these first two levels is now table stakes for IT Operations. That's because traditional break-fix approaches no longer scale in the cloud, and modern systems are more unpredictable. So, like their vehicular counterparts, level 1 and 2 AIOps systems are essential for detecting operational blind spots and automating the analysis of complex performance conditions.
Now let's look at the level where AIOps gets really exciting – level 3.
The jump in capability between level 2 and level 3 AIOps is massive compared to that between level 1 and 2. Again, it's analogous to what's happening in the development of autonomous cars.
In a level 3 car, we have what's commonly referred to as conditional automation. Here, the vehicle is capable of taking full driving control (aka scary hands-off-the-wheel kind of stuff) during parts of a journey under certain operating conditions. One example is the Audi AI Traffic Jam Pilot (TJP) in the new A8 sedan, a system that allows completely hands-free driving, albeit on a freeway, under 37mph, with no pedestrians or traffic lights, solid center markers, and clear lane markings. Super stuff for dealing with all that gridlock angst on the morning commute.
But achieving conditional automation requires some serious smarts. For example, the Audi TJP system includes constant traffic map monitoring, 12 ultrasonic parking sensors, four 360-degree cameras, mid- and long-range sensors, plus some new-fangled laser tech – presumably designed for radar support and not for you to take your frustration out on other vehicles during rush hour.
Level 3 AIOps requires similar smarts and correlation. With complex conditions, it's not enough to focus on narrow situational slices. Rather, advanced systems should be capable of ingesting a wide variety of data sources (including, but not limited to, structured and unstructured logs, traces, and alarms) into a single data lake, and then correlating and prioritizing based on business-impacting conditions. For example, just as a level 3 autonomous car would avoid a barrier crash when lane markings have faded (because it also has proximity sensors), level 3 AIOps will be equally dextrous – continually analyzing and correlating across multiple domains to determine how emerging patterns may impact critical services supporting a business.
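To make the "correlate, then prioritize by business impact" idea a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular AIOps product works: the `Event` type, the `BUSINESS_WEIGHT` table, the 60-second window and the scoring formula are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical source type: "log", "trace" or "alarm"
    service: str      # service the event relates to
    timestamp: float  # seconds since some reference point
    severity: int     # 1 (low) to 5 (critical)


# Assumed business weights; a real system would derive these from a service model.
BUSINESS_WEIGHT = {"checkout": 5, "search": 3, "internal-wiki": 1}


def correlate(events, window=60.0):
    """Group events per service; a gap longer than `window` starts a new incident."""
    by_service = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        by_service[event.service].append(event)

    incidents = []
    for service_events in by_service.values():
        current = [service_events[0]]
        for event in service_events[1:]:
            if event.timestamp - current[-1].timestamp <= window:
                current.append(event)
            else:
                incidents.append(current)
                current = [event]
        incidents.append(current)
    return incidents


def priority(incident):
    """Score = business weight x peak severity x number of distinct data sources."""
    weight = BUSINESS_WEIGHT.get(incident[0].service, 1)
    peak_severity = max(e.severity for e in incident)
    distinct_sources = len({e.source for e in incident})
    return weight * peak_severity * distinct_sources


def prioritize(events, window=60.0):
    """Return correlated incidents, highest business impact first."""
    return sorted(correlate(events, window), key=priority, reverse=True)
```

The design choice worth noticing is that a single critical alarm on a low-value service ranks below a cluster of moderate signals from several sources on a revenue-critical one, which is the cross-domain behavior described above.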
These types of systems have a profound effect on the IT operations function. Even as these systems begin to allow a "hands-free" monitoring approach, staff effectiveness and efficiency are increasing significantly. Now, the heavy lifting of alarm analysis and root-cause analysis is being conducted by AIOps platforms, while the system augments staff with advanced visualization and automated workflows. Just as a level 3 car constantly informs the driver and gives back control, level 3 AIOps systems automate with a similar purpose – taking away monitoring drudgery, but giving more control through deeper operational insights.
Take, for example, the AIOps level 3 capability presented in the diagram below. Very different from traditional alarm notification consoles, the system here is surfacing analytical insights via a richer contextual interface. Engineers can immediately see predicted risk and availability scores at a glance. Notice too how the system automatically guides staff to emerging high-priority conditions, meaning staff are always focused on prevention, not remediation.
AIOps level 3 automation capabilities can also be extremely useful when incorporated into runbook processing. Recovery effectiveness will often depend on other teams (e.g. development) providing well-documented instructions, but that isn't always top of mind. What's essential, therefore, is injecting AIOps automation into runbooks, so that evidence gathering, correlation and recovery workflows are performed seamlessly.
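As a rough picture of what an automated runbook could look like, the Python sketch below chains the three stages mentioned above (evidence gathering, correlation, recovery) as plain functions over a shared context. Everything here is hypothetical: the step names, the hard-coded evidence, the 5% error-rate threshold and the "rollback" action are stand-ins for whatever a real platform and team would supply.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookContext:
    service: str
    evidence: dict = field(default_factory=dict)
    diagnosis: str = ""
    actions: List[str] = field(default_factory=list)


Step = Callable[[RunbookContext], None]


def gather_evidence(ctx: RunbookContext) -> None:
    # Stand-in for querying logs, metrics and deployment history (values are made up).
    ctx.evidence = {"error_rate": 0.12, "recent_deploy": True}


def correlate(ctx: RunbookContext) -> None:
    # Assumed rule: elevated errors shortly after a deploy point to a regression.
    if ctx.evidence.get("recent_deploy") and ctx.evidence.get("error_rate", 0.0) > 0.05:
        ctx.diagnosis = "regression-after-deploy"
    else:
        ctx.diagnosis = "unknown"


def recover(ctx: RunbookContext) -> None:
    # Only record the intended action; a real workflow would call out to tooling.
    if ctx.diagnosis == "regression-after-deploy":
        ctx.actions.append(f"rollback {ctx.service}")
    else:
        ctx.actions.append(f"escalate {ctx.service} to on-call")


def run_runbook(service: str, steps: List[Step]) -> RunbookContext:
    """Run each step in order, passing along what earlier steps found."""
    ctx = RunbookContext(service=service)
    for step in steps:
        step(ctx)
    return ctx


result = run_runbook("checkout", [gather_evidence, correlate, recover])
print(result.diagnosis, result.actions)
```

Because each step only reads and writes the shared context, a development team could contribute or swap a single step without rewriting the whole recovery procedure.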
In the next blog, we continue our self-driving app journey by looking at level 4 – self-healing operations – a state where high levels of automation really start to kick in, helping systems self-correct, heal and recover. While there are no level 4 cars in production, advanced AIOps is certainly available in an IT ops context, and I'll be discussing some surprising and valuable use cases.
Read the next blog, Dude, Where's My Self-Driving App? - AIOps Level 4 - Self-Healing Applications.
In the meantime, if you want to learn more, tune in to a replay of the industry's first AIOps virtual summit.