APM Dude, Where's My Self-Driving App? - AIOps Level 4 - Self-Healing Applications
October 10, 2018

Kieran Taylor
CA Technologies

In this series of blogs, I've discussed how achieving self-driving intelligent applications with AIOps is a journey, not a destination. Thus far, we've explored the first three stages of that journey, comparing each to the levels of automation in self-driving cars.

Start with Dude, Where's My Self-Driving App? - Level 1 - AIOps Anomaly Detection and Algorithmic Noise Reduction

Continue with Dude, Where's My Self-Driving App? - AIOps Level 2 - Automated Root-Cause

Then read Dude, Where's My Self-Driving App? - AIOps Level 3 - Unified Visualization, Correlation and Workflow

Now let's hop into the seat of a highly automated car (level 4) and compare it to what we can expect with more advanced AIOps systems.

According to SAE International (the Society of Automotive Engineers), a level 4 car should be able to drive itself safely even if the driver does not respond when asked to intervene. It's a "mind off the road" kind of deal, with the driver able to safely go to sleep or leave the driver's seat. At level 4, a car will accelerate, slow down, pull over or park safely if the driver doesn't take control when requested. Level 4 is considered highly automated; however, full self-driving is only supported in limited areas (geofenced) or under certain conditions (traffic jams).

This level of automation will require some serious computing horsepower. For example, the NVIDIA Xavier SoC (system on a chip) offers a suite of deep learning technology supporting 30 trillion operations per second.

So, with this kind of tech, are we there yet? Well, not quite. Level 4 cars aren't scheduled until 2021. But when the vision becomes a reality, these cars won't be cars as we know them – they'll be small offices, entertainment theaters and much more.

The machine learning at this level will be highly sophisticated. Deep neural nets will provide advanced situational understanding, detecting and responding to emerging conditions and patterns over time. As for artificial intelligence, sensors inside and outside the car will track driver attentiveness, eye movements and gaze, plus alert on conditions humans can't detect. It'll be like the ultimate back seat driver, only reliable and trustworthy.

In a level 4 AIOps system we need similar intelligence. As with driving, today's application architectures are highly dynamic. Highly elastic serverless architectures mean containers can spin up and down based on demand patterns. Microservice development styles mean more components, nodes and dependencies. Add continuous integration/delivery pipelines and we have systems that can change in minutes. All of this leads to systems producing emergent behaviors that can no longer be predicted; often and aptly called "unknown unknowns."

Take, for example, optimizing resource usage. In a container-based system, high CPU utilization might be expected, even desired. Triggering an alert might only make sense if utilization drops (indicating an outage), or if it increases over a longer period.
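That alerting logic can be sketched in a few lines. This is a minimal illustration, not a product feature: the thresholds and window sizes are assumptions for the example, and a real AIOps system would learn them from historical behavior rather than hard-code them.

```python
from statistics import mean

# Illustrative thresholds -- assumptions for this sketch, not learned values.
LOW_WATERMARK = 0.10   # sustained drop this low may indicate an outage
TREND_WINDOW = 30      # number of samples treated as "a longer period"
TREND_DELTA = 0.20     # rise across the window worth flagging

def should_alert(cpu_samples):
    """Return a reason string if the CPU series warrants an alert, else None.

    cpu_samples: utilization values in [0, 1], oldest first.
    Note that high utilization alone is never alerted on -- in a
    container-based system, busy CPUs are expected and even desired.
    """
    if len(cpu_samples) < TREND_WINDOW:
        return None
    recent = cpu_samples[-5:]
    if mean(recent) < LOW_WATERMARK:
        return "utilization collapsed -- possible outage"
    window = cpu_samples[-TREND_WINDOW:]
    if mean(window[-5:]) - mean(window[:5]) > TREND_DELTA:
        return "sustained upward trend in utilization"
    return None
```

A steady 80% utilization produces no alert here; only a collapse toward zero or a sustained climb does, which inverts the static-threshold mindset.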

In such situations, static polling and simplistic correlation as a determinant for automated self-healing will fall short. For example, 1-minute polling of resource usage might be sufficient in systems that rarely change but is inadequate in systems that change dynamically. In other cases, alert-driven automation might result in continuous provisioning and de-provisioning fluctuations, when what's really needed are more advanced methods to regulate and smooth the flow of resources; something like the way a PID controller works in industrial automation.
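To make the PID analogy concrete, here is a toy proportional-integral-derivative loop that regulates replica count toward a target utilization instead of flapping on every alert. The gains and setpoint are illustrative assumptions; real controllers need tuning against the workload.

```python
class PIDScaler:
    """Toy PID controller that smooths scaling decisions.

    Rather than provisioning on each alert, it accumulates error
    against a utilization setpoint and emits a damped adjustment.
    Gains (kp, ki, kd) are illustrative, not tuned values.
    """

    def __init__(self, kp=4.0, ki=0.5, kd=1.0, setpoint=0.6):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def adjustment(self, utilization, dt=1.0):
        """Return a smoothed replica-count delta for one sampling interval."""
        error = utilization - self.setpoint          # proportional term input
        self.integral += error * dt                  # accumulated drift
        derivative = (error - self.prev_error) / dt  # rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Round the returned delta to whole replicas before acting on it. The integral term corrects persistent under- or over-provisioning, while the derivative term damps sudden swings, which is exactly the smoothing that simple alert-driven automation lacks.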

Nuanced consideration of dynamic system behavior is a critical element of AIOps level 4 and self-healing applications. Just as no two drives to the office are ever the same, multiple historical conditions across memory, resource consumption, storage, latency, etc. must be analyzed collectively to become a more reliable predictor and trigger of automation.
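The idea of analyzing conditions collectively rather than per metric can be sketched with a simple combined anomaly score. This is a deliberately naive illustration using z-scores; the metric names and history shape are assumptions for the example, and production systems would use far richer models.

```python
from statistics import mean, stdev

def collective_score(history, current):
    """Combine deviations across several metrics into one anomaly score.

    history: dict mapping metric name -> list of past values
    current: dict mapping metric name -> latest observed value

    Returns the mean absolute z-score across metrics, so one noisy
    metric spiking matters less than several drifting together --
    a collective signal is a more reliable trigger for automation.
    """
    z_scores = []
    for name, series in history.items():
        mu, sigma = mean(series), stdev(series)
        if sigma == 0:
            continue  # a flat metric carries no deviation signal
        z_scores.append(abs(current[name] - mu) / sigma)
    return mean(z_scores) if z_scores else 0.0
```

A snapshot matching historical norms scores near zero, while simultaneous drift in CPU and latency drives the score up together, giving automation a steadier trigger than any single-metric threshold.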

This is shown in the diagram below, where predictive capacity insights and historical "what if" analysis are being used to automate the provisioning of AWS instances. Similar to a level 4 car monitoring both driving conditions and the driver, advanced AIOps systems like these provide guardrails for IT operations – remediating problems and automating tasks, but able to give control back to staff when needed.


In the next and final blog in this series, I'll be discussing AIOps level 5 – Continuous AIOps. Here, I'll outline how operational intelligence and analytics can be incorporated within delivery pipelines and DevOps feedback loops to extend self-optimization. We'll also review the constituent parts needed in advanced systems to accelerate AI-driven automation. In the meantime, check out the IT industry's first AIOps Virtual Summit, which has some fantastic insights from AIOps thought leaders and practitioners.
