In our first blog in this series, we discussed the 5 levels of driving automation. By presenting work from the Society of Automotive Engineers (the J3016 standard), we used a comparative maturity level construct to show how achieving autonomous IT operations using AI and machine learning (also known as AIOps) follows a similar path. For example, a level 1 vehicle (driver assistance) will assist with many functions such as blind spot warnings, but the driver still has complete control of all operational and tactical driving tasks. Similarly, in IT operations, a level 1 state means staff are still behind the wheel, so to speak, but analytics can be effectively employed to eliminate false positives and pinpoint anomalies, our own blind-spot equivalents.
For many organizations this is a great start. Modern containerized applications decomposed into thousands of microservices can produce a 10-fold increase in metrics, and it's not unusual for IT Ops teams to get more than 50,000 alerts per month. That's just too many for staff to realistically process using traditional monitoring, so it's not surprising that the more problematic conditions and anomalies are often missed. Even worse, when alarms are classified in the noise and nuisance bucket, they're easy to ignore completely.
At level 2 of self-driving cars we have what's known as partial automation. Most automakers are developing cars at this level, where the vehicle can assist with many functions and allow the driver to disengage from those tasks. Of course, the driver must still be ready to take control and maintains full responsibility for safety-critical functions and monitoring.
While level 2 features such as automated parking and lane keep assist are as common as coffee cup holders, they do introduce many new smarts. For example, assisted parking depends on a variety of proximity sensors that use electromagnetic or ultrasonic detection to determine the distance and size of objects close to the car. An onboard computer then uses these measurements to calculate the best parking maneuver. Because some control is still in the hands of the driver (e.g. braking), adjustments have to be made to accommodate mistakes, which are so easy to make in a tricky parallel parking situation, right?
While we now take partial automation in cars for granted, attaining this level to find the root-cause of problems confronting IT operations has proven more difficult. That's not through lack of monitoring sensors, far from it. Many operations teams routinely employ a variety of dashboards and agent-based technologies, and many of these do a great job at helping teams find the root-cause of a problem within a narrow technology domain using an iterative cause-and-effect approach. Where they do fall down, however, is in the area of scalable cross-domain analytics: that is, handling massive increases in data across the entire technology stack and then correlating conditions to find the root-cause.
Coming back to level 2 cars, this process is actually quite similar. An assisted parking function only using a rear-bumper sensor can never handle a parallel parking scenario. This can only happen when inputs are gathered and correlated from sensors deployed at multiple locations (rear, front, side) to accurately determine the size and proximity of objects. Similarly, true Artificial Intelligence for IT Operations (AIOps) must be capable of agnostic data capture (logs, transactions, metrics) and cross-domain (application, infrastructure, network) correlation in order to automate root-cause analysis.
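To make the idea of cross-domain correlation concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular AIOps product works: the `Event` type, the `correlate` and `cross_domain_groups` helpers, the 60-second window and the sample events are all hypothetical. It groups events that occur close together in time, regardless of whether they came from logs, transactions or metrics, and then keeps only the groups that span more than one domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary start time
    domain: str       # "application", "infrastructure" or "network"
    source: str       # "log", "transaction" or "metric"
    message: str


def correlate(events, window_seconds=60.0):
    """Group events that fall within window_seconds of the first event in a group."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][0].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def cross_domain_groups(groups):
    """Keep only the groups whose events come from more than one domain."""
    return [g for g in groups if len({e.domain for e in g}) > 1]


# Hypothetical sample data: three related events, then an unrelated one.
events = [
    Event(0.0, "network", "metric", "packet loss on switch port"),
    Event(20.0, "infrastructure", "log", "storage latency warning"),
    Event(45.0, "application", "transaction", "checkout response time degraded"),
    Event(500.0, "application", "log", "cache warm-up completed"),
]

candidates = cross_domain_groups(correlate(events))
for group in candidates:
    print([f"{e.domain}/{e.source}: {e.message}" for e in group])
```

Time proximity alone is a crude signal. A real platform would also need to know how the affected components relate to each other, which is where the data model discussed next comes in.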
In practice, achieving level 2 AIOps begins by employing a consistent approach to data modelling. Individual tools can only document and visualize individual parts of highly complex systems. While this supports discrete monitoring needs, it does little to determine how overall business performance is impacted by complex and interrelated conditions.
Paradoxically, this can lead to contextual blindness: a condition where each new and valuable data source actually increases the cost of monitoring. Consider, for example, a case where a network fault management system has detected a problem with an element. The system will detect the problem, perhaps even the cause. However, because its data model is maintained in isolation, it cannot determine which business applications will move out of compliance. This can only happen by presenting data in context to another team using a different tool based on a different data model.
To address these issues and help teams automate root-cause analysis at a true "system level," modern AIOps platforms will require a unified data model. Dynamically built using a time-journaled directed graph of objects, this model provides the foundation upon which to ingest, correlate and visualize the more complex problems arising across modern distributed applications and microservices. Ontology-agnostic and open-ended, this unified data model allows every condition to be analyzed in the context of shared business goals. To learn more about why a unified data model is essential for AIOps and root-cause analysis, check out this detailed RCA white paper.
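As a rough illustration of what a time-journaled directed graph of objects could look like, consider the Python sketch below. It is a simplified assumption for illustration, not the data model of any specific platform: the `JournaledGraph` class, its method names and the object names (`router-1`, `host-a`, `payments-api`, `checkout`) are all hypothetical. Each edge records when a relationship started and ended, so the graph can answer "what did this element support at the time of the fault?" rather than only "what does it support now?"

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:
    source: str                       # the supporting object
    target: str                       # the object that depends on it
    valid_from: float                 # time the relationship was first observed
    valid_to: Optional[float] = None  # None means the relationship is still active


class JournaledGraph:
    """Directed graph that keeps the history of its edges instead of overwriting it."""

    def __init__(self):
        self._edges = []

    def add_edge(self, source, target, at):
        self._edges.append(Edge(source, target, at))

    def remove_edge(self, source, target, at):
        # Close the active edge rather than deleting it, so history is preserved.
        for edge in self._edges:
            if edge.source == source and edge.target == target and edge.valid_to is None:
                edge.valid_to = at

    def neighbors(self, node, at):
        return [
            e.target
            for e in self._edges
            if e.source == node
            and e.valid_from <= at
            and (e.valid_to is None or at < e.valid_to)
        ]

    def impacted(self, node, at):
        """Return every object reachable from node as the graph stood at time `at`."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for nxt in self.neighbors(current, at):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Hypothetical topology: a router supports a host, which runs a service
# that the "checkout" business application depends on.
graph = JournaledGraph()
graph.add_edge("router-1", "host-a", at=0)
graph.add_edge("host-a", "payments-api", at=0)
graph.add_edge("payments-api", "checkout", at=0)

# At t=100 the service is rescheduled onto a different host.
graph.remove_edge("host-a", "payments-api", at=100)
graph.add_edge("host-b", "payments-api", at=100)

print(graph.impacted("router-1", at=50))   # router fault before the move
print(graph.impacted("router-1", at=150))  # same fault after the move
```

In this sketch, a fault on `router-1` at time 50 reaches the `checkout` application, while the same fault at time 150 reaches only `host-a`, because the service had already moved. This is the kind of business context the isolated network fault management system in the earlier example could not provide on its own.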