Automation has been around since the first server admin wrote a script. Since then, IT life has become steadily more complex: multiple data centers, high availability, disaster recovery, and now the cloud all combine into a constantly changing hybrid IT estate.
Over the last few decades, IT departments have seen their budgets shrink, in part because of recession. As a result, they are being asked to do more with less. The increase in work has amplified the need for automation.
IT Process Automation
IT process automation ranges from simple individual scripts to branching scripts to full-blown orchestration systems. In addition to commercial offerings, the open-source community has developed over a dozen projects.
Until recently, all automation required a human to kick off the process. Operations teams would manually go through the data they received, and finding the problem could take days and require war room sessions. Once the problem was found, a fix could be implemented manually, with a script, or with an automation tool. As compliance became more important, so did tools that log the actions taken to implement a fix, but humans were still driving the automation.
Runbooks started as paper instructions on what to do when a well-known problem occurred. More recently, runbooks have become automated scripts or orchestration systems. This documentation and automation help move problem resolution to less experienced operators, a practice sometimes called shift-left.
AI has recently become a reality for IT operations: it can find problems and then activate the appropriate automation to fix them. The original AIOps definition focused on applying machine learning to the vast amount of data that IT operations receives from all of its monitoring tools. According to Enterprise Management Associates (EMA), enterprises have more than 10 monitoring tools managing a hundred thousand metrics per day, not including log files.
This amount of data is too much for any human, or even a group of humans, to process. Hence the application of machine learning, which can process all this information in minutes or hours and point to the most likely root cause, or at least narrow the candidates to a small number. Succinctly, AIOps turns IT operations data into operational insights to pinpoint the root cause of a problem.
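To make the idea of narrowing many metrics to a few suspects concrete, here is a minimal sketch in Python. It is not how any particular AIOps product works: it swaps in a simple z-score check for real machine learning, and the metric names, sample values, and the `rank_anomalous_metrics` helper are all hypothetical.

```python
from statistics import mean, stdev


def rank_anomalous_metrics(history, latest, threshold=3.0):
    """Return (metric, z_score) pairs for metrics whose latest value
    deviates strongly from their own history, most anomalous first.

    Assumption: a plain z-score stands in for a real ML model.
    """
    suspects = []
    for name, values in history.items():
        # Skip metrics we cannot score: no new reading or too little history.
        if name not in latest or len(values) < 2:
            continue
        sigma = stdev(values)
        if sigma == 0:
            continue  # a flat history gives no scale to measure deviation
        z = abs(latest[name] - mean(values)) / sigma
        if z >= threshold:
            suspects.append((name, z))
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data: three metrics, one of which has just spiked.
history = {
    "db_latency_ms": [20, 22, 21, 19, 20, 21],
    "cpu_percent": [40, 42, 38, 41, 39, 40],
    "disk_io_mbps": [100, 98, 102, 101, 99, 100],
}
latest = {"db_latency_ms": 95, "cpu_percent": 41, "disk_io_mbps": 100}

suspects = rank_anomalous_metrics(history, latest)
print([name for name, _ in suspects])  # ['db_latency_ms']
```

Even this toy version shows the shape of the task: many inputs go in, and a short, ranked list of likely culprits comes out for a human or an automation tool to act on.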
Most recently, the definition of AIOps evolved to include automation. The idea is that once the machine learning determines the problem as described above, it kicks off the automation tools to fix the problem.
A recent survey found that about half of the responding organizations allowed fully automated problem resolution, with no human involved. The other half wanted human review before acting, but even this is preferable to having a human take the time to decide which automation flow is required. This is a significant change from five years ago, when most organizations were very nervous about automated remediation.
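The two operating modes described here, fully automated remediation and remediation gated by human review, can be sketched as a small dispatcher. This is an illustrative sketch only: the `Runbook` class, the `remediate` function, the problem names, and the approval callback are hypothetical and do not correspond to any specific tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Runbook:
    name: str
    action: Callable[[], str]       # the automated fix; returns a short result
    requires_approval: bool = True  # False means fully automated remediation


def remediate(
    problem: str,
    runbooks: Dict[str, Runbook],
    approve: Callable[[str, str], bool],
    audit_log: List[str],
) -> str:
    """Run the runbook mapped to a diagnosed problem, logging every decision."""
    runbook = runbooks.get(problem)
    if runbook is None:
        audit_log.append(f"{problem}: no runbook, escalated to an operator")
        return "escalated"
    if runbook.requires_approval and not approve(problem, runbook.name):
        audit_log.append(f"{problem}: {runbook.name} rejected by reviewer")
        return "rejected"
    result = runbook.action()
    audit_log.append(f"{problem}: ran {runbook.name} -> {result}")
    return "remediated"


# Hypothetical runbooks: one fully automated, one gated by human review.
runbooks = {
    "disk_full": Runbook("purge_temp_files", lambda: "freed space", requires_approval=False),
    "service_down": Runbook("restart_service", lambda: "restarted"),
}
audit_log: List[str] = []
deny_all = lambda problem, runbook_name: False  # stand-in for a human reviewer

print(remediate("disk_full", runbooks, deny_all, audit_log))     # remediated
print(remediate("service_down", runbooks, deny_all, audit_log))  # rejected
print(remediate("dns_failure", runbooks, deny_all, audit_log))   # escalated
```

In both modes the choice of runbook is already made by the time a human is asked anything, which is the time saving the paragraph above describes. The audit log covers the compliance need to record what was done and why.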
Getting to Automated AIOps
How do you get from your current IT operations reality to automated AIOps? As the title implies, there are two parts: automation and AI. The table below shows the maturity curve from the automation perspective.
Ultimately, the goal of the AI part is to take in data and automatically determine the problem. Some problems do not require machine learning to find, but the system must still be able to take in data and isolate the problem. From the Ops side, start by picking a single use case, for example "optimize event management," and work with your teams to identify the problems they see repeatedly. Once AI is identifying one of those problems, connect it to the automated runbook that will remediate it. Voilà: you now have automated AIOps.