The IT operations technology landscape has become littered with analytics jargon: machine learning, algorithmic IT, log analytics.
But however complex these sound, the basic premise is pretty simple: hidden in the vast operational big data junkyard of counters, gauges, alerts, logs, events and alarms are secrets. If teams can make sense of it all, they could unearth some wonderful insights. But as an army of vendors is quick to point out, they'll need specialist tools to mine the gems. Tools to ingest a steaming pile of raw operational data and magically transform it into crystal-clear business value.
So take data, liberally apply math fairy dust, build knowledge, and drive business actions. It's a familiar theme now being played in an IT operations theater near you.
Sounds great on the "unleash the power of IT ops data" banner ads, but if it's so simple, why hasn't it produced stunning success stories? Cases where IT ops has gifted the business with hidden intelligence that equates to cold, hard cash.
Probably because the hype and promise of IT operations analytics are well beyond the capabilities of existing tools and thinking.
Beware of Analytical Lipstick and Small Data Tools
Technology innovation, especially in areas such as IT operations data collection, has advanced rapidly. We've become accustomed to collecting and ingesting massive amounts of data from disparate sources using a swag of monitoring tools. But in the frenzy to keep digging, we've inhibited our own ability to apply new methods, learn and experiment.
In the interim, software vendors are scurrying to fill the void with products that attempt to address the challenge, but in a small data way. Having bought into the promise of analyzing operational data at scale, teams have little more to demonstrate than yet another cool monitoring dashboard. Highly visual and interactive, yes, but providing only narrow views, or worse, relying on sysadmins actually knowing what they're looking for.
Large enterprises are investing mega dollars in Amazon-esque technologies for IT ops big data, but unfortunately innovative thinking isn't accompanying the tools. Take predictive analytics, for example. Any half-decent machine learning technique should be capable of correlating thousands, perhaps millions of data points across applications, infrastructure and networks to pinpoint anomalies that will result in failures, enabling proactive maintenance. But imagine if, instead of just applying the math to the problem, site reliability engineers could correlate this massively complex data to determine exactly why components fail and recommend corrective actions.
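To make the anomaly-pinpointing claim concrete, here is a minimal sketch of the simplest technique in that family: a rolling z-score that flags metric readings far outside their recent baseline. The function name, window size and latency series are all illustrative assumptions, not any vendor's product or API.

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates from the trailing-window
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady response latency with one spike -- only the spike is flagged.
latency_ms = [20, 21, 19, 22, 20, 21, 20, 19, 22, 21, 20, 250, 21]
print(zscore_anomalies(latency_ms))  # [11]
```

Real products layer correlation across thousands of such streams, but the principle, separating signal from a learned baseline, is the same.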
Taking a leap of faith and adopting new techniques requires a cultural shift. For good reason, IT operations teams don't buy into outlandish claims; they've been burnt badly too many times. Problematic too has been the "protectionist" use of siloed data to provide a rock-solid alibi when the crap hits the fan.
But whatever algorithmic wrapper we care to use, analytics and machine learning will soon render traditional IT operations approaches (and excuses) obsolete. That's no reason to be downtrodden and vent South Park style - it's an opportunity to do some remarkable things. Like identifying software code and practices that correlate to the greatest customer engagement, or accurately determining the right cloud service for your workloads.
How to Trust Machine Learning If It Keeps Screwing Up
On the business side of the fence (if there actually is a fence anymore), analytics has been used for decades, but it remains largely unexplored in IT ops. Sure, there are methods to determine future storage and compute needs when usage follows a linear path, but those paths are becoming less clear. With companies fully committed to digital business, where a million customers could be hitting an app at the same time across multiple regions, traditional forecasting methods go out the window.
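For contrast, here is roughly what that traditional linear-path forecasting amounts to: a least-squares fit over historical usage, extrapolated forward. The storage figures are hypothetical; the point is that this only works while growth actually stays on the line.

```python
def linear_forecast(history, steps_ahead):
    """Fit y = slope * t + intercept by least squares over `history`
    (one reading per period) and extrapolate `steps_ahead` periods."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history)) \
        / sum((t - t_mean) ** 2 for t in range(n))
    intercept = y_mean - slope * t_mean
    return slope * (n - 1 + steps_ahead) + intercept

# Monthly storage use in TB, growing steadily -- forecast 3 months out.
storage_tb = [10.0, 10.5, 11.0, 11.5, 12.0]
print(round(linear_forecast(storage_tb, 3), 1))  # 13.5
```

A viral traffic spike or regional rollout breaks the straight-line assumption instantly, which is exactly the gap the article describes.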
Of course, assuming that analytics can step into the breach isn't always right. Ops pros with years of knowledge won't trust software when the decisions it makes repeatedly put the business at risk. Even small analytical nuances can have dire consequences for customer experience. Imagine, for example, a system that fails to filter false-positive alarms and recommends, nay, actions an error-prone database failover, over and over again. Or a system that uses CPU utilization to predict cloud instance requirements but, because of lengthy sampling intervals and a failure to account for VM stand-up times, leaves the business exposed to a swift and sudden performance failure.
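The sampling-interval pitfall is easy to demonstrate with hypothetical numbers: averaging one-second CPU readings into a single one-minute sample completely hides a five-second burst, so an autoscaler watching only the coarse average would never react.

```python
def downsample_avg(samples, interval):
    """Average fine-grained samples into coarse buckets of `interval` points."""
    return [sum(samples[i:i + interval]) / len(samples[i:i + interval])
            for i in range(0, len(samples), interval)]

# 60 seconds of per-second CPU %: mostly idle, with a 5-second 95% burst.
cpu = [10] * 25 + [95] * 5 + [10] * 30
per_minute = downsample_avg(cpu, 60)

print(max(cpu))                 # 95  -- the real peak users felt
print(round(per_minute[0]))     # 17  -- all a coarse-grained scaler sees
```

A scaling model fed the 17% figure concludes everything is fine; add a few minutes of VM stand-up time on top and the performance hit lands before any remedy can.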
Real-world quirks like these suggest that IT operations analytics success depends on building models from a diverse set of indicators. In a serial application, where total availability is always less than that of the least reliable component, the best solutions will systematically pinpoint those elements, be it some lousy code, a flaky network or an aging server. And while this sounds obvious, it's amazing how teams invest time, energy and dollars where they deliver the least return. Why? Because even the latest analytical wizardry can lack fundamental operational savvy.
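The weakest-link arithmetic behind that claim takes only a few lines. With some hypothetical component figures, the availability of a serial chain is the product of its parts, so it always sits below the least reliable component:

```python
from functools import reduce

def serial_availability(components):
    """Availability of components in series = product of their availabilities."""
    return reduce(lambda a, b: a * b, components)

# Hypothetical tiers in a serial request path.
tiers = {"app_code": 0.999, "network": 0.995, "aging_db_server": 0.98}
total = serial_availability(tiers.values())

print(round(total, 4))  # 0.9741 -- below the weakest link's 0.98
```

Which is why spending on the 0.999 tier while ignoring the 0.98 one delivers the least return: the product is dominated by the worst component.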
Without addressing these most basic of IT ops use cases, it's hard to put faith in machine learning for more complex problems. To make accurate predictions across complex distributed applications serving a multi-channel business means applying models that can work from any number of contexts. All of which requires correlating data from code to bare metal – not a trivial task and beyond the means of most providers.
We're only just starting to realize the full potential of analytics in IT operations. To make real progress, IT leaders must think beyond the hype and start working with partners who truly understand the scope of the problem and what it really takes to gain actionable insights.
Pete Waterhouse, Advisor, Product Marketing at CA Technologies.