APM AIOps: Helping SREs Predict the Future?
June 12, 2018

David Blank-Edelman

As a kid I grew up reading a lot of science fiction. My forbearing parents used to let me take out the maximum number of books the library would allow each week (30; I still remember that number). And each week I would go back for more. Given this constant consumption of augury, you would think something I read would have prepared me for the future we now face within the Operations space.

While there are definitely some inklings in the science fiction canon about computer systems constructed at such scale that they would be hard for humans to understand, there is precious little attention paid to what it would take to operate them in production. Welcome to my world (and your reality, too, I bet).

At the upcoming AIOps Virtual Summit on June 20, we're going to be discussing two separate approaches to handling this level of complexity and how they intersect. The first is the engineering discipline known as Site Reliability Engineering (SRE), which aims to engineer failure out of the system. The second, AIOps, is a newly coined term for the application of a class of advanced algorithms to the massive corpus of operational data we now accumulate simply as part of the ordinary day-to-day activity of running all of these systems and services.

One goal of the former is to construct a set of operational practices that allow us to navigate the tricky path between a desired feature velocity (iterating the software as fast as possible to deliver the features a business needs to offer its customer base) and a desired level of operational stability (keeping the system available for those customers). This is trickier than it sounds for at least three reasons:

1. There are often completely different sets of people working on these problems.

2. They have very different incentives around the work.

3. Communication between these groups is often, shall we say, a little dicey.

SRE, like many other engineering disciplines, is a data-driven approach. It uses data (in ways we'll talk about in the upcoming session) to help make conversations more productive and decision making easier between these different groups.

AIOps similarly tries to use operational data to provide a big win for an organization. It attempts to address the hard problem of "we have all of this data on the operational status and performance of our infrastructure, what can we learn from it?"

Can the record of the past help us understand how things are working in the present or even help predict the future? Is there information in the data I have already that might provide some insight into how my systems are behaving? For example:

■ Is this just a spike in traffic or an indication my systems are about to experience a tailspin into failure?

■ Are there any difficult-to-see patterns in the load in my system that could help me optimally provision my resources so I don't pay more than I need to?

■ Have we ever seen an outage like the one we are experiencing? (And how did we deal with it last time?)

Some of this is real today, some of it is easily imagined. There are definitely limits on what AIOps can offer our operations practices, but we surely haven't taken it to its full potential yet.

Join me and Todd Palino, a senior SRE at LinkedIn, on June 20th at 1PM ET for our session on How AI is Helping Site Reliability Engineers Automate Incident Response. We'll discuss both approaches and their potential to bring a little bit of the future into your present. See you online!

David Blank-Edelman is the Co-Founder of SREcon and Author of "Seeking SRE: Conversations on Running Production Systems at Scale"