How AIOps Defuses the Impact of Change
July 12, 2021

Phil Tee
Moogsoft

When distressing internet outages occur, like the recent Fastly incident that threw a slew of websites offline, I am never surprised by how widespread the problem was. Paradoxically, I am surprised that it wasn't worse.

The infrastructure behind our digital world is mind-numbingly complex. The movement to cloud computing has added even more layers to the interconnectedness. So when a simple software update goes awry, despite the best efforts of quality control, the ripple effects can go far and wide. The digital economy in the US alone accounts for at least $1,849 billion annually, according to a 2020 report by the Bureau of Economic Analysis. So every moment offline matters.

Prompt troubleshooting is a herculean task — impossible, really, for the human mind alone. There's just too much information to sift through to quickly identify how a single change event precipitated such a widespread crash. IT teams must rely on artificial intelligence, machine learning and algorithms to find and repair the root cause of the problem.

The Perils of "Change"

What seems nearly effortless online to most of us (ordering food, joining a Zoom call, reading this article) is a staggeringly Byzantine, interconnected flow of data packets, routers, modems, internet service providers, gateways, network exchanges, servers and applications. The interdependencies are so numerous that mapping them in any meaningful way is out of reach. For a human mind, you're talking about understanding more interdependencies than there are particles in the observable universe, a stunning amount of complexity.

Amid that landscape is the need to update software, whether to refresh the operating system, add features or bolster security. And from time to time, someone performs a routine update that has an unintended and unforeseen consequence. Predicting every such problem before anything goes wrong is largely a fool's errand because the scale of the situation is just too great. The key is to find the problem once it surfaces and before it grows into a widespread outage. In such an interconnected digital world, errors tend to cascade and propagate. Catching them early is paramount.

One simple update that goes awry could cripple e-commerce if the resulting system outages lingered. The potential risk is profound. History has shown what happens when unintended consequences snowball. Mexico reeled in the 1990s from the devaluation of the peso. The United States stumbled in the 2000s when collateralized debt obligations tied to the mortgage industry prompted a financial crisis.

To be clear, the Fastly incident wasn't a global crisis. The Fastly team responded remarkably well. But the outage underscored how quickly trouble can spread in the interconnected digital world. What's absolutely necessary is to pinpoint the problem immediately.

How Intelligent Observability Defuses the Threat

This is where intelligent observability comes in to analyze the impact of change. AIOps and observability work together to quickly spot the patterns and interconnections in the application data and identify the root cause of a problem before it cascades further and causes a widespread outage.

Every change, every software update, has some kind of record associated with it. So theoretically, when something goes wrong, a site reliability engineer or other IT expert would get an alert and could simply trace the issue back to the record of the change that triggered it. But in practice, the situation is far more complicated. Thousands of other data points are created before and after that specific change, so the challenge in identifying the root cause of the problem is linking the right data to the relevant change.
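
To make the linking problem concrete, here is a minimal sketch in Python of one simple way to shortlist change records for an alert. It is an illustration under stated assumptions, not a description of how any particular AIOps product works: the `ChangeRecord` and `Alert` types, the `candidate_changes` function, the `depends_on` map and the 30-minute window are all hypothetical. The idea is to keep only changes applied shortly before the alert, then rank changes to the alerting service or its known dependencies ahead of unrelated ones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    # Hypothetical record of a deployed change
    change_id: str
    service: str
    applied_at: datetime


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert raised against a service
    service: str
    raised_at: datetime


def candidate_changes(alert, changes, depends_on, window=timedelta(minutes=30)):
    """Return changes applied within `window` before the alert, likeliest suspects first."""
    # Services considered related: the alerting service plus its direct dependencies
    related = {alert.service} | depends_on.get(alert.service, set())
    suspects = []
    for change in changes:
        lag = alert.raised_at - change.applied_at
        # Ignore changes made after the alert or too long before it
        if timedelta(0) <= lag <= window:
            suspects.append((change.service not in related, lag, change))
    # Related services first, then the most recent change first
    suspects.sort(key=lambda item: (item[0], item[1]))
    return [change for _, _, change in suspects]


# Example with made-up data
t0 = datetime(2021, 1, 1, 9, 0)
changes = [
    ChangeRecord("chg-1", "billing", t0 - timedelta(hours=3)),
    ChangeRecord("chg-2", "search", t0 - timedelta(minutes=20)),
    ChangeRecord("chg-3", "cdn-config", t0 - timedelta(minutes=10)),
    ChangeRecord("chg-4", "checkout", t0 + timedelta(minutes=5)),
]
depends_on = {"checkout": {"cdn-config"}}
ranked = candidate_changes(Alert("checkout", t0), changes, depends_on)
print([c.change_id for c in ranked])  # ['chg-3', 'chg-2']
```

A time window plus a dependency map is deliberately crude. With thousands of changes and data points in play, a real system would need richer signals than these two, which is the gap the algorithms described next are meant to fill.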

AIOps finds the right data. It applies algorithms to observability data such as metrics, logs and traces to identify anomalies, determine event significance, surface meaningful alerts and correlate data to provide valuable context. Observability makes the job easier by engineering the application infrastructure so that its data is easier to collect and inspect. AIOps surfaces the right data amid an ocean of data so your IT teams can quickly spot and repair the problem.
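
As a rough illustration of what "identify anomalies" can mean for a single metric, the Python sketch below flags points that deviate sharply from a trailing baseline. This is one textbook technique (a rolling z-score), chosen here for brevity; it is an assumption for illustration, not the specific algorithm any AIOps platform uses. The function name `find_anomalies`, the window of 5 samples, the threshold of 3 standard deviations and the latency values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations
    away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat baseline: treat any departure from it as anomalous
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples in milliseconds with one obvious spike
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
print(find_anomalies(latency_ms))  # [7]
```

Flagging the spike is only the first step. The anomaly still has to be weighed for significance and correlated with other events, such as recent changes, before it becomes a meaningful alert.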

Every change, every software update, leaves a clue behind. The problem is there are thousands and thousands of potential suspects. Intelligent observability can quickly solve the "whatdunnit" before any outage becomes much worse.

Phil Tee is CEO of Moogsoft