The Leading Causes of IT Outages - and How to Prevent Them
November 04, 2019

Mark Banfield
LogicMonitor

IT outages happen to companies across the globe, regardless of location, annual revenue or size. Even the most mammoth companies are at risk of downtime. Increasingly over the past few years, high-profile IT outages — defined as when the services or systems a business provides suddenly become unavailable — have ended up splashed across national news headlines.

In March 2019, Facebook and Instagram each experienced 14 hours of downtime. A second IT outage struck both — along with WhatsApp — in April 2019, taking all three platforms offline. And in July 2019, all three platforms experienced availability problems that impacted users. British Airways has also faced a series of high-profile IT outages in the past, including one in April that resulted in 100 canceled flights and 200 delayed flights. An outage back in May 2017 also affected more than 1,000 flights, call centers, BA's website and BA's mobile app.

Given these recent disruptive and costly outages, LogicMonitor commissioned an independent study of the major causes of downtime, the business impact of outages on organizations, and ways to avoid IT outages and brownouts. The IT Outage Impact Study surveyed 300 IT decision-makers across the United States, Canada, the United Kingdom, Australia and New Zealand.

Outages Lead to Compliance Failures and High Costs

Among other insights, the survey revealed the top five issues keeping IT decision-makers up at night. Performance and availability ranked first and second, ahead of security and cost-effectiveness.

Unfortunately, those self-reported fears about IT teams' ability to maintain availability are well-founded. In fact, 96% of global survey respondents reported that their organizations had suffered at least one IT outage over the past three years. Such outages can have serious implications, including steep costs and low customer satisfaction scores. Heavily regulated industries, such as healthcare and finance, face another dire consequence beyond service disruptions and costs as a result of outages: compliance failure.

"One of our clients is a radiology company, and they need to be up 24/7," said a service desk support engineer for a solution provider. "If they have more than an hour of downtime a year, probably less than that, that's a serious issue. These guys can never go down, for legal reasons."
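To put the engineer's figure in perspective, the arithmetic below converts a yearly downtime allowance into an availability percentage and back. It is only an illustrative sketch: the function names are hypothetical, a 365-day year is assumed, and the 99.9% target in the second call is an example value, not a number from the study.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def availability_pct(downtime_hours_per_year: float) -> float:
    """Availability percentage implied by a given amount of yearly downtime."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


def downtime_budget_hours(target_pct: float) -> float:
    """Yearly downtime (in hours) permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - target_pct / 100.0)


# One hour of downtime per year, as in the quote above
print(f"{availability_pct(1.0):.3f}%")            # 99.989%

# Illustrative target: "three nines" allows under nine hours per year
print(f"{downtime_budget_hours(99.9):.2f} hours")  # 8.76 hours
```

In other words, a limit of one hour per year is stricter than a 99.9% availability target by roughly a factor of nine.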


Human Error is #1 Cause of IT Outages in the US and Canada

The study found that human error was the #1 cause of IT outages in the United States and Canada, and the #3 cause globally. Given this finding, it was no surprise that Network World covered the story of British Airways' May 2017 outage under the headline, "British Airways' outage, like most data center outages, was caused by humans."

The Network World article describes how an engineer working onsite at a data center near Heathrow Airport disconnected a power supply. When the power supply was reconnected, a surge of power caused the outage. The article also cites a 2016 Ponemon Institute study, which found that human error accounted for 11% of outages, more than weather (10%), generator failures (6%) or IT equipment malfunction (4%).

Faced with findings like these, it's no wonder that global IT decision-makers said 51% of IT outages are avoidable. As a result, more and more teams worldwide are transitioning to monitoring tools that incorporate AIOps and automation to minimize human error and maximize early warning opportunities.

Monitoring Helps Prevent Outages Through Early Warning Systems

Comprehensive monitoring provides visibility into IT infrastructure and can help organizations get ahead of trends that indicate an outage may be rapidly approaching. The top two causes of outages, according to survey respondents, are declining hardware/software performance and IT teams' failure to notice when usage reaches a dangerous level. Artificial intelligence for IT operations (AIOps) and intelligent monitoring offer an effective solution to both of these outage factors.

To minimize your organization's outage risk, look for monitoring solutions with the following capabilities:

■ A platform that offers a holistic view of your IT systems via a single pane of glass and integrates with all your technologies

■ A tool that builds in a high level of redundancy to eliminate single points of failure

■ A platform whose early warning system provides visibility into trends that could indicate future trouble

■ A solution that is able to scale with your business as it grows, making sure your current and future monitoring needs are met
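As a rough illustration of the early-warning idea, the sketch below fits a straight line to recent usage samples and estimates how long it will be before usage reaches a dangerous level. It is a minimal example under stated assumptions, not how LogicMonitor or any particular AIOps product works: the function name, the evenly spaced samples, the 90% threshold and the 24-hour alert window are all hypothetical, and real tools typically use more sophisticated forecasting.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    samples: Sequence[float],
    threshold: float,
    interval_hours: float = 1.0,
) -> Optional[float]:
    """Estimate hours until usage crosses `threshold`.

    Fits a least-squares line to evenly spaced samples (oldest first).
    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= threshold:
        return 0.0

    # Ordinary least squares with x = 0, 1, ..., n - 1
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    if slope <= 0:
        return None  # usage is not growing; no crossing to predict

    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    return max(0.0, (crossing_x - (n - 1)) * interval_hours)


# Hypothetical hourly disk-usage readings, in percent
disk_usage = [50, 52, 54, 56, 58]
eta = hours_until_threshold(disk_usage, threshold=90)
print(eta)  # 16.0

# Hypothetical policy: warn if the threshold is less than a day away
if eta is not None and eta < 24:
    print(f"Early warning: disk projected to reach 90% in about {eta:.0f} hours")
```

The point of the example is that the alert is raised from the trend, well before the threshold itself is reached, which gives a team time to act instead of reacting to an outage already in progress.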

Mark Banfield is CRO at LogicMonitor