Don't Let an IT Service Disruption Lead to Catastrophic Downtime
January 08, 2024

Krishna Dunthoori
Apty


Over the years, we've seen several high-profile examples of how even the slightest human error can induce devastating bouts of downtime. One infamous example came several years ago, when Amazon's S3 service was knocked offline, obliterating service to social media platforms, web publishers, and other leading websites. The cause? A simple typo — an authorized employee intended to take a small number of servers offline to fix a problem with the billing system, but accidentally entered a command incorrectly and removed a large number of servers instead.

Within several hours, Amazon's S3 service was back online, but the incident had lasting ramifications. Numerous popular apps and websites were impacted, and the estimated cost to S&P 500 companies was $150 million, while US financial services companies lost an estimated $160 million in revenue.

Even for the average organization (i.e., one not of Amazon's size), the cost of application downtime stands at a staggering $5,600 per minute. Moreover, outages are continuing to increase, as more people within an organization are empowered to make changes to IT services. In fact, a large majority of all incidents reported to an IT service desk are caused by change.
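To put the per-minute figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration that assumes a flat cost rate for the whole outage; the function name and the sample outage durations are hypothetical, not figures from any report.

```python
# Average cost of application downtime cited above, in US dollars per minute.
COST_PER_MINUTE_USD = 5_600


def estimate_downtime_cost(outage_minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Return a flat-rate estimate of an outage's cost in US dollars.

    Assumption: cost accrues linearly with time, which real outages
    rarely do, so treat the result as an order-of-magnitude figure.
    """
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    return outage_minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage durations, in minutes.
    for minutes in (15, 60, 240):
        cost = estimate_downtime_cost(minutes)
        print(f"{minutes:>3}-minute outage: ${cost:,.0f}")
```

At the cited rate, a one-hour outage works out to roughly $336,000 under this simplified model.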

IT Service Management (ITSM) solutions are widely available to help solve this problem, and incident management is one of their main pillars. Incident management enables the rapid identification, notification, and resolution of critical application outages, and provides a clear, documented process to follow if and when things go wrong. Yet ITSM rollouts, like other IT projects, often fall short. The reported percentage of IT projects that result in failure depends on the article or survey you read, but most put the number at 55 to 75 percent. So why do so many ITSM implementations fail?

Like other software implementations, ITSM often suffers from a lack of user adoption. This is because people, by nature, are resistant to change. Sometimes, organizations and their training teams erroneously believe they can communicate once or twice about a new software implementation, deliver a round of training, and sit back and expect to realize software value. However, in prioritizing go-live, many training teams fail to properly support user adoption in the ensuing days and months, and adoption never reaches meaningful levels.

But in an incident response context, something else seems to be going on. Any strong emotion that temporarily impairs our thinking (anxiety, fear, or anger, for example) can result in a "brain freeze," or a temporary decline in cognitive functioning. So when an incident occurs, the ensuing panic among employees, who are likely unfamiliar with the ITSM solution anyway, makes the situation that much grimmer.

So how can organizations and training teams harness the full potential of ITSM solutions to maximize application uptime?

There are several areas to focus on, including:

Seamless onboarding and increasing user adoption - Organizations and their training teams need to simplify the ITSM onboarding process by providing real-time, in-app, context-driven guidance. This reduces the learning curve and eliminates the fear of embracing the new technology, while providing the right support at the right time.

Supporting change processes - Given the pace and frequency of change, context-driven guidance also makes it easier for ITSM users to implement changes with fewer risks and disruptions, so that changes are carried out much more smoothly.

Reducing the all-important mean time to repair (MTTR) - Especially in times of strain, context-driven guidance can also help ITSM users swiftly find information and efficiently resolve IT issues they don't encounter every day, by providing in-the-moment, step-by-step guidance. This leads to higher user productivity and satisfaction while minimizing service disruptions.
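For teams that want to track whether MTTR is actually improving, the metric is commonly computed as total repair time divided by the number of incidents. The sketch below assumes each incident is recorded as a pair of detection and resolution timestamps; the function name and the sample incidents are hypothetical, and real ITSM tools may define the start and end of an incident differently.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average the (resolved - detected) duration over all incidents.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum yields a timedelta
    )
    return total / len(incidents)


# Hypothetical incident log: (detected_at, resolved_at).
SAMPLE_INCIDENTS = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 45)),    # 45 minutes
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 16, 15)),  # 135 minutes
    (datetime(2024, 1, 9, 8, 30), datetime(2024, 1, 9, 9, 30)),   # 60 minutes
]

if __name__ == "__main__":
    print("MTTR:", mean_time_to_repair(SAMPLE_INCIDENTS))
```

Comparing this figure before and after a guidance rollout is one simple way to check whether the training investment is paying off.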

The Amazon S3 incident may seem like an egregious example of "breaking the internet." Yet it clearly highlights how the slightest change or error can induce disaster, as well as the fragility of modern infrastructures, realities that affect all organizations. Successfully implementing and training on ITSM, and specifically incident management as part of an ITSM approach, can be vital in avoiding expensive downtime when a disruption occurs. The key is to have ongoing training and guided risk management in place so there is little to no pause in response when the inevitable error or disruption happens. This is where solutions like digital adoption platforms (DAPs) come into play: by ensuring seamless and efficient adoption of ITSM tools, they help reduce the downtime that IT disruptions cause.

Krishna Dunthoori is Founder and CEO of Apty