Don't Let an IT Service Disruption Lead to Catastrophic Downtime

Krishna Dunthoori
Apty

Over the years, we've seen several high-profile examples of how even the slightest human error can induce devastating bouts of downtime. One infamous case came in February 2017, when Amazon's S3 service was knocked offline, cutting off service to social media platforms, web publishers, and other leading websites. The cause? A simple typo: an authorized employee intended to take a small number of servers offline to fix a problem with the billing system, but entered a command incorrectly and removed a large number of servers instead.

Within several hours, Amazon's S3 service was back online, but the incident had lasting ramifications. Numerous popular apps and websites were impacted, and the estimated cost to S&P 500 companies was $150 million, while US financial services companies lost an estimated $160 million in revenue.

Even for the average organization (i.e., one not of Amazon's size), the cost of application downtime stands at a staggering $5,600 per minute. Moreover, outages are continuing to increase, as more people within an organization are empowered to make changes to IT services. In fact, a large majority of all incidents reported to an IT service desk are caused by change.
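To put the per-minute figure in perspective, here is a minimal sketch of the arithmetic. It assumes the $5,600 average applies as a flat rate for the whole outage, which is a simplification; actual costs vary by organization, application, and time of day. The function name and the sample durations are illustrative only.

```python
# Average per-minute cost cited above, treated here as a flat rate (an assumption).
COST_PER_MINUTE_USD = 5_600


def estimated_downtime_cost(minutes: float,
                            cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return a rough downtime cost estimate in US dollars."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage lengths: a quarter hour, one hour, four hours.
    for minutes in (15, 60, 240):
        print(f"{minutes:>4} min -> ${estimated_downtime_cost(minutes):,.0f}")
```

At that rate, a single hour of downtime comes to $336,000, and a four-hour outage exceeds $1.3 million.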

IT Service Management (ITSM) solutions are widely available to help solve this problem, with incident management as one of their main pillars. Incident management enables the rapid identification, notification, and resolution of critical application outages, and provides a clear, documented process to follow if and when things go wrong. Yet implementations often fall short. The reported percentage of IT projects that result in failure depends on the article or survey you read, but most put the number at 55 to 75 percent. So why do so many ITSM implementations fail?

Like other software implementations, ITSM often suffers from a lack of user adoption. This is because people, by nature, are resistant to change. Sometimes, organizations and their training teams erroneously believe they can communicate once or twice about a new software implementation, deliver a round of training, and sit back and expect to realize software value. However, in prioritizing go-live, many training teams fail to properly support user adoption in the ensuing days and months, and adoption never reaches meaningful levels.

But in an incident response context, something else seems to be going on. Any strong emotion that temporarily impairs our thinking (anxiety, fear, or anger, for example) can result in a "brain freeze," or a temporary decline in cognitive functioning. So when an incident occurs, the ensuing panic among employees, who are likely unfamiliar with the ITSM solution anyway, makes the situation that much grimmer.

So how can organizations and training teams harness the full potential of ITSM solutions to maximize application uptime?

There are several areas to focus on, including:

Seamless onboarding and increasing user adoption - Organizations and their training teams need to simplify the ITSM onboarding process by providing real-time, in-app, context-driven guidance. This reduces the learning curve and eliminates the fear of embracing the new technology, while providing the right support at the right time.

Supporting change processes - Given the pace and frequency of change, context-driven guidance also makes it easier for ITSM users to implement changes with fewer risks and disruptions, so that changes are carried out much more smoothly.

Reducing the all-important mean time to repair (MTTR) - Especially in times of strain, context-driven guidance can also help ITSM users swiftly find information and efficiently resolve IT issues they don't encounter every day, by providing in-the-moment, step-by-step guidance. This leads to higher user productivity and satisfaction while minimizing service disruptions.
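For teams that want to track whether guidance is actually shortening repair times, MTTR can be computed as the average of resolution time minus open time across resolved incidents. The sketch below assumes each incident is available as a pair of timestamps; the function name and the sample data are hypothetical, and a real ITSM tool would expose these values through its own reports or exports.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    # Sum timedeltas starting from a zero timedelta, then divide by the count.
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical incidents lasting 30, 60, and 90 minutes.
    start = datetime(2025, 1, 6, 9, 0)
    sample = [
        (start, start + timedelta(minutes=30)),
        (start, start + timedelta(minutes=60)),
        (start, start + timedelta(minutes=90)),
    ]
    print("MTTR:", mean_time_to_repair(sample))
```

Comparing this number before and after a rollout gives a simple, if coarse, signal of whether adoption efforts are paying off.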

The Amazon S3 outage may seem like an extreme case of "breaking the internet." Yet it clearly highlights how the slightest change or error can induce disaster, and how fragile modern infrastructures are. These realities affect all organizations. Successfully implementing and training on ITSM, and specifically incident management as part of an ITSM approach, can be vital in avoiding expensive downtime when a disruption occurs. The key is to have ongoing training and guided risk management in place so there is little to no pause in response when the inevitable error or disruption happens. This is where solutions like digital adoption platforms (DAPs) come into play: they help reduce downtime from IT disruptions by ensuring seamless and efficient adoption of ITSM tools.

Krishna Dunthoori is Founder and CEO of Apty
