
Don't Let an IT Service Disruption Lead to Catastrophic Downtime

Krishna Dunthoori
Apty

Over the years, we've seen several high-profile examples of how even the slightest human error can induce devastating bouts of downtime. One infamous example came in February 2017, when Amazon's S3 service was knocked offline, disrupting social media platforms, web publishers, and other leading websites. The cause? A simple typo: an authorized employee intended to take a small number of servers offline to fix a problem with the billing system, but entered a command incorrectly and removed a large number of servers instead.

Within several hours, Amazon's S3 service was back online, but the incident had lasting ramifications. Numerous popular apps and websites were impacted, and the estimated cost to S&P 500 companies was $150 million, while US financial services companies lost an estimated $160 million in revenue.

Even for the average organization (i.e., one not of Amazon's size), the cost of application downtime stands at a staggering $5,600 per minute. Moreover, outages are continuing to increase, as more people within an organization are empowered to make changes to IT services. In fact, a large majority of all incidents reported to an IT service desk are caused by change.
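To make the per-minute figure concrete, here is a minimal sketch of the arithmetic in Python. The $5,600 rate is the one quoted above; the outage durations and the function name `downtime_cost` are hypothetical, chosen only for illustration.

```python
# Illustrative only: the per-minute rate is the figure quoted in the article;
# the outage durations below are made-up inputs.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


for minutes in (15, 60, 240):
    print(f"{minutes:>4} min -> ${downtime_cost(minutes):,.0f}")
```

At this rate a one-hour outage comes to $336,000, which is why shaving even a few minutes off the response matters.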

IT Service Management (ITSM) solutions are widely available to help solve this problem, with incident management as one of their main pillars. Incident management enables the rapid identification, notification, and resolution of critical application outages, and provides a clear, documented process to follow if and when things go wrong. Yet ITSM rollouts, like other IT projects, often fall short. The reported percentage of IT projects that result in failure depends on the article or survey you read, but most put the number at 55 to 75 percent. So why do so many ITSM implementations fail?

Like other software implementations, ITSM often suffers from a lack of user adoption. This is because people, by nature, are resistant to change. Sometimes, organizations and their training teams erroneously believe they can communicate once or twice about a new software implementation, deliver a round of training, and sit back and expect to realize software value. However, in prioritizing go-live, many training teams fail to properly support user adoption in the ensuing days and months, and adoption never reaches meaningful levels.

But in an incident response context, something else seems to be going on. Any strong emotion that temporarily impairs our thinking (anxiety, fear, or anger, for example) can result in a "brain freeze," or a temporary decline in cognitive functioning. So when an incident occurs, the ensuing panic among employees, who are likely unfamiliar with the ITSM solution anyway, makes the situation that much more grim.

So how can organizations and training teams harness the full potential of ITSM solutions to maximize application uptime?

There are several areas to focus on, including:

Seamless onboarding and increasing user adoption - Organizations and their training teams need to simplify the ITSM onboarding process by providing real-time, in-app, context-driven guidance. This reduces the learning curve and eliminates the fear of embracing the new technology, while providing the right support at the right time.

Supporting change processes - Given the pace and frequency of change, context-driven guidance also makes it easier for ITSM users to implement changes with fewer risks and disruptions, so that changes are carried out much more smoothly.

Reducing the all-important mean time to repair (MTTR) - Especially in times of strain, context-driven guidance can also help ITSM users swiftly find information and efficiently resolve the IT issues they don't encounter every day, by providing in-the-moment, step-by-step guidance. This leads to higher user productivity and satisfaction while minimizing service disruptions.
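For readers who want to track the MTTR mentioned in the last point, the following is a rough sketch of one common way to compute it. It assumes MTTR is measured from detection to resolution and averaged across incidents; definitions vary between teams and tools. The function name `mean_time_to_repair` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over (detected_at, resolved_at) datetime pairs.

    Assumption: repair time runs from detection to resolution.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum works on timedelta objects
    )
    return total / len(incidents)


# Hypothetical incident log: one 30-minute and one 90-minute repair.
sample = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
print(mean_time_to_repair(sample))
```

Comparing this figure before and after a guidance rollout is one simple way to check whether responders are actually resolving incidents faster.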

The Amazon S3 outage may seem like an extreme case of "breaking the internet." Yet it clearly highlights how the slightest change or error can induce disaster, as well as the fragility of modern infrastructures. These realities affect all organizations. Successfully implementing and training on ITSM, and specifically incident management as part of an ITSM approach, can be vital in avoiding expensive downtime when a disruption occurs. The key is to have ongoing training and guided risk management in place so there is little to no pause in response when the inevitable error or disruption happens. This is where solutions like digital adoption platforms (DAPs) come into play: by ensuring seamless and efficient adoption of ITSM tools, they help limit the downtime that follows an IT disruption.

Krishna Dunthoori is Founder and CEO of Apty
