Don't Let an IT Service Disruption Lead to Catastrophic Downtime

Krishna Dunthoori
Apty

Over the years, we've seen several high-profile examples of how even the slightest human error can induce devastating bouts of downtime. One infamous example came several years ago, when Amazon's S3 service was knocked offline, obliterating service to social media platforms, web publishers, and other leading websites. The cause? A simple typo — an authorized employee intended to take a small number of servers offline to fix a problem with the billing system, but accidentally entered a command incorrectly and removed a large number of servers instead.

Within several hours, Amazon's S3 service was back online, but the incident had lasting ramifications. Numerous popular apps and websites were impacted, and the estimated cost to S&P 500 companies was $150 million, while US financial services companies lost an estimated $160 million in revenue.

Even for the average organization (i.e., one not of Amazon's size), the cost of application downtime stands at a staggering $5,600 per minute. Moreover, outages are continuing to increase, as more people within an organization are empowered to make changes to IT services. In fact, a large majority of all incidents reported to an IT service desk are caused by change.
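As a rough illustration of what that per-minute figure implies, the arithmetic can be sketched in a few lines of Python. Only the $5,600 per minute rate comes from the text above; the function name and the example outage durations are illustrative assumptions, and real costs vary widely from one organization to the next.

```python
# Average cost of application downtime cited in the article (USD per minute).
COST_PER_MINUTE_USD = 5_600


def estimate_downtime_cost(minutes: float,
                           cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return a rough cost estimate for an outage lasting `minutes` minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage durations, chosen only for illustration.
    for minutes in (15, 60, 240):
        print(f"{minutes:>4} min -> ${estimate_downtime_cost(minutes):,.0f}")
```

At this rate, a one-hour outage works out to roughly $336,000.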

IT Service Management (ITSM) solutions are widely available to help solve this problem, with incident management as one of their main pillars. Incident management enables the rapid identification, notification, and resolution of critical application outages, and provides a clear, documented process to follow if and when things go wrong. Yet availability does not guarantee success. The reported percentage of IT projects that result in failure depends on the article or survey you read, but most put the number at 55 to 75 percent. So why do so many ITSM implementations fail?

Like other software implementations, ITSM often suffers from a lack of user adoption. This is because people, by nature, are resistant to change. Sometimes, organizations and their training teams erroneously believe they can communicate once or twice about a new software implementation, deliver a round of training, and sit back and expect to realize software value. However, in prioritizing go-live, many training teams fail to properly support user adoption in the ensuing days and months, and adoption never reaches meaningful levels.

But in an incident response context, something else seems to be going on. Any strong emotion that temporarily impairs our thinking (anxiety, fear, or anger, for example) can result in a "brain freeze," or a temporary decline in cognitive functioning. So when an incident occurs, the ensuing panic among employees, who are likely unfamiliar with the ITSM solution anyway, makes the situation that much more grim.

So how can organizations and training teams harness the full potential of ITSM solutions to maximize application uptime?

There are several areas to focus on, including:

Seamless onboarding and increasing user adoption - Organizations and their training teams need to simplify the ITSM onboarding process by providing real-time, in-app, context-driven guidance. This reduces the learning curve and eliminates the fear of embracing the new technology, while providing the right support at the right time.

Supporting change processes - Given the pace and frequency of change, context-driven guidance also makes it easier for ITSM users to implement changes with fewer risks and disruptions, so that changes are carried out much more smoothly.

Reducing the all-important mean time to repair (MTTR) - Especially in times of strain, in-the-moment, step-by-step guidance can help ITSM users swiftly find information and efficiently resolve IT issues they don't encounter every day. This leads to greater user productivity and satisfaction while minimizing service disruptions.
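To know whether any of this is actually lowering MTTR, the metric has to be measured consistently. Here is a minimal sketch, assuming each incident is logged with a detection time and a resolution time. The `Incident` record and the sample data are hypothetical and not drawn from any particular ITSM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with detection and resolution times."""
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average time from detection to resolution across the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Sample data for illustration only: one 30-minute and one 90-minute incident.
    sample = [
        Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
        Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
    ]
    print(mean_time_to_repair(sample))
```

Comparing this figure before and after a rollout gives a simple, if coarse, indication of whether in-app guidance is helping responders resolve issues faster.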

The Amazon S3 outage may seem like an egregious case of "breaking the internet." Yet it clearly highlights how the slightest change or error can induce disaster, and how fragile modern infrastructures are; these realities affect all organizations. Successfully implementing and training on ITSM, and specifically incident management as part of an ITSM approach, can be vital in avoiding expensive downtime when a disruption occurs. The key is to have ongoing training and guided risk management in place so there is little to no pause in response when the inevitable error or disruption happens. This is where solutions like digital adoption platforms (DAPs) come into play, helping to minimize downtime from IT disruptions by ensuring seamless and efficient adoption of ITSM tools.

Krishna Dunthoori is Founder and CEO of Apty

