In today's complex, dynamic IT environments, the proliferation of disparate IT Ops, NOC, DevOps, and SRE teams and tools is a given, and is usually considered a necessity. As a result, when an incident happens, the biggest challenge is often collaboration among these teams to understand what happened and resolve the issue. Inefficiencies at this critical stage can greatly increase how much each incident costs the business.
I recently sat down (virtually) with Sid Roy, VP of Client Services at Scicom, to get a deeper understanding of how IT leaders can more effectively size up these inefficiencies and eliminate them.
The Cost of IT Incidents
When asked what a minute of downtime costs, analysts and vendors give different answers, but they broadly agree on the order of magnitude: several thousand dollars per minute. With an average of 5 major incidents a month and an average resolution time of 6 hours, this easily amounts to millions of dollars a year.
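The arithmetic behind that estimate can be sketched as follows. The incident count and resolution time come from the figures above. The per-minute cost of $5,000 is an assumed value within the "several thousand dollars" range, and the sketch charges every minute of an incident at the full downtime rate, so treat the result as an upper bound rather than a measured figure.

```python
# Illustrative figures only. The per-minute cost is an assumption;
# the other two values are the averages quoted in the text.
COST_PER_MINUTE_USD = 5_000      # assumption, not a sourced figure
MAJOR_INCIDENTS_PER_MONTH = 5    # from the text
HOURS_PER_INCIDENT = 6           # from the text


def annual_downtime_cost(cost_per_minute, incidents_per_month, hours_per_incident):
    """Return the yearly cost of major incidents in dollars.

    Assumes every minute of an incident is billed at the full downtime rate.
    """
    minutes_per_incident = hours_per_incident * 60
    return cost_per_minute * minutes_per_incident * incidents_per_month * 12


annual = annual_downtime_cost(
    COST_PER_MINUTE_USD, MAJOR_INCIDENTS_PER_MONTH, HOURS_PER_INCIDENT
)
print(f"Estimated annual cost: ${annual:,}")  # Estimated annual cost: $108,000,000
```

Even at $1,000 per minute, the same inputs give $21.6 million a year, which is why shaving time off each incident matters so much.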
The three key drivers of these costs are:
Staffing and team member costs: This includes FTEs, consultants, and the overhead incurred when other teams are pulled in to deal with the incident. For many organizations, this can include offshore incident response teams.
The direct and indirect costs of an IT incident: This includes your infrastructure or capital expenditures like software licenses for monitoring, log and event management, notification, ticketing, collaboration, etc.
The business impact of an IT incident: This is one of the most challenging and unpredictable variable costs to calculate or manage, and is often the highest of the three drivers. It includes revenue that is lost, delayed, or reduced because of a major incident, as well as the financial effect of damage to brand or goodwill. It also includes inefficiencies suffered by other parts of the business when critical services they depend on are degraded or unavailable.
Fragmented Teams Magnify the Challenge
Incident volume, complexity, and throughput obviously affect the number of people and the amount of time needed to deal with incidents, and often drive more indirect costs as the needed resources pile up. To save these millions of dollars, teams need to collaborate effectively and lower MTTR. As mentioned above, this is a challenge in agile IT environments with many disparate teams and tools.
To help streamline operations, teams need to start asking and answering several key questions:
■ Do you have an up-to-date map of your critical services?
■ Are they prioritized by business criticality (revenue, number of customers, other supported services in the supply chain)?
■ What are the upstream and downstream dependencies of these applications?
■ Have you identified the major infrastructure and application elements in your environment?
■ Are you aligned with the owners of these systems?
■ Do you have real-time knowledge of changes being done to the infrastructure and applications?
■ Do you have monitoring gaps?
■ Which monitoring tools provide you with the best value?
Answering these questions involves overcoming fragmentation across teams of people, processes, and tools — essentially integrating ITSM and ITOM to enjoy the benefits of contextual full-stack visibility and streamlined processes within the organization.
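As a rough illustration of what an up-to-date, prioritized service map makes possible, the sketch below stores each service's business criticality, owner, and dependencies, then lists everything affected when one component fails. All service names, owners, and criticality scores are hypothetical, and a real map would normally be fed from a CMDB or discovery tool rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical service map: every name, owner, and score here is invented.
# criticality 1 = most business-critical.
SERVICE_MAP = {
    "checkout":     {"criticality": 1, "owner": "payments-team", "depends_on": ["payments-api", "catalog"]},
    "payments-api": {"criticality": 1, "owner": "payments-team", "depends_on": ["orders-db"]},
    "catalog":      {"criticality": 2, "owner": "web-team",      "depends_on": ["orders-db"]},
    "reporting":    {"criticality": 3, "owner": "bi-team",       "depends_on": ["orders-db"]},
    "orders-db":    {"criticality": 1, "owner": "dba-team",      "depends_on": []},
}


def impacted_services(service_map, failed):
    """Return services that depend on `failed`, directly or transitively,
    ordered from most to least business-critical."""
    # Invert the dependency edges: component -> services that rely on it.
    dependents = {}
    for name, info in service_map.items():
        for dep in info["depends_on"]:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk from the failed component to everything that relies on it.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return sorted(seen, key=lambda n: (service_map[n]["criticality"], n))


for name in impacted_services(SERVICE_MAP, "orders-db"):
    print(name, "owned by", SERVICE_MAP[name]["owner"])
```

With a map like this, the incident manager can see at a glance which owners to bring in and which services to restore first.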
The Right Combination
What is the right combination of the people, processes, and tools just discussed? Here are two main guidelines:
■ Set up a major incident management team to get the most from your existing staff.
This team includes three vital roles:
- The incident manager/incident response commander. A designated role in charge of declaring a major incident and taking ownership of it. Their job is to essentially stop the bleeding of revenue and costs.
- The NOC/monitoring team. This is your front line of defense. When things go bump in the night or boom in the day, they're the ones picking it up with their “eyes on the glass,” 24/7. They're also in charge of reporting and creating full situational awareness for the incident commander through bidirectional communications.
- Production support. This is the team that actually effects the required changes and executes the remediating actions.
■ Deploy event correlation and automation tools to enable the incident management team.
These tools are key, allowing your team to do all the above.
First, use machine learning (ML) and AI to correlate the alerts your monitoring and observability tools create into a drastically reduced number of high-level, insight-rich incidents. Add context to these incidents by ingesting and understanding topology sources as well. This creates the needed full-stack visibility and situational awareness.
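To make the idea of correlation concrete, here is a minimal sketch that folds raw alerts into incidents. It is a simple rule-based stand-in, not the ML-driven correlation described above: alerts are grouped when they hit the same service (looked up from a hypothetical host-to-service topology) within a five-minute window. The `Alert` type, host names, and window size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: int  # seconds since an arbitrary start point
    host: str
    message: str


# Hypothetical topology source: which service each host belongs to.
HOST_TO_SERVICE = {"web-1": "checkout", "web-2": "checkout", "db-1": "orders-db"}


def correlate(alerts, host_to_service, window_seconds=300):
    """Group alerts into incidents: same service, no gap longer than the window."""
    incidents = []
    open_incident = {}  # service -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # Topology adds context: unknown hosts fall back to their own name.
        service = host_to_service.get(alert.host, alert.host)
        incident = open_incident.get(service)
        if incident is None or alert.timestamp - incident["last_seen"] > window_seconds:
            incident = {"service": service, "alerts": [], "last_seen": alert.timestamp}
            incidents.append(incident)
            open_incident[service] = incident
        incident["alerts"].append(alert)
        incident["last_seen"] = alert.timestamp
    return incidents


alerts = [
    Alert(0, "web-1", "high latency"),
    Alert(60, "web-2", "HTTP 5xx rate above threshold"),
    Alert(90, "db-1", "connection pool exhausted"),
    Alert(1000, "web-1", "high latency"),
]
for incident in correlate(alerts, HOST_TO_SERVICE):
    print(incident["service"], len(incident["alerts"]), "alert(s)")
```

Four alerts collapse into three incidents here. Production tools apply the same principle at much larger scale and with learned rather than fixed rules.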
Then use ML and AI to determine the root cause of these incidents. This includes correlating them with data streams from your change tools (CI/CD, orchestration, change management, and auditing) to identify whether any changes made in your environment are causing these incidents.
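One simple way to picture change correlation is a lookup of recent changes that touched the affected service or anything it depends on. The sketch below does only that, using a fixed one-hour lookback instead of ML-based root cause analysis. The `Change` type, its fields, and the sample records are hypothetical and do not reflect any particular tool's data model.

```python
from dataclasses import dataclass


@dataclass
class Change:
    timestamp: int    # seconds since an arbitrary start point
    service: str
    source: str       # e.g. "ci/cd", "orchestration", "change-management"
    description: str


def suspect_changes(changes, incident_service, incident_start,
                    related_services=(), lookback_seconds=3600):
    """Return changes made to the incident's service (or related services)
    shortly before the incident began, most recent first."""
    scope = {incident_service, *related_services}
    candidates = [
        c for c in changes
        if c.service in scope
        and 0 <= incident_start - c.timestamp <= lookback_seconds
    ]
    return sorted(candidates, key=lambda c: c.timestamp, reverse=True)


changes = [
    Change(100, "orders-db", "change-management", "index rebuild"),
    Change(3000, "checkout", "ci/cd", "deploy of new build"),
    Change(3500, "reporting", "ci/cd", "dashboard update"),
    Change(4000, "checkout", "ci/cd", "rollback"),
]
for change in suspect_changes(changes, "checkout", incident_start=3600,
                              related_services=["orders-db"]):
    print(change.timestamp, change.service, change.description)
```

The unrelated `reporting` change and the rollback made after the incident began are filtered out, leaving a short list for the incident manager to review.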
Finally, automate as many manual processes as you can to free your IT Ops team from time-consuming tasks. By integrating with collaboration tools, you can also enable the bidirectional communications mentioned above.
We invite you to watch our webinar with Scicom — Incident Management: When duct tape and band-aids will no longer help — to deep dive and learn more.