The global pandemic has radically changed the way enterprise IT services are produced, consumed, and managed. It has also exposed a glaring gap between the "haves and have-nots" among software development and operations teams.
Engineering teams riding the CI/CD and DevOps waves are now seeing the full potential and purpose of those practices. However, newly distributed operations teams are struggling to cope with the sudden shift to WFH (work from home). As a VP of IT operations of a large enterprise told me, "We are in a survival mode with bare minimum tools to cope with. We are fighting a gun battle with swords." This is because IT operations teams, unlike software and engineering teams, were traditionally set up to work from centralized locations. Some organizations have overcome that by implementing AIOps (artificial intelligence for IT operations) solutions; others are using the brute-force method of employing more IT operations analysts to keep the distributed NOCs (network operations centers) going.
IT Operations Teams Were Already Stressed
Even before the pandemic started this "new normal" mode of operations, IT operations teams were under pressure to deliver more with less. According to a survey of 1,300 IT professionals by BigPanda from earlier this year:
■ Innovation and CI/CD culture have increased normal operational workloads by 50%. A majority of those surveyed (53%) expect their NOC/ITOps workloads to increase even more in the next two years.
■ ITOps & NOC teams experience fast-moving IT stacks. These technology changes — whether they were necessitated by faster development needs, or were hyper-scale architecture based changes, or technical debt based — almost always require additional training and insights into the stacks as well as additional qualified analysts.
■ About 47% of respondents see constant application and code changes and 39% experience constant infrastructure changes — most of them see multiple daily changes, sometimes even hourly changes.
To keep up with this, ITOps teams have requested more budget, more automation tools, and more qualified analysts. However, very surprisingly,
■ 56% of them expected their IT budgets to stay flat, and 21% expected their IT operations budgets to shrink.
Over the last few years, software design, development, and testing teams transitioned away from the traditional model to a remote work alternative. Although many corporations have recently decided to promote a face-to-face collaboration culture, those teams had a mechanism to fall back on when the pandemic hit. Operations teams, however, almost always worked from centralized network or security operations centers (NOC/SOC) and had no such setup in place for working remotely if needed.
IT Operations Teams Were Not Set Up to Work Remotely
The coronavirus pandemic has created an even more stressful situation for the IT operations teams.
1. IT operations teams have become highly distributed and lost nearly all of their NOC privileges almost overnight. These include, but are not limited to, the visual health of systems on large wall monitors, immediate availability of experts in the same room for advice, and quick collaborative decision making to solve critical issues in real time.
2. DevOps teams are set up to push agile releases virtually, and with everyone working from home, their release frequency has risen well above normal.
3. IT Ops teams might see a reduction in personnel and efficiency due to illness, self-isolation, and layoffs, and they are not properly set up to replicate centralized NOC teams in a remote, distributed working environment.
4. To keep up with remote working, CIOs are forced to spend more money on infrastructure services that were not previously budgeted. According to Gartner, cloud-based telephony/messaging and conferencing will see high levels of spending, up 8.9% and 24.3% respectively. With additional spending on VPNs, virtual desktops, hardware upgrades, standup desks for employees, and extra security software for remote work, CIOs have even less money to spend on other things, such as hiring additional IT operations analysts.
5. Workloads have become more distributed. The DevOps teams are working crazy hours, in crazy locations, and they are making some crazy changes without keeping the operations teams in the loop. Enterprises are still not ready to measure the increased workloads and the employee stress they cause, because they are still underwater coping with the changes a distributed workforce brings.
With the new budget crunch caused by the economic impact, many IT teams that were already under heavy strain have slashed their IT operations staff considerably just to stay alive. This adds even more stress to IT operations teams, which are expected to do more with much less.
Prepare for the Future as This Too Shall Pass
The forward-looking enterprises are already considering moving from survival mode to thriving mode. They are setting up the necessary tools, visibility, compliance, and control for operations teams so that in the future, whether working remotely or in person, they can cope with disruptions and deliver in sync with development and engineering teams. Now the Ops teams can remotely monitor, diagnose, and maintain hyper-scale hybrid cloud systems if needed. While this pandemic may end sometime in the future, there will be other situations that will require IT operations teams to work remotely. By preparing for those situations, enterprises can survive future disruptions and enable operations teams to work efficiently if the situation arises. And, as a bonus, opening up remote locations will allow enterprises to hire more qualified IT analysts without the limitation to hiring only in specific locations.
The bottom line is that old-fashioned IT with old-fashioned thinking can lead to disaster. Reduced budgets, reduced resources, increased workloads, and added stress could lead to an unsustainable spiral. If the CIOs can't support the digital dependency from anywhere during the pandemic and beyond, the business will eventually fail.
Because of the remote working situation, the number of daily incidents has gone up. In some verticals, such as online learning, entertainment services, and collaborative tools, incident levels have gone up 10x. Security flaws in some of those online collaboration tools were exposed under high volumes. Between dealing with those incidents and keeping up with the development and DevOps teams pushing changes to fix them, IT operations analyst jobs have now become the most stressful jobs in IT.
Here are some of the things enterprises can do to mitigate the situation:
1. If at all possible, stop supporting non-critical business applications. This will free up a lot of support time.
2. Prioritize solving business-critical issues (such as scalability, security flaws, etc.) over non-critical issues as well as feature requests. They can wait.
3. Automate the IT processes as much as possible. The IT teams should be set up to find and solve issues efficiently.
4. Synchronize development and the IT Ops teams. Unless the Ops teams are aware of things that broke the system, they might be looking in the wrong places to solve issues.
5. Use ML, AI, and AIOps to reduce the noise (that is, multiple alerts and tickets for the same incident) so teams can avoid distractions, spot early warnings, and concentrate on real issues. A properly implemented AIOps solution can reduce alert volume by 95% or more and keep teams from feeling overwhelmed by "alert fatigue."
6. Automate the routing of incidents to the right resource quickly rather than escalating through multiple levels of support.
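To make the last two items concrete, here is a minimal sketch of what noise reduction and automated routing can look like at their simplest. It is not how any particular AIOps product works: real solutions use far richer correlation than this. The alert fields, the five-minute window, the service names, and the routing table are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    service: str      # hypothetical field: the affected service
    check: str        # hypothetical field: the failing check
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    check: str
    first_seen: float
    last_seen: float
    alert_count: int = 1


# Hypothetical routing table: service name -> owning on-call team.
ROUTES: Dict[str, str] = {
    "checkout": "payments-oncall",
    "vpn": "network-oncall",
}
DEFAULT_TEAM = "noc-triage"  # assumed fallback when no owner is known


def correlate(alerts: List[Alert], window_seconds: float = 300.0) -> List[Incident]:
    """Collapse alerts sharing (service, check) into one incident when each
    arrives within window_seconds of the previous alert for that pair."""
    open_incidents: Dict[Tuple[str, str], Incident] = {}
    incidents: List[Incident] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        current = open_incidents.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            # Same problem still firing: fold the alert into the open incident.
            current.last_seen = alert.timestamp
            current.alert_count += 1
        else:
            # First alert, or the previous burst went quiet: open a new incident.
            current = Incident(alert.service, alert.check, alert.timestamp, alert.timestamp)
            open_incidents[key] = current
            incidents.append(current)
    return incidents


def route(incident: Incident) -> str:
    """Send the incident straight to the owning team, skipping tiered escalation."""
    return ROUTES.get(incident.service, DEFAULT_TEAM)


sample_alerts = [
    Alert("checkout", "latency", 0.0),
    Alert("checkout", "latency", 60.0),
    Alert("checkout", "latency", 120.0),
    Alert("vpn", "tunnel_down", 30.0),
    Alert("wiki", "disk_full", 45.0),
    Alert("checkout", "latency", 1000.0),
]
incidents = correlate(sample_alerts)
for incident in incidents:
    print(route(incident), incident.service, incident.check, incident.alert_count)
```

In this toy run, six alerts collapse into four incidents, and each incident goes directly to a named team or to the fallback triage queue. The point is the shape of the workflow (group first, then route), not the specific grouping rule.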