The global pandemic has radically changed the way enterprise IT services are produced, consumed, and managed. It has also exposed a glaring difference between the "haves and have-nots" among software development and operations teams.
Engineering teams riding the CI/CD and DevOps waves are now seeing the full potential and purpose of those practices. However, newly distributed operations teams are struggling to cope with the sudden change to WFH (work from home). As a VP of IT operations at a large enterprise told me, "We are in a survival mode with bare minimum tools to cope with. We are fighting a gun battle with swords." This is because IT operations teams, unlike software and engineering teams, were traditionally set up to work from centralized locations. Some organizations have overcome that by implementing AIOps (artificial intelligence for IT operations) solutions; others are using the brute-force method of employing more IT operations analysts to keep the distributed NOCs (network operations centers) going.
IT Operations Teams Were Already Stressed
Even before the pandemic started this "new normal" mode of operations, IT operations teams were stressed to deliver more with less. According to a survey of 1300 IT professionals by BigPanda from earlier this year:
■ Innovation and CI/CD culture have increased normal operational workloads by 50%. The majority of the surveyed (53%) expect their NOC/ITOps workloads to increase even more in the next two years.
■ ITOps and NOC teams face fast-moving IT stacks. These technology changes, whether driven by faster development needs, hyper-scale architecture, or technical debt, almost always require additional training and insight into the stacks, as well as additional qualified analysts.
■ About 47% of respondents see constant application and code changes, and 39% experience constant infrastructure changes. Most of them see multiple changes daily, sometimes even hourly.
To keep up with this, ITOps teams have requested more budget, more automation tools, and more qualified analysts. Surprisingly, however:
■ 56% of them expected their IT budgets to stay flat. And 21% expected their IT operations budgets to shrink.
■ Worldwide IT spending is projected to fall to $3.4 trillion in 2020, down 8% from 2019, according to Gartner.
Over the last few years, software design, development, and testing teams transitioned away from the traditional model to a remote work alternative. Although many corporations had recently chosen to promote a face-to-face collaboration culture, those teams had a remote setup to fall back on when the pandemic hit. The operations teams, however, almost always worked from centralized network or security operations centers (NOC/SOC) and had no such setup in place to work remotely if needed.
IT Operations Teams Were Not Set Up to Work Remotely
The coronavirus pandemic has created an even more stressful situation for the IT operations teams.
1. The IT operations teams have become highly distributed and lost nearly all of their NOC privileges almost overnight. These include, but are not limited to, a visual view of system health on large wall monitors, immediate availability of experts in the same room for advice, and quick collaborative decision making to solve critical issues in real time.
2. The DevOps teams are set up to push agile releases virtually, and with working from home, their release frequency has risen well above normal.
3. The IT Ops teams might see a reduction in personnel and efficiency due to illness, self-isolation, and lay-offs, and they are not properly set up to mimic NOC centralized teams in a remote distributed working environment.
4. To keep up with working remotely, CIOs are forced to spend more money on infrastructure services that were not previously budgeted. According to Gartner, cloud-based telephony/messaging and conferencing will see high levels of spending, up 8.9% and 24.3%, respectively. Additionally, with increased spending on VPNs, virtual desktops, hardware upgrades, standing desks for employees, and additional security software for remote work, CIOs have even less money to spend on other things, such as hiring additional IT operations analysts.
5. Workloads have become more distributed. The DevOps teams are working crazy hours, in crazy locations, and they are making some crazy changes without keeping the operations teams in the loop. Enterprises are still not ready to measure the increased workloads and employee stress that is caused by it as they are still underwater coping with the distributed workforce changes.
With the new budget crunch caused by the economic impact, many IT organizations that were already under heavy strain have slashed their IT operations staff considerably just to stay alive. This adds further stress to the IT operations teams, which must now do more with much less.
Prepare for the Future as This Too Shall Pass
The forward-looking enterprises are already considering moving from survival mode to thriving mode. They are setting up the necessary tools, visibility, compliance, and control for operations teams so that in the future, whether working remotely or in person, they can cope with disruptions and deliver in sync with development and engineering teams. Now the Ops teams can remotely monitor, diagnose, and maintain hyper-scale hybrid cloud systems if needed. While this pandemic may end sometime in the future, there will be other situations that will require IT operations teams to work remotely. By preparing for those situations, enterprises can survive future disruptions and enable operations teams to work efficiently if the situation arises. And, as a bonus, opening up remote locations will allow enterprises to hire more qualified IT analysts without the limitation to hiring only in specific locations.
The bottom line is that old-fashioned IT with old-fashioned thinking can lead to disaster. Reduced budgets, reduced resources, increased workloads, and added stress could lead to an unsustainable spiral. If the CIOs can't support the digital dependency from anywhere during the pandemic and beyond, the business will eventually fail.
Because of the remote working situation, the number of daily incidents has gone up. In some verticals, such as online learning, entertainment services, and collaborative tools, incident levels have gone up 10x. Security flaws in some of those online collaborative tools were exposed under high volumes. Between dealing with those incidents and keeping up with the development and DevOps teams pushing changes to fix them, IT operations analyst jobs have now become the most stressful jobs in IT.
Here are some of the things enterprises can do to mitigate the situation:
1. If at all possible, stop supporting non-critical business applications. This will free up a lot of support time.
2. Prioritize solving business-critical issues (such as scalability, security flaws, etc.) over non-critical issues as well as feature requests. They can wait.
3. Automate the IT processes as much as possible. The IT teams should be set up to find and solve issues efficiently.
4. Synchronize development and the IT Ops teams. Unless the Ops teams are aware of things that broke the system, they might be looking in the wrong places to solve issues.
5. Use ML, AI, and AIOps to reduce the noise (that is, multiple alerts and tickets for the same incident) so teams can avoid distractions, spot early warnings, and concentrate on real issues. A properly implemented AIOps solution can reduce alert volume by 95% or more and keep teams from feeling overwhelmed by "alert fatigue."
6. Automate the routing of incidents to the right resource quickly rather than escalating through multiple levels of support.
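As a rough illustration of the last two items, the sketch below folds repeated alerts for the same problem into a single incident and then routes each incident straight to an owning team. It is a minimal sketch under stated assumptions, not how any particular AIOps product works: the Alert fields, the five-minute grouping window, the ROUTES table, and the team names are all hypothetical, and real tools typically use much richer correlation than an exact match on service and check.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str      # hypothetical field: the affected service
    check: str        # hypothetical field: the failing check
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    check: str
    alerts: list = field(default_factory=list)


# Hypothetical routing table: service name -> owning team.
ROUTES = {
    "payments-api": "payments-oncall",
    "vpn-gateway": "network-oncall",
}
DEFAULT_TEAM = "noc-triage"  # assumed fallback for unmapped services


def group_alerts(alerts, window_seconds=300):
    """Fold alerts that share (service, check) into one incident, as long as
    each alert arrives within window_seconds of the previous one in its group."""
    open_incidents = {}  # (service, check) -> most recent incident for that key
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        current = open_incidents.get(key)
        gap_too_long = (
            current is not None
            and alert.timestamp - current.alerts[-1].timestamp > window_seconds
        )
        if current is None or gap_too_long:
            # Start a new incident for this key.
            current = Incident(alert.service, alert.check)
            open_incidents[key] = current
            incidents.append(current)
        current.alerts.append(alert)
    return incidents


def route(incident):
    """Send the incident directly to its owning team, skipping tiered escalation."""
    return ROUTES.get(incident.service, DEFAULT_TEAM)


if __name__ == "__main__":
    sample = [
        Alert("payments-api", "latency", 0),
        Alert("payments-api", "latency", 60),
        Alert("payments-api", "latency", 120),
        Alert("vpn-gateway", "packet-loss", 90),
        Alert("wiki", "disk", 30),
    ]
    for incident in group_alerts(sample):
        print(route(incident), incident.service, incident.check, len(incident.alerts))
```

In this toy sample, five alerts collapse into three incidents, and each one lands with a single team. The grouping window is the main tuning knob: too short and one outage becomes many tickets, too long and unrelated problems get merged.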