The global pandemic has radically changed the way enterprise IT services are produced, consumed, and managed. It has also exposed a glaring difference between the "haves and have-nots" among software development and operations teams.
Engineering teams riding the CI/CD and DevOps waves are now starting to see the full potential and purpose of those practices. However, newly distributed operations teams are struggling to cope with the sudden shift to working from home (WFH). As a VP of IT operations at a large enterprise told me, "We are in a survival mode with bare minimum tools to cope with. We are fighting a gun battle with swords." This is because IT operations teams, unlike software and engineering teams, were traditionally set up to work from centralized locations. Some organizations have overcome that by implementing AIOps (artificial intelligence for IT operations) solutions; others are using the brute-force method of employing more IT operations analysts to keep the distributed NOCs (network operations centers) going.
IT Operations Teams Were Already Stressed
Even before the pandemic started this "new normal" mode of operations, IT operations teams were under pressure to deliver more with less. According to a survey of 1,300 IT professionals by BigPanda from earlier this year:
■ Innovation and CI/CD culture have increased normal operational workloads by 50%. A majority of those surveyed (53%) expect their NOC/ITOps workloads to increase even more in the next two years.
■ ITOps & NOC teams experience fast-moving IT stacks. These technology changes, whether driven by faster development needs, hyper-scale architecture, or technical debt, almost always require additional training and insight into the stacks, as well as additional qualified analysts.
■ About 47% of respondents see constant application and code changes, and 39% experience constant infrastructure changes. Most of them see multiple changes daily, sometimes even hourly.
To keep up with this, ITOps teams have requested more budget, more automation tools, and more qualified analysts. Surprisingly, however:
■ 56% of them expected their IT budgets to stay flat, and 21% expected their IT operations budgets to shrink.
Over the last few years, software design, development, and testing teams transitioned away from the traditional model to a remote work alternative. Though many corporations had recently decided to promote a face-to-face collaborative culture, those teams had a mechanism to fall back on when the pandemic hit. The operations teams, however, were almost always working from centralized network or security operations centers (NOC/SOC) and had no such setup in place to work remotely if needed.
IT Operations Teams Were Not Set Up to Work Remotely
The coronavirus pandemic has created an even more stressful situation for the IT operations teams.
1. The IT operations teams have become highly distributed and lost almost all of their NOC privileges almost overnight. These include, but are not limited to, a visual view of system health on large wall-mounted monitors, immediate availability of experts in the same room for advice, and quick collaborative decision making to solve critical issues in real time.
2. The DevOps teams are set up to push agile releases virtually, and with everyone working from home, their release frequency has risen well above normal.
3. The IT Ops teams might see a reduction in personnel and efficiency due to illness, self-isolation, and layoffs, and they are not properly set up to mimic centralized NOC teams in a remote, distributed working environment.
4. To keep up with remote work, CIOs are forced to spend more money on infrastructure services that were not previously budgeted. According to Gartner, cloud-based telephony/messaging and conferencing will see high levels of spending, up 8.9% and 24.3% respectively. With additional spending on VPNs, virtual desktops, hardware upgrades, standing desks for employees, and extra security software for remote work, CIOs have even less money to spend on other things, such as hiring additional IT operations analysts.
5. Workloads have become more distributed. The DevOps teams are working crazy hours, in crazy locations, and they are making some crazy changes without keeping the operations teams in the loop. Enterprises are still not ready to measure the increased workloads and the employee stress they cause, because they are still underwater coping with the distributed workforce changes.
With the new budget crunch caused by the economic impact, many IT organizations that were already under heavy strain have slashed their IT operations staff considerably just to stay alive. This adds even more stress to the IT operations teams, which must now do more with much less.
Prepare for the Future as This Too Shall Pass
Forward-looking enterprises are already considering the move from survival mode to thriving mode. They are setting up the necessary tools, visibility, compliance, and control for operations teams so that in the future, whether working remotely or in person, those teams can cope with disruptions and deliver in sync with development and engineering teams. With that groundwork in place, Ops teams can remotely monitor, diagnose, and maintain hyper-scale hybrid cloud systems if needed. While this pandemic may end sometime in the future, there will be other situations that require IT operations teams to work remotely. By preparing for those situations, enterprises can survive future disruptions and enable operations teams to work efficiently when the need arises. And, as a bonus, opening up remote positions will allow enterprises to hire more qualified IT analysts without being limited to specific locations.
The bottom line is that old-fashioned IT with old-fashioned thinking can lead to disaster. Reduced budgets, reduced resources, increased workloads, and added stress could lead to an unsustainable spiral. If the CIOs can't support the digital dependency from anywhere during the pandemic and beyond, the business will eventually fail.
Because of the remote working situation, the number of daily incidents has gone up. In some verticals, such as online learning, entertainment services, and collaborative tools, incident levels have gone up 10x. High volumes also exposed security flaws in some of those online collaborative tools. Between dealing with those incidents and keeping up with the development and DevOps teams pushing changes to fix them, IT operations analyst jobs have now become the most stressful jobs in IT.
Here are some of the things enterprises can do to mitigate the situation:
1. If at all possible, stop supporting non-critical business applications. This will free up a lot of support time.
2. Prioritize solving business-critical issues (such as scalability and security flaws) over non-critical issues and feature requests. Those can wait.
3. Automate the IT processes as much as possible. The IT teams should be set up to find and solve issues efficiently.
4. Synchronize development and the IT Ops teams. Unless the Ops teams are aware of things that broke the system, they might be looking in the wrong places to solve issues.
5. Use ML, AI, and AIOps to reduce the noise (that is, multiple alerts and tickets for the same incident) so teams can avoid distractions, spot early warnings, and concentrate on real issues. A properly implemented AIOps solution can reduce alert volume by 95% or more and keep teams from being overwhelmed by "alert fatigue."
6. Automate the routing of incidents to the right resource quickly rather than escalating through multiple levels of support.
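To make items 5 and 6 a little more concrete, here is a minimal Python sketch of the underlying idea: repeated alerts for the same problem are collapsed into one incident, and each incident is sent straight to an owning team. It is only an illustration, not how any particular AIOps product works. The names `Alert`, `Incident`, `group_alerts`, `route`, the routing table, and the five-minute window are all hypothetical; real solutions typically use far richer correlation (topology, change data, ML models) than a simple service-and-check key.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str      # hypothetical: name of the affected service
    check: str        # hypothetical: name of the failing check
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    check: str
    first_seen: float
    last_seen: float
    count: int = 1    # how many raw alerts were folded into this incident


# Hypothetical routing table: service name -> owning team or queue.
ROUTES = {"payments": "payments-oncall", "network": "noc-tier2"}
DEFAULT_ROUTE = "ops-triage"


def group_alerts(alerts, window_seconds=300.0):
    """Collapse repeated alerts for the same (service, check) into incidents.

    An alert joins the latest incident with the same key if it arrives within
    window_seconds of that incident's most recent alert; otherwise it opens
    a new incident.
    """
    latest_by_key = {}
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        current = latest_by_key.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.last_seen = alert.timestamp
            current.count += 1
        else:
            current = Incident(alert.service, alert.check,
                               alert.timestamp, alert.timestamp)
            latest_by_key[key] = current
            incidents.append(current)
    return incidents


def route(incident):
    """Pick the owning team directly instead of escalating level by level."""
    return ROUTES.get(incident.service, DEFAULT_ROUTE)
```

Calling `group_alerts` on a burst of identical alerts yields a single incident with a `count`, and `route` then names the team that should receive it. Services missing from the table fall back to a general triage queue, so nothing is dropped.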