The global pandemic has radically changed the way enterprise IT services are produced, consumed, and managed. It has also exposed a glaring difference between the "haves and have-nots" of software development and operations teams.
Engineering teams riding the CI/CD and DevOps waves are now starting to see the full potential and purpose of those practices. However, newly distributed operations teams are struggling to cope with the sudden change to the WFH (work from home) concept. As a VP of IT operations of a large enterprise told me, "We are in a survival mode with bare minimum tools to cope with. We are fighting a gun battle with swords." This is because IT operations teams were traditionally set up to work from centralized locations, unlike software and engineering teams. Some organizations have overcome that by implementing AIOps (artificial intelligence for IT operations) solutions; others are using the brute-force method of employing more IT operations analysts to keep the distributed NOCs (network operations centers) going.
IT Operations Teams Were Already Stressed
Even before the pandemic started this "new normal" mode of operations, IT operations teams were under pressure to deliver more with less. According to a survey of 1,300 IT professionals by BigPanda from earlier this year:
■ Innovation and CI/CD culture have increased normal operational workloads by 50%. A majority of those surveyed (53%) expect their NOC/ITOps workloads to increase even more in the next two years.
■ ITOps & NOC teams experience fast-moving IT stacks. These technology changes — whether they were necessitated by faster development needs, or were hyper-scale architecture based changes, or technical debt based — almost always require additional training and insights into the stacks as well as additional qualified analysts.
■ About 47% of respondents see constant application and code changes and 39% experience constant infrastructure changes — most of them see multiple daily changes, sometimes even hourly changes.
To keep up with this, ITOps teams have requested more budget, more automation tools, and more qualified analysts. However, very surprisingly,
■ 56% of them expected their IT budgets to stay flat. And 21% expected their IT operations budgets to shrink.
Over the last few years, software design, development, and testing teams transitioned away from the traditional model to a remote work alternative. Although many corporations have recently decided to promote a face-to-face collaboration culture, they had a mechanism to fall back on when the pandemic hit. However, operations teams almost always worked from centralized network or security operations centers (NOC/SOC) and had no such setups in place to work remotely if needed.
IT Operations Teams Were Not Set Up to Work Remotely
The coronavirus pandemic has created an even more stressful situation for the IT operations teams.
1. The IT operations teams have become very distributed and lost almost all of their NOC privileges almost overnight. These include, but are not limited to, visual health of systems on large wall-mounted monitors, immediate availability of experts in the same room for advice, and quick collaborative decision making to solve critical issues in real time.
2. The DevOps teams are set up to push agile releases virtually, and with working from home, their release frequency has risen well above normal.
3. The IT Ops teams might see a reduction in personnel and efficiency due to illness, self-isolation, and lay-offs, and they are not properly set up to mimic NOC centralized teams in a remote distributed working environment.
4. To keep up with working remotely, CIOs are forced to spend more money on infrastructure services that were not budgeted previously. According to Gartner, cloud-based telephony/messaging and conferencing will see high levels of spending, up 8.9% and 24.3% respectively. Additionally, with increased spending on VPN, virtual desktops, hardware upgrades, standup desks for employees, and additional security software for remote work, CIOs have even less money to spend on other things like hiring additional IT operations analysts.
5. Workloads have become more distributed. The DevOps teams are working crazy hours, in crazy locations, and they are making some crazy changes without keeping the operations teams in the loop. Enterprises are still not ready to measure the increased workloads and employee stress that is caused by it as they are still underwater coping with the distributed workforce changes.
With the new budget crunch caused by the economic impact, many IT organizations that were already under heavy strain have slashed their IT operations staff considerably just to stay alive. This adds even more stress to the IT operations teams, which must now do more with much less.
Prepare for the Future as This Too Shall Pass
Forward-looking enterprises are already considering moving from survival mode to thriving mode. They are setting up the necessary tools, visibility, compliance, and control for operations teams so that in the future, whether working remotely or in person, they can cope with disruptions and deliver in sync with development and engineering teams. With these in place, the Ops teams can remotely monitor, diagnose, and maintain hyper-scale hybrid cloud systems if needed. While this pandemic may end sometime in the future, there will be other situations that will require IT operations teams to work remotely. By preparing for those situations, enterprises can survive future disruptions and enable operations teams to work efficiently when the need arises. And, as a bonus, opening up remote locations will allow enterprises to hire more qualified IT analysts without being limited to specific locations.
The bottom line is that old-fashioned IT with old-fashioned thinking can lead to disaster. Reduced budgets, reduced resources, increased workloads, and added stress could lead to an unsustainable spiral. If the CIOs can't support the digital dependency from anywhere during the pandemic and beyond, the business will eventually fail.
Because of the remote working situation, the number of daily incidents has gone up. In some verticals, such as online learning, entertainment services, and collaborative tools, incident levels have gone up 10x. The security flaws of some of those online collaborative tools were exposed under high volumes. Between dealing with those incidents and keeping up with the development and DevOps teams pushing changes to fix them, IT operations analyst jobs have now become the most stressful jobs in IT.
Here are some of the things enterprises can do to mitigate the situation:
1. If at all possible, stop supporting non-critical business applications. This will free up a lot of support time.
2. Prioritize solving business-critical issues (such as scalability, security flaws, etc.) over non-critical issues as well as feature requests. They can wait.
3. Automate the IT processes as much as possible. The IT teams should be set up to find and solve issues efficiently.
4. Synchronize development and the IT Ops teams. Unless the Ops teams are aware of things that broke the system, they might be looking in the wrong places to solve issues.
5. Use ML, AI, and AIOps to reduce the noise (that is, multiple alerts and tickets for the same incident) so teams can avoid distractions, spot early warnings, and concentrate on real issues. A properly implemented AIOps solution can reduce alert volume by up to 95% and keep teams from being overwhelmed by "alert fatigue."
6. Automate the routing of incidents to the right resource quickly rather than escalating through multiple levels of support.
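To make the last two items concrete, here is a minimal sketch of noise reduction and automated routing. It is a simple rule-based stand-in, not how any particular AIOps product works: real solutions typically use ML-driven correlation rather than a fixed key and time window. The `Alert` and `Incident` types, the 300-second window, the `ROUTES` table, and the service and team names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    service: str      # hypothetical: name of the affected service
    check: str        # hypothetical: name of the failing check
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    check: str
    first_seen: float
    last_seen: float
    alert_count: int = 1


def correlate(alerts: List[Alert], window_seconds: float = 300.0) -> List[Incident]:
    """Fold repeated alerts into incidents.

    Alerts with the same (service, check) join the open incident as long as
    each one arrives within window_seconds of the previous one. The window
    value is an assumption; tune it to your environment.
    """
    incidents: List[Incident] = []
    open_incidents: Dict[Tuple[str, str], Incident] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        current = open_incidents.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            # Same problem still firing: extend the existing incident.
            current.last_seen = alert.timestamp
            current.alert_count += 1
        else:
            # First alert for this key, or the previous incident went quiet.
            current = Incident(alert.service, alert.check,
                               alert.timestamp, alert.timestamp)
            open_incidents[key] = current
            incidents.append(current)
    return incidents


# Hypothetical ownership table: service -> team that should get the incident.
ROUTES: Dict[str, str] = {
    "payments-api": "payments-oncall",
    "vpn-gateway": "network-oncall",
}
DEFAULT_TEAM = "noc-triage"  # fallback when no owner is known


def route(incident: Incident) -> str:
    """Send the incident straight to the owning team instead of escalating tier by tier."""
    return ROUTES.get(incident.service, DEFAULT_TEAM)


if __name__ == "__main__":
    sample = [
        Alert("payments-api", "latency", 0),
        Alert("payments-api", "latency", 60),
        Alert("payments-api", "latency", 120),
        Alert("vpn-gateway", "cpu", 30),
    ]
    for incident in correlate(sample):
        print(incident.service, incident.alert_count, "->", route(incident))
```

In this sample, three repeated latency alerts collapse into one incident, so an analyst sees two items instead of four, each already assigned to a team. The same idea scales to far noisier streams, which is where the large alert reductions come from.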