As we've seen, hardware is at the root of a large proportion of data center outages, and the costs and consequences are often exacerbated when VMs are affected. The best answer, therefore, is for IT pros to get back to basics.
Start with Part 1: Complacency Kills Uptime in Virtualized Environments
Just as drivers wearing seatbelts should still use turn signals (even though many don't), data center managers should continue to take the usual precautions to protect against equipment-related outages. Put simply:
Attend to the hardware
In the rush to implement the latest technologies, don't overlook the fundamentals, such as routine server maintenance, UPS tests and upgrades, and facility checks for hotspots, airflow problems, and other issues.
Integrate monitoring and response
Only about half of IT organizations rely on their monitoring tool or ticketing system to activate a response team. That is a lost opportunity to accelerate break/fix. So is the failure to use newer AI-driven hardware monitoring technologies, which are becoming highly accessible.
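As a rough illustration of what tying monitoring to response can look like, here is a minimal Python sketch in which a hardware alert automatically opens a ticket and, at a high enough severity, pages the on-call team. Everything in it is hypothetical: `HardwareAlert`, `Responder`, `handle_alert`, and the severity levels are stand-ins for whatever monitoring tool, ticketing system, and escalation policy an organization actually runs.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; real monitoring tools define their own.
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class HardwareAlert:
    host: str
    component: str
    severity: Severity
    message: str


@dataclass
class Responder:
    """Hypothetical stand-in for a ticketing system plus an on-call pager."""
    tickets: list = field(default_factory=list)
    pages: list = field(default_factory=list)

    def open_ticket(self, alert: HardwareAlert) -> int:
        # Ticket IDs are simply sequential in this sketch.
        ticket_id = len(self.tickets) + 1
        self.tickets.append((ticket_id, alert))
        return ticket_id

    def page_on_call(self, ticket_id: int, alert: HardwareAlert) -> None:
        self.pages.append((ticket_id, alert.host))


def handle_alert(alert, responder, page_threshold=Severity.CRITICAL):
    """Open a ticket for any WARNING or worse; page on-call at the threshold."""
    if alert.severity < Severity.WARNING:
        return None  # informational only, no ticket
    ticket_id = responder.open_ticket(alert)
    if alert.severity >= page_threshold:
        responder.page_on_call(ticket_id, alert)
    return ticket_id


responder = Responder()
handle_alert(HardwareAlert("db02", "disk", Severity.CRITICAL, "disk failed"), responder)
```

The point of the sketch is only that the alert itself triggers the ticket and the page, with no person in the loop to notice a dashboard and decide to escalate.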
Have parts on standby
It's no good searching for spares after a hardware failure occurs. Spare parts should be on site for mission-critical systems and available for quick delivery in other cases.
Invest in expertise
Having the right people with the right skills is essential. Unfortunately, today's tight IT labor market is making it difficult to find and afford talent. Data center managers should consider whether they have the budget to build comprehensive engineering capabilities in house or would be better off sourcing them from a partner.
It can be hard to manage these tasks on top of the many responsibilities that have been piled on data center personnel over the past decade. In many cases, the easiest and most affordable option is to hand off the bulk of the hardware to-do list to a third-party provider specializing in IT support. That way, someone else addresses the risk associated with hardware through 24/7 monitoring, spares management, and immediate Level 3 support while the business gets back to business.