Resilience - The Modern Uptime Trinity
August 19, 2020

Terry Critchley
Author of "Making It in IT"

Some years ago, the key focus in computer systems was performance, and many articles, products and efforts were evident in this area. A few years later, the emphasis moved to high availability (HA) of hardware and software and all the machinations they entail. Today the focus is on (cyber)security.

Read Dr. Terry Critchley's full paper on Resilience

These discrete environments' boundaries have now blurred under the heading of resilience. The main components of resilience are:

1. Normal high availability (HA) design, redundancy etc., plus normal recovery from non-critical outages. This applies to hardware and software. Human factors ("fat finger" syndrome and deliberate malice) are extremely common causes of failure.

2. Cybersecurity breaches of all kinds. No hard system failures here but leaving a compromised system online is dangerous. This area has spawned the phrase cybersecurity resilience.

3. Disaster Recovery (DR), a discipline not in evidence, for example, in May 2017 when WannaCry struck the UK NHS (National Health Service).

You can't choose which of the three bases you cover; it's all or nothing and in the "any-2-from-3" choice, disaster beckons. It would be like trying to build then sit on a two-legged stool.

In boxing, resilience in simple terms means the ability to recover from a punch (normal recovery) or a knockdown (disaster recovery). However, it has connotations beyond just that, inasmuch as the boxer must prepare himself via tough training, a fight plan and coaching to avoid the knockdown and, should it happen, he should be fit enough to recover and re-join the fray quickly enough to beat the 10-second count. In our world, failing to beat the count means financial penalties.

When is an Outage Not an Outage?

This is a valid question to ask if you understand service level agreements (SLAs). SLAs specify what properties the service should offer aside from a "system availability clause." These requirements usually include response times, hours of service schedule (not the same as availability) at various points in the calendar, for example, high volume activity periods such as major holidays, product promotions, year-end processing and so on.

Many people think of a system outage as complete failure, a knockout in our earlier analogy. In reality, a system not performing as expected and defined in a Service Level Agreement (SLA) will often lead users to consider the system as "down" since it is not doing what it is supposed to do and impedes their work.

This leads to the concept of a logical outage (a forced standing count in boxing), where physically everything is in working order but the service provided is not acceptable for some reason. The reasons are many and vary depending on the stage the applications have reached.
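The distinction between a physical and a logical outage can be sketched as a check of one monitoring sample against SLA terms. This is a minimal illustration only: the names `Sla`, `Sample` and `classify`, and the idea of reducing an SLA to a response-time ceiling plus service hours, are assumptions made for the example and do not come from any particular product or from the full paper.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Sla:
    """Hypothetical, heavily simplified SLA terms."""
    max_response_ms: float   # assumed response-time ceiling
    service_start: time      # start of agreed hours of service
    service_end: time        # end of agreed hours of service


@dataclass(frozen=True)
class Sample:
    """One monitoring observation of the service."""
    at: time
    reachable: bool
    response_ms: float


def classify(sample: Sample, sla: Sla) -> str:
    """Classify a sample as ok, a physical outage or a logical outage."""
    # Outside the agreed service hours the SLA does not apply.
    if not (sla.service_start <= sample.at <= sla.service_end):
        return "out_of_hours"
    # The service cannot be reached at all: the "knockout".
    if not sample.reachable:
        return "physical_outage"
    # Everything is running, but not to the agreed standard.
    if sample.response_ms > sla.max_response_ms:
        return "logical_outage"
    return "ok"
```

A real SLA would carry more properties than this (for example, different targets for high-volume periods), so the sketch shows only the shape of the test, not a complete definition.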

Resilience Areas

Resilience in bare terms means the ability to recover from a knockdown, to use the boxing analogy once more. However, it has connotations beyond just that, inasmuch as the boxer must prepare himself by tough training and coaching to avoid the knockdown and, should it happen, he should be fit enough to recover, get to his feet and continue fighting. In the information technology (IT) scenario this involves, among other things:

■ "Fitness" through rigorous system design, implementation and monitoring.

■ Normal backup and recovery after outages or data loss.

■ Cybersecurity tools and techniques.

■ Disaster Recovery (DR) when the primary system(s) is totally unable to function for whatever reason and workload must be located and accessed from facilities — system and accommodation (often forgotten) — elsewhere.

■ Spanning the resilience ecosphere are the monitoring, management and analysis methods to turn data into information to support the resilience aims of a company and improve it. If you can't measure it, you can't manage it.

Figure 1 is a simple representation of resilience and the main thing to remember is that it is not a pick and choose exercise; you have to do them all to close the loop between the three contributing areas of resilience planning and recovery activities.


Figure 1: Resilience Components

Security (cybersecurity) is a new threat which the business world has to be aware of and take action on, not following the Mark Twain dictum: "Everybody is talking about the weather, nobody is doing anything about it."

The key factor is covering all the "resilience" bases at a level matching the business's needs. It is not a "choose any n from M" menu type of choice; it is all or nothing for optimum resilience.

To stretch a point a little, I think that resilience will be enhanced by recognizing the "trinity" aspect of the factors affecting it and by having the individual teams involved operate as one, even if only in virtual team mode. This needs some thought but a "war room" mentality might be appropriate.

The three areas considered in parallel (P) make for a more resilient system than different teams treating them in isolation as serial or siloed activities (S). Another downside of S is that it requires three sets of change management activities.

Conclusion

Like any major activity, the results of any resilience plan need review, with corrective action taken where necessary. This requires an environment where parameters relating to resilience are measurable, recorded, reviewed and acted upon. It is not simply a monitoring activity: monitoring is passive, whereas management is active and proactive.

Management = Monitoring + Analysis + Review + Action
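The equation can be read as a small pipeline. The sketch below is one possible reading, not a prescribed method: the function names, the use of a simple availability ratio as the measured parameter, and the 0.99 target in the usage note are all assumptions chosen for illustration.

```python
from typing import List, Optional


def analyse(acceptable: List[bool]) -> float:
    """Analysis: fraction of monitoring samples with acceptable service."""
    if not acceptable:
        raise ValueError("no monitoring samples recorded")
    return sum(acceptable) / len(acceptable)


def review(availability: float, target: float) -> bool:
    """Review: compare the measured figure with the agreed target."""
    return availability >= target


def act(meets_target: bool) -> Optional[str]:
    """Action: return a follow-up item when the target was missed."""
    return None if meets_target else "open corrective action"


def manage(acceptable: List[bool], target: float) -> Optional[str]:
    """Management = Monitoring (the samples) + Analysis + Review + Action."""
    return act(review(analyse(acceptable), target))
```

For example, `manage([True] * 98 + [False] * 2, 0.99)` yields a corrective action because the measured figure falls short of the assumed target, whereas monitoring alone would only have recorded the two bad samples.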

This is a big subject which few understand in size or complexity but it has to be tackled.

Resilience is hard. If you think that throwing suitable, trendy products at the resilience design is the answer, you are deluding yourself. As Sir Winston Churchill said, in paraphrase: "All I can offer is blood, sweat and tears."

Dr. Terry Critchley is an IT consultant and author who previously worked for IBM, Oracle and Sun Microsystems.