The Perils of Downtime in the Cloud
October 23, 2014

Cliff Moon
Boundary

Share this

The mantra for developers at Facebook for the longest time has been "move fast and break things". The idea behind this philosophy being that the stigma around screwing up and breaking production slows down feature development, therefore if one removes the stigma from breakage, more agility will result. The cloud readily embodies this philosophy, since it is explicitly made of of unreliable components. The challenge for the enterprise embracing the cloud is to build up the processes and resiliency necessary to build reliable systems from unreliable components. Otherwise, moving to the cloud will mean that your customers are the first people to notice when you are experiencing downtime.

So what changes are necessary to remove the costs of downtime in the cloud? Foremost what is needed is a move to a more resilient architecture. The health of the service as a whole cannot rely on any single node. This means no special nodes: everything gets installed onto multiple instances with active-active load balancing between identical services. Not only that, but any service with a dependency must be able to survive that dependency going away. Writing code that is resilient to the myriad failures that may happen in the cloud is an art unto itself. No one will be good at it to start. This is where process and culture modifications come in.

It turns out that if you want programmers to write code that behaves well in production, an effective way to achieve that is to make them responsible for the behavior of their code in production. The individual programmers go on pager rotation and because they have to work side by side with the other people on rotation, they are held accountable for the code they write. It should never be an option to point to the failure of another service as the cause of your own service's failure. The writers of each discrete service should be encouraged to own their availability by measuring it separately from that of their dependencies. Techniques like serving stale data from cache, graceful degradation of ancillary features, and well reasoned timeout settings are all useful for being resilient while still depending on unreliable dependencies.

If your developers are on pager rotation, then there should be something to page them about. This is where monitoring comes in. Monitoring alerts come in two basic flavors: noise and signal. Monitoring setups with too many alerts configured will tend to be noisy, which leads to alert fatigue.

A good rule of thumb for any alerts you may have setup are that they be: actionable, impacting, and imminent. By actionable, I mean that there is a clear set of steps for resolving the issue. An actionable alert would be to tell you that a service has gone down. Less actionable would be to tell you that latencies are up, since it isn't clear what, if anything, you could do about that.

Impacting means that without human intervention the underlying condition will either cause or continue to cause customer impact.

And imminent means that the alert requires immediate intervention to alleviate service disruption. An example of a non-imminent alert would be alerting that your SSL certificates were due to expire in a month. Impactful and actionable, absolutely. But it doesn't warrant getting out of bed in the middle of the night.

At the end of the day, adopting the cloud alone isn't going to be the silver bullet that automatically injects agility into your team. The culture and structure of the team must be adapted to fit the tools and platforms they use in order to get the most out of them. Otherwise, you're going to be having a lot of downtime in the cloud.

Cliff Moon is CTO and Founder of Boundary.

Share this

The Latest

October 16, 2017
Hurricane season is in full swing. With the latest incoming cases of mega-storms devastating the Southeastern shoreline, communities are struggling to restore daily normalcy. People have been stepping up and showing remarkable strength and leadership in helping those affected. However, there is another area that we need to remember in these trying times – and that is businesses continuity ...
October 12, 2017

Gartner highlighted the top strategic technology trends that will impact most organizations in 2018. The next trends focus on blending the digital and physical worlds to create an immersive, digitally enhanced environment. The last three refer to exploiting connections between an expanding set of people and businesses, as well as devices, content and services to deliver digital business outcomes ...

October 11, 2017

Gartner highlighted the top strategic technology trends that will impact most organizations in 2018. The first three strategic technology trends explore how artificial intelligence (AI) and machine learning are seeping into virtually everything and represent a major battleground for technology providers over the next five years ...

October 10, 2017
This is the sixth in my series of blogs inspired by EMA's AIA buyer's guide — directed at helping IT invest in Advanced IT Analytics (AIA), what the industry more commonly calls "Operational Analytics." In this blog, I examine scenario-related shopping cart objectives for AIA. At EMA, we evaluated seven unique scenarios relevant to AIA adoptions. Our scenarios included agile/DevOps, Integrated security, change impact awareness, capacity optimization, business impact, business alignment and unifying IT ...
October 06, 2017

In the Riverbed Future of Networking Global Survey, more than half of the respondents acknowledged that achieving operational agility is critical to the success of a modern enterprise, and next-generation networks as well as the technology to support them are key to reaching this goal ...

October 05, 2017

Legacy infrastructures are holding back their cloud and digital strategies, according to the Riverbed Future of Networking Global Survey 2017. Nearly all survey respondents agree that legacy network infrastructure will have difficulty keeping pace with the changing demands of the cloud and hybrid networks ...

October 04, 2017

Digital disruptors are emerging in all industries, and the need for CIOs to embrace digital transformation is urgent, according to Gartner ...

October 02, 2017

Environments indicate "where" the AIA solutions we investigated can be applied. All 13 of the solutions we investigated support cloud for performance, core infrastructure, and application performance and availability. Mainframe had the support of six of our respondents, and IoT and cloud for change and capacity were not yet prime areas of focus for most of the vendors in our AIA buyer's guide ...

September 29, 2017

Cost, overhead, and time to value are often key challenges in adopting AIA solutions. In the past, these factors have often been especially onerous. But we saw strong levels of improvement among many vendors, and surprising areas of innovation among others ...

September 28, 2017
Most senior executives recognize that unified communications and collaboration (UC) are integral applications on the digital transformation path. As a result, many companies are in the process of replacing legacy voice and video infrastructure and disparate messaging and collaboration tools with next-generation UC systems, including cloud-based unified communication as a service (UCaaS). With UC, companies can accelerate time-to-revenue, improve productivity and reduce capex and opex – the three pillars of return on investment (ROI) that drive corporate strategy ...