The Perils of Downtime in the Cloud
October 23, 2014

Cliff Moon
Boundary

Share this

The mantra for developers at Facebook for the longest time has been "move fast and break things". The idea behind this philosophy being that the stigma around screwing up and breaking production slows down feature development, therefore if one removes the stigma from breakage, more agility will result. The cloud readily embodies this philosophy, since it is explicitly made of of unreliable components. The challenge for the enterprise embracing the cloud is to build up the processes and resiliency necessary to build reliable systems from unreliable components. Otherwise, moving to the cloud will mean that your customers are the first people to notice when you are experiencing downtime.

So what changes are necessary to remove the costs of downtime in the cloud? Foremost what is needed is a move to a more resilient architecture. The health of the service as a whole cannot rely on any single node. This means no special nodes: everything gets installed onto multiple instances with active-active load balancing between identical services. Not only that, but any service with a dependency must be able to survive that dependency going away. Writing code that is resilient to the myriad failures that may happen in the cloud is an art unto itself. No one will be good at it to start. This is where process and culture modifications come in.

It turns out that if you want programmers to write code that behaves well in production, an effective way to achieve that is to make them responsible for the behavior of their code in production. The individual programmers go on pager rotation and because they have to work side by side with the other people on rotation, they are held accountable for the code they write. It should never be an option to point to the failure of another service as the cause of your own service's failure. The writers of each discrete service should be encouraged to own their availability by measuring it separately from that of their dependencies. Techniques like serving stale data from cache, graceful degradation of ancillary features, and well reasoned timeout settings are all useful for being resilient while still depending on unreliable dependencies.

If your developers are on pager rotation, then there should be something to page them about. This is where monitoring comes in. Monitoring alerts come in two basic flavors: noise and signal. Monitoring setups with too many alerts configured will tend to be noisy, which leads to alert fatigue.

A good rule of thumb for any alerts you may have setup are that they be: actionable, impacting, and imminent. By actionable, I mean that there is a clear set of steps for resolving the issue. An actionable alert would be to tell you that a service has gone down. Less actionable would be to tell you that latencies are up, since it isn't clear what, if anything, you could do about that.

Impacting means that without human intervention the underlying condition will either cause or continue to cause customer impact.

And imminent means that the alert requires immediate intervention to alleviate service disruption. An example of a non-imminent alert would be alerting that your SSL certificates were due to expire in a month. Impactful and actionable, absolutely. But it doesn't warrant getting out of bed in the middle of the night.

At the end of the day, adopting the cloud alone isn't going to be the silver bullet that automatically injects agility into your team. The culture and structure of the team must be adapted to fit the tools and platforms they use in order to get the most out of them. Otherwise, you're going to be having a lot of downtime in the cloud.

Cliff Moon is CTO and Founder of Boundary.

Share this

The Latest

July 19, 2018

According to a recent survey, critical IT incidents cost the average organization upwards of $6 million per year. This infographic outlines 4 easy steps to automate incident management, reducing downtime and costs to organizations ...

July 17, 2018

The essential value resulting from data-driven processes has become progressively linked with analytics. Once considered a desired complement to intuitive decision-making, analytics has developed into a main focus of mission-critical applications across industries for any number of use cases ...

July 16, 2018

The question of SaaS-based technology over the past decade has quickly changed from "should we?" to "how soon can we?" even for the most customized and regulated of industries. This macro move toward SaaS has also encouraged a series of IT "best practices" that have potential impacts on the employee digital experience, organizational risk and ultimately, productivity ...

July 11, 2018

Optimization means improving the performance of your human and technology resources while keeping a watchful eye. To accomplish this, we must have clear, crisp visibility into the metrics relevant to the delivery of workspace applications to your end users and to the devices – the endpoints – they use to be productive ...

July 09, 2018

As tech headlines flash across my email, the CMDB, and its federated equivalent, the CMS, are almost never mentioned. And yet when I do research, dialog with IT, or support our consulting team, the CMDB/CMS many times still remains paramount ...

June 28, 2018

Given the size and complexity of today's IT networks it can be almost impossible to detect just when and where a security breach or network failure might occur. It's critical, therefore, that businesses have complete visibility over their IT networks, and any applications and services that run on those networks, in order to protect their customers' information, assure uninterrupted service delivery and, of course, comply with the GDPR ...

June 27, 2018

A new breed of solution has been born that simultaneously provides the precision of packet-based analytics with the speed of flow-based monitoring (at a reasonable cost). Here are more reasons to use these new NPM/APM analytics solutions ...

June 26, 2018

A new breed of solution has been born that simultaneously provides the precision of packet-based analytics with the speed of flow-based monitoring (at a reasonable cost). Here are 6 reasons to use these new NPM/APM analytics solutions ...

June 21, 2018

There’s no doubt that digital innovations are transforming industries, and business leaders are left with little or no choice – either embrace digital processes or suffer the consequences and get left behind ...

June 20, 2018

Looking ahead to the rest of 2018 and beyond, it seems like many of the trends that shaped 2017 are set to continue, with the key difference being in how they evolve and shift as they become mainstream. Five key factors defining the progression of the digital transformation movement are ...