The Perils of Downtime in the Cloud
October 23, 2014

Cliff Moon
Boundary

The mantra for developers at Facebook for the longest time has been "move fast and break things". The idea behind this philosophy is that the stigma around screwing up and breaking production slows down feature development; remove the stigma from breakage and more agility will result. The cloud readily embodies this philosophy, since it is explicitly made of unreliable components. The challenge for the enterprise embracing the cloud is to build up the processes and resiliency necessary to build reliable systems from unreliable components. Otherwise, moving to the cloud will mean that your customers are the first people to notice when you are experiencing downtime.

So what changes are necessary to remove the costs of downtime in the cloud? First and foremost, what is needed is a move to a more resilient architecture. The health of the service as a whole cannot rely on any single node. This means no special nodes: everything gets installed onto multiple instances with active-active load balancing between identical services. Not only that, but any service with a dependency must be able to survive that dependency going away. Writing code that is resilient to the myriad failures that may happen in the cloud is an art unto itself. No one will be good at it to start. This is where process and culture modifications come in.

It turns out that if you want programmers to write code that behaves well in production, an effective way to achieve that is to make them responsible for the behavior of their code in production. The individual programmers go on pager rotation, and because they have to work side by side with the other people on rotation, they are held accountable for the code they write. It should never be an option to point to the failure of another service as the cause of your own service's failure. The writers of each discrete service should be encouraged to own their availability by measuring it separately from that of their dependencies. Techniques like serving stale data from cache, graceful degradation of ancillary features, and well-reasoned timeout settings are all useful for staying resilient while still depending on unreliable dependencies.
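As a rough illustration of those techniques, here is a minimal Python sketch that wraps a call to an unreliable dependency, passes it an explicit timeout, and falls back to the last known good value when the call fails. The class name `StaleCacheClient`, the `fetch` callable, and the status strings are all hypothetical, chosen only for this example; a real service would plug in its own client and cache.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleCacheClient:
    """Wraps a dependency call and serves the last good value if the call fails.

    `fetch` is a hypothetical callable (key, timeout_s) -> value that raises
    on timeout or error. It stands in for whatever client a service uses.
    """

    def __init__(
        self,
        fetch: Callable[[str, float], str],
        timeout_s: float = 0.5,
        fresh_for_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._timeout_s = timeout_s
        self._fresh_for_s = fresh_for_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[Optional[str], str]:
        """Return (value, status); status is 'fresh', 'live', 'stale', or 'degraded'."""
        now = self._clock()
        cached = self._cache.get(key)

        # A recent cache entry is served without touching the dependency.
        if cached is not None and now - cached[0] < self._fresh_for_s:
            return cached[1], "fresh"

        try:
            value = self._fetch(key, self._timeout_s)
        except Exception:
            # The dependency is down or slow: serve stale data if we have any,
            # otherwise degrade by returning no value instead of failing.
            if cached is not None:
                return cached[1], "stale"
            return None, "degraded"

        self._cache[key] = (now, value)
        return value, "live"
```

The caller decides what "degraded" means for its feature, for example hiding an ancillary widget rather than returning an error page. Counting the status values per request is also one simple way for a service to measure its own availability separately from that of its dependencies.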

If your developers are on pager rotation, then there should be something to page them about. This is where monitoring comes in. Monitoring alerts come in two basic flavors: noise and signal. Monitoring setups with too many alerts configured will tend to be noisy, which leads to alert fatigue.

A good rule of thumb for any alerts you may have set up is that they be: actionable, impacting, and imminent. By actionable, I mean that there is a clear set of steps for resolving the issue. An actionable alert would be to tell you that a service has gone down. Less actionable would be to tell you that latencies are up, since it isn't clear what, if anything, you could do about that.

Impacting means that without human intervention the underlying condition will either cause or continue to cause customer impact.

And imminent means that the alert requires immediate intervention to alleviate service disruption. An example of a non-imminent alert would be alerting that your SSL certificates were due to expire in a month. Impactful and actionable, absolutely. But it doesn't warrant getting out of bed in the middle of the night.
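The three criteria can be sketched as a simple routing rule. The following Python example is only an illustration under assumptions: the `Alert` fields mirror the three properties above, while the `route` function and its destinations ("page", "ticket", "dashboard") are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    actionable: bool  # is there a clear set of steps for resolving it?
    impacting: bool   # will customers be affected without human intervention?
    imminent: bool    # does it need intervention right now?


def route(alert: Alert) -> str:
    """Decide where an alert goes (hypothetical destinations)."""
    # Only alerts that meet all three criteria are worth waking someone up for.
    if alert.actionable and alert.impacting and alert.imminent:
        return "page"
    # Actionable and impacting, but not urgent: handle during working hours.
    if alert.actionable and alert.impacting:
        return "ticket"
    # Everything else is kept for context rather than sent to a human.
    return "dashboard"


if __name__ == "__main__":
    examples = [
        Alert("service-down", actionable=True, impacting=True, imminent=True),
        Alert("ssl-cert-expires-in-a-month", actionable=True, impacting=True, imminent=False),
        Alert("latency-up", actionable=False, impacting=True, imminent=True),
    ]
    for alert in examples:
        print(alert.name, "->", route(alert))
```

Under this rule the expiring SSL certificate becomes a ticket instead of a page, and the latency alert stays on a dashboard until someone can define clear steps for responding to it.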

At the end of the day, adopting the cloud alone isn't going to be the silver bullet that automatically injects agility into your team. The culture and structure of the team must be adapted to fit the tools and platforms they use in order to get the most out of them. Otherwise, you're going to be having a lot of downtime in the cloud.

Cliff Moon is CTO and Founder of Boundary.
