Hard Lessons of Public Cloud: Designing for When AWS Goes Down
March 07, 2017

Eric Wright
Turbonomic

Share this

So, your site is down because AWS S3 went away. Too soon? It's never too soon to talk about why the responsibility for designing resilient infrastructure belongs in your camp. It's like when Smokey the Bear used to say that "only you can prevent forest fires." The difference is that it's Jeff Bezos saying it this time.

We have some real insight into what design for cloud resiliency really means thanks to a chat that I had recently.

Cloud Goes Down, so Design for It

There is no special text in the terms and conditions. These are hard facts. AWS designs its infrastructure to be as resilient as possible, but clearly tells you that you should design with the intention of surviving partial service outages. It isn't that AWS plans on being down a lot, but they have been hit by specific DDoS attacks, and also have had to reboot EC2 hosts in order to patch for security vulnerabilities.

At the time I was writing this, AWS S3 was fighting its way back to life in the US-east-1 Region. This means that there were multiple Availability Zones in the throes of recovery, and that potentially hundreds of thousands of web sites, and applications were experiencing issues retrieving objects from the widely used object storage platform.

So, how do we do this better? Let's ask someone who does design and see how the developers think about things. With that, I wanted to share a great discussion that I had with former Disney lead architect and current Principal Software Architect at Turbonomic, Steve Haines.

Q&A: Understanding the Developer's Reaction to the AWS Outage

EW: What does it mean to think about designing across regions inside the public cloud?

SH: Designing an application to run across multiple AWS regions is not a trivial task. While you can deploy stateless services or micro-services to multiple regions and then configure Route53 (Amazon's DNS Service) to point to Elastic Load Balancers (ELBs) in each region, that doesn't completely solve the problem.

First, it's crucial to consider the cost of redundancy. How many regions and how many availability zones (AZ) in each region do we want to deploy to? From historical outages, you're probably safe with two regions, but you do not want to keep a full copy of your application deployed in another region just for disaster recovery: you want to use it and distribute workloads across those regions!

For some use cases this will be easy, but for others you will need to design your application so that it is close to the resources it needs to access. If you design your application with failure in mind and to run in multiple regions then you can manage the cost because both regions will be running your workloads.

EW: That seems to be a bit of the cost of doing business for design and resiliency, but what is the impact below the presentation layers? It feels like that is the sort of "low hanging fruit" as we know it, but there is much more to the application architecture than that, right?

SH: Exactly! That leads to the next challenge: resources, such as databases and files. While AWS provides users multi-A to Z database replication free of charge for databases running behind RDS, users are still paying for storage, IOPS, etc. However, this model changes if a user wants to replicate across regions. For example, Oracle provides a product called GoldenGate for performing cross-region replication, which is a great tool but can significantly impact your IT budget.

Alternatively, you can consider one of Amazon's native offerings, Aurora, which supports cross- region replication out-of-the-box, but that needs to be a design decision you make when you're building or refactoring your application. And, if you store files in S3, be sure that you enable cross- region replication, it will cost you more, but it will ensure that files stored in one region will be available in the event of a regional outage.

EW: Sounds like we have already got some challenges in front of us with just porting our designs to cloud platforms, but when you're already leaning into the cloud as a first-class destination for your apps we have to already think about big outages. We do disaster recovery testing on-premises because that's something we can control. How do we do that type of testing out in the public cloud?

SH: Good question. It's important to remember that while designing an application to run in a cross-region capacity is one thing, having the confidence that it will work when you lose a region is another beast altogether!

This is where I'll defer to Netflix's practice of designing for failure and regularly testing failure scenarios. They have a "Simian Army" (https://github.com/Netflix/SimianArmy) that simulates various failure scenarios in production and ensures that everything continues to work. One of the members of the Simian Army is the Chaos Gorilla that regularly kills a region and ensures that Netflix continues to function, which is one of the reasons they were able to sustain the previous full region outage.

If you're serious about running across regions then you need to regularly validate that it works!

But maybe we should think bigger than cross-region – what if we could design across clouds for the ultimate protection?

EW: Thanks for the background and advice, Steve. Good food for thought for all of us in the IT industry. I'm sure there are a lot of people having this discussion in the coming weeks after the recent outage.

Eric Wright is Principal Solutions Engineer at Turbonomic.

Share this

The Latest

January 19, 2018

Confidence in satisfying and supporting core IT has diminished due in part to a strain on declining IT budgets and initiatives now progressing beyond implementation into production mode, according to TEKsystems' annual IT Forecast research ...

January 18, 2018

Making predictions is always a gamble. But given the way 2017 played out and the way 2018 is shaping up, odds are that certain technology trends will play a significant role in your IT department this year ...

January 17, 2018

With more than one-third of IT Professionals citing "moving faster" as their top goal for 2018, and an overwhelming 99 percent of IT and business decision makers noticing an increasing pace of change in today's connected world, it's clear that speed has become intrinsically linked to business success. For companies looking to compete in the digital economy, this pace of transformation is being driven by their customers and requires speedy software releases, agility through cloud services, and automation ...

January 16, 2018

Looking back on this year, we can see threads of what the future holds in enterprise networking. Specifically, taking a closer look at the biggest news and trends of this year, IT areas where businesses are investing and perspectives from the analyst community, as well as our own experiences, here are five network predictions for the coming year ...

January 12, 2018

As we enter 2018, businesses are busy anticipating what the new year will bring in terms of industry developments, growing trends, and hidden surprises. In 2017, the increased use of automation within testing teams (where Agile development boosted speed of release), led to QA becoming much more embedded within development teams than would have been the case a few years ago. As a result, proper software testing and monitoring assumes ever greater importance. The natural question is – what next? Here are some of the changes we believe will happen within our industry in 2018 ...

January 11, 2018

Application Performance Monitoring (APM) has become a must-have technology for IT organizations. In today’s era of digital transformation, distributed computing and cloud-native services, APM tools enable IT organizations to measure the real experience of users, trace business transactions to identify slowdowns and deliver the code-level visibility needed for optimizing the performance of applications. 2018 will see the requirements and expectations from APM solutions increase in the following ways ...

January 10, 2018

We don't often enough look back at the prior year’s predictions to see if they actually came to fruition. That is the purpose of this analysis. I have picked out a few key areas in APMdigest's 2017 Application Performance Management Predictions, and analyzed which predictions actually came true ...

January 09, 2018

Planning for a new year often includes predicting what’s going to happen. However, we don't often enough look back at the prior year’s predictions to see if they actually came to fruition. That is the purpose of this analysis. I have picked out a few key areas in APMdigest's 2017 Application Performance Management Predictions, and analyzed which predictions actually came true ...

January 08, 2018

The annual list of DevOps Predictions is now a DEVOPSdigest tradition. DevOps experts — analysts and consultants, users and the top vendors — offer predictions on how DevOps and related technologies will evolve and impact business in 2018 ...

January 05, 2018

Industry experts offer predictions on how Network Performance Management (NPM) and related technologies will evolve and impact business in 2018 ...