Lessons Learned from Cloud Outages
July 12, 2012

Audrey Rasmussen


The recent cloud service outages at Amazon and Salesforce.com grabbed the industry’s attention because of their broad customer impact and the high visibility of both cloud service providers. The reputations of Amazon and Salesforce.com are a bit tarnished as a result, but we must be careful not to generalize this into an indictment of cloud computing (more on this later). Instead, the real value of unfortunate events like these is the opportunity they give us to learn.

On a positive note, Amazon has publicly stated that it plans to change some of its processes as a result of the outage. These "lessons learned" are relevant to those affected by the outages, and they serve as a "wake-up call" for those of us who were not directly affected.

At the macro level, these high-profile cloud outages exposed several issues. The first is the need to bring the cloud "hype bubble" back to reality. There has been an unnerving undercurrent in all of the cloud hype promoting the notion that cloud is the solution to all of our problems: "The cloud is faster, easier, cheaper, more scalable, resilient ..."

The truth is that some people are running to public clouds for the wrong reason: to escape or circumvent the problems of traditional IT. The public cloud runs on the same hardware that traditional IT uses. And just like traditional IT, it depends on software, it requires management, it uses processes, and more. So the fact of the matter is that there will be outages.

There is no pixie dust in cloud computing that prevents all outages, just smart planning that minimizes their impact. The same deep due diligence that internal IT organizations perform to ensure business continuity, security and resilience must also be applied when evaluating cloud infrastructures. (Just because it's out of sight and out of your control doesn't mean it should be out of mind.) And that is the first “lesson learned”: due diligence and planning for outages.

For starters, you need a plan that takes into consideration issues such as:

- What are my business risks if an outage happens? If the risks are high, you must plan to minimize the impact of outages. This may include paying for higher-value services such as business continuity/disaster recovery, or failover and processing across multiple locations. An outage at a cloud provider data center doesn't necessarily mean your application is also down.

- How resilient are the architecture and processes of my service provider? (Preferably, answer this before selecting a provider.) Not all clouds are alike. What's under the covers of the cloud service? What safeguards, redundancies, failover and processes does the provider have in place to minimize and quickly resolve outages?

- What other options do I need to put in place to keep my business running during a public cloud outage?

- What happens during and after an outage?

- Failure testing. Fine-tune your own processes and responses to failures by testing them.

Another point that comes out of the due diligence discussion above is the importance of IT intelligence in choosing cloud providers. There has been a lot of talk about "shadow IT" with regard to cloud: business users who use cloud services without involving their IT department. Not all cloud providers are alike.

For example, in the Amazon outage, there were other cloud providers in the same area whose services weren't affected by the power outage. Selecting a cloud service provider with proper due diligence requires IT expertise and knowledge, which is why business units should involve IT in their cloud initiatives.

Now, let's discuss why these outages should not be taken as an indictment of cloud computing in general. To put it in perspective, these outages happened to two service providers at individual points in time, out of the thousands of cloud service providers out there. Yes, they were high-profile cloud service providers with some high-profile customers. But given the complexity and all of the "moving parts" involved in any data center, including dependence on the continuity of the power required to run the hardware, outages will happen.

Cloud computing has the capability to minimize outages by shifting resources between servers in the same data center and to remote data centers. Much of the responsibility for cloud service resilience, scalability and performance falls to the cloud service provider, in the form of the architecture, processes and services that they deliver. But some of the responsibility also rests with the cloud customer in the services that they pay for, as well as what and how they run in the cloud. For example, most cloud providers offer customers the option to use multiple data center locations to diversify the risk associated with a data center location outage.
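As a rough illustration of the multiple-location option described above, the following Python sketch picks the first data center location that passes a health check. The location names and the health-check function are hypothetical stand-ins, not any provider's actual API; real failover also involves data replication and traffic routing, which this sketch leaves out.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    locations: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first location that passes its health check, or None."""
    for location in locations:
        if is_healthy(location):
            return location
    return None


if __name__ == "__main__":
    # Hypothetical location names; the preferred (primary) location is first.
    locations = ["dc-east", "dc-west"]
    # Simulate an outage at a single data center location.
    down = {"dc-east"}
    active = first_healthy(locations, lambda loc: loc not in down)
    print(active)  # prints "dc-west"
```

A customer running in only one location has nothing left to fall back on in this situation: the function returns None and the application is down.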

This data center location diversification is one of the major advantages of public cloud computing, but it costs more money, so some customers opt not to use it. For low-risk applications where outages are tolerable, that makes sense. But for high-risk applications, if business continuity options were not used, the fault rests with the cloud customer, not the cloud provider. The responsibility for balancing risk, services and costs rests with the cloud customer. The decision to use or not use high-value service options could spell the difference between hundreds of thousands of dollars in revenue lost due to an outage and business as usual during a single data center outage.
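One way to make this balancing act concrete is a back-of-the-envelope comparison of expected outage loss against the extra cost of a multi-location option. The Python sketch below is a simplified model, and every figure in it is hypothetical, chosen only to illustrate the arithmetic. The `residual_fraction` parameter is an assumption that stands for the share of outage loss that remains even with multiple locations.

```python
def expected_annual_outage_loss(outage_hours_per_year: float,
                                revenue_loss_per_hour: float) -> float:
    """Expected yearly revenue lost to outages in a single location."""
    return outage_hours_per_year * revenue_loss_per_hour


def multi_location_is_worth_it(outage_hours_per_year: float,
                               revenue_loss_per_hour: float,
                               extra_annual_cost: float,
                               residual_fraction: float = 0.0) -> bool:
    """Compare the total yearly cost with and without the multi-location option."""
    single = expected_annual_outage_loss(outage_hours_per_year,
                                         revenue_loss_per_hour)
    # With multiple locations, only a fraction of the loss remains,
    # but the customer pays the extra service cost.
    multi = single * residual_fraction + extra_annual_cost
    return multi < single


if __name__ == "__main__":
    # Hypothetical high-risk application: 8 outage hours/year at $25,000/hour.
    print(expected_annual_outage_loss(8, 25_000))                # 200000
    print(multi_location_is_worth_it(8, 25_000, 60_000, 0.1))    # True
    # Hypothetical low-risk application: same outage hours at $500/hour.
    print(multi_location_is_worth_it(8, 500, 60_000, 0.1))       # False
```

With these made-up numbers, the high-risk application saves money by paying for a second location, while the low-risk application does not, which matches the reasoning above.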

The advantages of public and private clouds are real. But these recent public cloud outages provide an opportunity to learn and to revisit how companies are planning their cloud initiatives, planning for outages and running their own clouds. It’s a good wake-up call; take advantage of it.

Audrey Rasmussen is a Partner and Principal Analyst at Ptak/Noel.
