Take two aspirin and call me in the morning (when you’re back up). Meantime, maybe it’s time for some preventive medicine?
On October 23rd, Amazon suffered another notable and very public outage — one of several major service disruptions in 2012.
What’s interesting about these incidents is not that they happen. The press has some fun covering them, of course, but outages are not shocking. Like most data centers, Amazon doesn’t run with perfect predictability. The Cloud isn’t a panacea; applications must still be fault tolerant and fail gracefully when key components suddenly perform poorly. What is intriguing is that it’s possible, with an early warning system, to predict and even deflect such outages from harming your business and customer relationships.
Here’s a quick recap of Amazon’s latest woes:
The incident started small, when EBS volumes in a single Availability Zone in the US-East-1 region experienced degraded performance. It quickly developed into a more serious degradation, and it took nearly 12 hours before all systems were back up and operating normally.
Many of our customers have some portion of their infrastructure deployed in AWS. Looking at the aggregate data coming back from our customers, we can observe and measure the health of the Amazon infrastructure. We did, indeed, observe some interesting behavior.
Roughly two hours before Amazon first announced there were problems, we saw a significant drop in traffic from our EC2 customers, by nearly 27 percent. Next, at 10:40:21, Amazon posted a message to their Service Health Dashboard indicating degraded EBS performance in a single Availability Zone. We could see the Amazon alert posted in our dashboard, and it coincided with the lowest amount of data streaming from AWS for the day. Compared with the traffic norms for typical Mondays in other weeks, this was definitely abnormal behavior.
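To make that kind of comparison concrete, here is a minimal sketch of checking current traffic against a same-weekday baseline. It is an illustration only, not a description of how Boundary computes its numbers: the function names, the 20 percent threshold, and the sample figures are all assumptions made up for the example.

```python
from statistics import median


def traffic_drop(current, baseline_samples):
    """Return the fractional drop of `current` below the median baseline.

    `baseline_samples` holds traffic readings taken at the same time of day
    on comparable days (for example, previous Mondays). A reading at or
    above the baseline yields 0.0.
    """
    baseline = median(baseline_samples)
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def is_abnormal(current, baseline_samples, threshold=0.20):
    """Flag a reading whose drop meets or exceeds `threshold` (assumed 20%)."""
    return traffic_drop(current, baseline_samples) >= threshold


# Hypothetical readings, in arbitrary units; not real measurements.
previous_mondays = [1000.0, 980.0, 1020.0, 1010.0]
today = 735.0

drop = traffic_drop(today, previous_mondays)
alert = is_abnormal(today, previous_mondays)
```

Using the median rather than the mean keeps a single unusual week from skewing the baseline. A real system would also account for time of day and seasonal trends.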
What if Amazon could have notified its customers of the potential impending doom an hour or two ahead of time? Early warning systems that notify IT managers of developing problems in their Cloud environments allow for preventative measures to head off downtime.
When emergent anomalous behavior is detected, operations staff can contact their cloud provider to investigate the issue and divert traffic to secondary components. Early warning systems can be as simple as tools that closely monitor application response times to track latency. More sophisticated options include tools that analyze the behavior between Cloud instances and external networks, whether private data centers or the public Internet, to derive insights into how the underlying Cloud infrastructure is performing.
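The simplest option described above, watching application response times, might look something like the following sketch. The class name, window size, and alert factor are hypothetical choices for illustration. A production monitor would feed in real measurements and tune these values.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Flag response times that spike well above a rolling average."""

    def __init__(self, window=5, factor=2.0):
        # Keep only the most recent `window` normal samples.
        self.samples = deque(maxlen=window)
        # A sample is anomalous if it exceeds factor x the rolling mean.
        self.factor = factor

    def observe(self, latency_ms):
        """Record a sample and return True if it looks anomalous.

        No alerts are raised until the window is full. Anomalous samples
        are left out of the window so they do not inflate the baseline.
        """
        window_full = len(self.samples) == self.samples.maxlen
        alert = window_full and latency_ms > self.factor * mean(self.samples)
        if not alert:
            self.samples.append(latency_ms)
        return alert


# Hypothetical latencies in milliseconds; the last one is a spike.
monitor = LatencyMonitor(window=3, factor=2.0)
results = [monitor.observe(ms) for ms in (100.0, 110.0, 90.0, 120.0, 400.0)]
```

When `observe` returns `True`, operations staff could be paged to investigate or to divert traffic to secondary components before customers notice.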
The reality that all companies face in leveraging the power of the Cloud is that they don’t control the infrastructure. In fact, in the public Cloud, companies have no visibility into the underlying infrastructure they are dependent on, beyond the system-level metrics they get from the Cloud instance itself.
Outages aren’t just an Amazon problem, but a problem for any company running a data center. If you have any or all of your IT infrastructure with a third-party service provider, whether it’s a Cloud provider or data center hosting firm, you need an early warning system in place.
Gary Read is CEO of Boundary.