Lessons from the 2011 AWS EBS Outage: Things will fail. Will you?

Today is Day One here at Cloud City Ventures, and I wanted to dig into yesterday’s outage in Amazon’s US-EAST-1 region. The outage impacted well-known applications and sites but has now been resolved by Amazon Web Services (AWS). The stakes are higher for AWS and other public cloud providers now that what is likely a localized S3 issue can become the main red headline on Drudge within minutes. Going from roughly $1B/year in 2011 to a ~$15B/year run rate (Cloud City Ventures estimate of $3.75B in Q1 revenue) globally has certainly increased the speed at which outages hit mainstream media.

To set the context, I wanted to share my experience managing the largest AWS outage in history from a client-facing perspective: the EBS outage of April 2011. I learned a lot from that episode, and I think many of the same lessons for clients and public cloud providers still apply to this recent event. I also believe this outage will play out similarly to those in the past and may even drive higher spending as AWS clients re-evaluate their redundant architectures. It also won’t be the last public cloud outage.

On April 21, 2011, our AWS sales team began receiving reports from clients that something was wrong with their EC2 infrastructure in US-EAST, followed by news reports that affected client sites were down. AWS clients in the affected Availability Zones (AZs) were impacted for up to a week, not hours like yesterday. US-EAST was (and is) the most populated region at AWS (see the primer on AWS geos), so the 2011 outage, caused by Elastic Block Store networking errors, affected a good number of clients.

In 2011, our sales team spent the next week in AWS conference rooms helping tens of thousands of clients recover their business applications from “stuck” storage volumes. Yesterday’s outage has been linked to S3 (Simple Storage Service), similar to the 2011 outage in that a storage issue impacted multiple AWS services and took down entire client application stacks.

Not every client impacted by the 2011 outage went down. Those that had invested in High Availability (HA) design on AWS were already running their applications across multiple AZs or regions (a quick way to check that spread is sketched below), while others that had their entire stack in the affected AZs experienced complete downtime. While yesterday’s incident will no doubt have a different root cause, it is clear that AWS services beyond S3 were impacted, and sites and apps relying on those services were in trouble, including AWS’s own Service Health Dashboard.
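To make the AZ point concrete, here is a minimal sketch of that kind of check using boto3. It assumes credentials are already configured and uses a hypothetical `app=web` tag to identify the fleet; substitute however you actually label your instances.

```python
# Minimal sketch: verify an EC2 fleet spans more than one Availability Zone.
# Assumes boto3 credentials are configured; the "app=web" tag is a
# hypothetical placeholder for however you identify your fleet.
from collections import Counter

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instances(
    Filters=[
        {"Name": "tag:app", "Values": ["web"]},  # placeholder tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

# Count running instances per AZ.
az_counts = Counter(
    instance["Placement"]["AvailabilityZone"]
    for reservation in response["Reservations"]
    for instance in reservation["Instances"]
)

print("Running instances per AZ:", dict(az_counts))
if len(az_counts) < 2:
    print("Warning: the entire fleet sits in a single AZ; a localized failure takes all of it down.")
```

A fleet that passes this check is only the starting point, of course; data stores, queues, and DNS failover all need the same treatment.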

Following the April 2011 EBS outage, we conducted hundreds of 1:1 and group conference calls with our clients to help them understand how to architect properly so that a localized event like this would not affect their business. The same lessons need to be applied by every public cloud client that wants to achieve high availability. I expect strong marketing from AWS pushing proper HA design on its platform over the next couple of weeks.
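Since yesterday’s trouble centered on S3, one of those lessons is keeping a copy of critical objects outside the affected region. Below is a minimal sketch of enabling cross-region replication with boto3; the bucket names and IAM role ARN are illustrative placeholders, and both buckets need versioning enabled (which the sketch also sets).

```python
# Minimal sketch: replicate objects from a US-EAST-1 bucket to a bucket in
# another region so a regional S3 event does not leave you without your data.
# Bucket names and the replication role ARN are illustrative placeholders.
import boto3

# One client per region; S3 calls should target the bucket's own region.
s3_east = boto3.client("s3", region_name="us-east-1")
s3_west = boto3.client("s3", region_name="us-west-2")

# Cross-region replication requires versioning on both source and destination.
s3_east.put_bucket_versioning(
    Bucket="my-app-assets-us-east-1",
    VersioningConfiguration={"Status": "Enabled"},
)
s3_west.put_bucket_versioning(
    Bucket="my-app-assets-us-west-2",
    VersioningConfiguration={"Status": "Enabled"},
)

# Replication rule on the source bucket; the IAM role must allow S3 to read
# the source bucket and write to the destination bucket.
s3_east.put_bucket_replication(
    Bucket="my-app-assets-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder ARN
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",  # empty prefix = replicate all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::my-app-assets-us-west-2"},
            }
        ],
    },
)
```

Replication covers the data; the application still needs a way to read from the secondary bucket (or region) when the primary is unavailable.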

Does avoiding a few hours of downtime a year justify the investment in HA for the sites that went offline yesterday? That is the $1M+ question for many AWS clients and their users.


AWS also published a post-mortem of the 2011 event that detailed the cause and the fix in depth. I expect them to share a similar post-mortem here, and providing the same level of detail will be very important in quickly rebuilding confidence in the platform. Interested in continuing this discussion or going deeper? Just contact us and schedule some time.




Matthew Scott