Categories
Analysis

Lessons learned: Amazon EBS is the error-prone backbone of AWS

In a blog article awe.sm writes about their own experiences using the Amazon Web Services. Beside the good things of the cloud infrastructure for them and other startups you can also derived from the context that Amazon EBS is the single point of failure in Amazon’s infrastructure.

Problems with Amazon EC2

awe.sm criticized Amazon EC2’s constraints regarding performance and reliability on which you absolutely have to pay attention as a customer and should incorporate into your own planning. The biggest problem awe.sm is seeing in AWS zone-concept. The Amazon Web Services consist on several worldwide distributed “regions”. Within this regions Amazon divided in so called “availability zones”. These are independent data center. awe.sm mentions three things they have learned from this concept so far.

Virtual hardware does not last as long as real hardware

awe.sm uses AWS for about 3 years. Within this period, the maximum duration of a virtual machine was about 200 days. The probability that a machine goes in the state “retired” after this period is very high. Furthermore Amazon’s “retirement process” is unpredictable. Sometimes you’ll notify ten days in advance that a virtual machine is going to be shut down. Sometimes the retirement notification email arrives 2 hours after the machine has already failed. While it is relatively simple to start a new virtual machine you must be aware that it is also necessary to early use an automated deployment solution.

Use more than one availability zone and plan redundancy across zones

awe.sm made ​​the experience that rather an entire availability zone fails than a single virtual machine. That means for the planning of failure scenarios, having a master and a slave in the same zone is as useless as having no slave at all. In case the master failures, it is possibly because the availability zone is not available.

Use multiple regions

The US-EAST region is the most famous and also oldest and cheapest of all AWS regions worldwide. However, this area is also very prone to error. Examples were in April 2011, March 2012 and June 2012 (twice). awe.sm therefore believes that the frequent regions-wide instability is due to the same reason: Amazon EBS.

Confidence in Amazon EBS is gone

The Amazon Elastic Block Store (EBS) is recommended by AWS to store all your data on it. This makes sense. If a virtual machine goes down the EBS volume can be connected to a new virtual machine without losing data. EBS volumes should also be used to save snapshots, backups of databases or operating systems on it. awe.sm sees some challenges in using EBS.

I/O rates of EBS volumes are bad

awe.sm made ​​the experiences that the I/O rates of EBS volumes in comparison to the local store on the virtual host (Ephemeral Storage) are significantly worse. Since EBS volumes are essentially network drives they also do not have a good performance. Meanwhile AWS provides Provisioned IOPS to give EBS volumes a higher performance. Because of the price they are too unattractive for awe.sm.

EBS fails at regional level and not per volume

awe.sm found two different types of behavior for EBS. Either all EBS volumes work or none! Two of the three AWS outages are due to problems with Amazon EBS. If your disaster recovery builds on top of moving EBS volumes around, but the downtime is due to an EBS failure, you have a problem. awe.sm had just been struggling with this problem many times.

The error status of EBS on Ubuntu is very serious

Since EBS volumes are disguised as block devices, this leads to problems in the Linux operating system. With it awe.sm made very bad experiences. A failing EBS volume causes an entire virtual machine to lock up, leaving it inaccessible and affecting even operations that don’t have any direct requirement of disk activity.

Many services of the Amazon Cloud rely on Amazon EBS

Because many other AWS services are built on EBS, they fail when EBS fails. These include e.g. Elastic Load Balancer (ELB), Relational Database Service (RDS) or Elastic Beanstalk. As awe.sm noticed EBS is nearly always the core of major outages at Amazon. If EBS fails and the traffic subsequently shall transfer to another region it’s not possible because the load balancer also runs on EBS. In addition, no new virtual machine can be started manually because the AWS Management Console also runs on EBS.

Comment

Reading the experiences of awe.sm I get the impression that Amazon do not live this so often propagandized “building blocks” as it actually should. Even when it is primary about the offering of various cloud services (be able to use them independently), why let they depend the majority of these services from a single service (EBS) and with that create a single point of failure?