
Amazon EBS: When will Amazon AWS eliminate its single point of failure?

Once again, Amazon Web Services ground to a halt in the US-EAST-1 region in Northern Virginia: first on 19 August 2013, an incident that apparently also dragged Amazon.com down, and most recently on 25 August 2013. Storms are said to have been the cause. As usual, the outages caught the familiar suspects, above all Instagram and Reddit. Instagram in particular seems to rely heavily on a cloud architecture that blows up in its face regularly whenever lightning strikes in Northern Virginia. But Amazon AWS, too, should start changing its infrastructure, because it cannot go on like this.

US-EAST-1: Old, cheap and fragile

The Amazon region US-EAST-1 is the oldest and most widely used region in Amazon's cloud computing infrastructure. On the one hand this is related to cost: prices there are considerably lower than in most other regions, although the Oregon region has since been adjusted to match it in price. On the other hand, because of its age, the region also hosts many long-standing customers whose system architectures are probably not optimized for the cloud.

All eggs in one basket

The mistake most users make is to rely on a single region, in this case US-EAST-1. If that region fails, their own service naturally becomes unreachable as well. The same holds for every one of Amazon's regions worldwide. To avoid this situation, a multi-region approach should be chosen, in which the application is distributed across several regions in a scalable and highly available way.
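One way to implement such a setup, offered here purely as an illustration and not as anything the sources cited in this article prescribe, is DNS-level failover with Amazon Route 53, so that the switch to another region is decided outside the troubled region. The following sketch uses boto3; every identifier in it (hosted zone ID, health check ID, record values) is hypothetical.

# Minimal sketch: Route 53 DNS failover between two regions.
# All identifiers (hosted zone, health check, record values) are hypothetical.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "CNAME",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "primary-elb.us-east-1.example.com"}],
                # Health check that probes the primary region from outside it.
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "CNAME",
                "SetIdentifier": "secondary-eu-west-1",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "standby-elb.eu-west-1.example.com"}],
            },
        },
    ]},
)

The important design point is that the failover decision is made at the DNS layer, outside the affected region, rather than by a component that sits inside it.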

Amazon EBS: Single Point of Failure

I already suggested last year that Amazon Web Services has a single point of failure. Below are the status messages from 25 August 2013 that underline my thesis. I first became aware of the issue through a blog post by the Amazon AWS customer awe.sm, who describe their experiences in the Amazon cloud. The points in bold are crucial.

EC2 (N. Virginia)
[RESOLVED] Degraded performance for some EBS Volumes
1:22 PM PDT We are investigating degraded performance for some volumes in a single AZ in the US-EAST-1 Region
1:29 PM PDT We are investigating degraded performance for some EBS volumes and elevated EBS-related API and EBS-backed instance launch errors in a single AZ in the US-EAST-1 Region.
2:21 PM PDT We have identified and fixed the root cause of the performance issue. EBS backed instance launches are now operating normally. Most previously impacted volumes are now operating normally and we will continue to work on instances and volumes that are still experiencing degraded performance.
3:23 PM PDT From approximately 12:51 PM PDT to 1:42 PM PDT network packet loss caused elevated EBS-related API error rates in a single AZ, a small number of EBS volumes in that AZ to experience degraded performance, and a small number of EC2 instances to become unreachable due to packet loss in a single AZ in the US-EAST-1 Region. The root cause was a “grey” partial failure with a networking device that caused a portion of the AZ to experience packet loss. The network issue was resolved and most volumes, instances, and API calls returned to normal. The networking device was removed from service and we are performing a forensic investigation to understand how it failed. We are continuing to work on a small number of instances and volumes that require additional maintenance before they return to normal performance.
5:58 PM PDT Normal performance has been restored for the last stragglers of EC2 instances and EBS volumes that required additional maintenance.

ELB (N. Virginia)
[RESOLVED] Connectivity Issues
1:40 PM PDT We are investigating connectivity issues for load balancers in a single availability zone in the US-EAST-1 Region.
2:45 PM PDT We have identified and fixed the root cause of the connectivity issue affecting load balancers in a single availability zone. The connectivity impact has been mitigated for load balancers with back-end instances in multiple availability zones. We continue to work on load balancers that are still seeing connectivity issues.
6:08 PM PDT At 12:51 PM PDT, a small number of load balancers in a single availability zone in the US-EAST-1 Region experienced connectivity issues. The root cause of the issue was resolved and is described in the EC2 post. All affected load balancers have now been recovered and the service is operating normally.

RDS (N. Virginia)
[RESOLVED] RDS connectivity issues in a single availability zone
1:39 PM PDT We are currently investigating connectivity issues to a small number of RDS database instances in a single availability zone in the US-EAST-1 Region
2:07 PM PDT We continue to work to restore connectivity and reduce latencies to a small number of RDS database instances that are impacted in a single availability zone in the US-EAST-1 Region
2:43 PM PDT We have identified and resolved the root cause of the connectivity and latency issues in the single availability zone in the US-EAST-1 region. We are continuing to recover the small number of instances still impacted.
3:31 PM PDT The majority of RDS instances in the single availability zone in the US-EAST-1 Region that were impacted by the prior connectivity and latency issues have recovered. We are continuing to recover a small number of remaining instances experiencing connectivity issues.
6:01 PM PDT At 12:51 PM PDT, a small number of RDS instances in a single availability zone within the US-EAST-1 Region experienced connectivity and latency issues. The root cause of the issue was resolved and is described in the EC2 post. By 2:19 PM PDT, most RDS instances had recovered. All instances are now recovered and the service is operating normally.

Amazon EBS is the basis of many other services

The points in bold are of crucial importance, because all of these services depend on one and the same underlying service: Amazon EBS. awe.sm came to the same conclusion in their analysis. As awe.sm determined, EBS is nine times out of ten the main cause of major outages at Amazon, just as in the outage above.

The services that depend on Amazon EBS include the Elastic Load Balancer (ELB), the Relational Database Service (RDS) and Elastic Beanstalk.

Q: What is the hardware configuration for Amazon RDS Standard storage?
Amazon RDS uses EBS volumes for database and log storage. Depending on the size of storage requested, Amazon RDS automatically stripes across multiple EBS volumes to enhance IOPS performance.

Source: http://aws.amazon.com/rds/faqs/

EBS Backed
EC2: If you select an EBS backed AMI
ELB: You must select an EBS backed AMI for EC2 host
RDS
Elastic Beanstalk
Elastic MapReduce

Source: Which AWS features are EBS backed?

Load balancing across availability zones is excellent advice in principle, but still succumbs to the problem above in the instance of EBS unavailability: ELB instances are also backed by Amazon’s EBS infrastructure.

Source: Comment – How to work around Amazon EC2 outages

As you can see, quite a few services depend on Amazon EBS. Conversely, this means that if EBS fails, these services are no longer available either. The case of the Amazon Elastic Load Balancer (ELB) is particularly unfortunate, since it is the component responsible for directing traffic in the event of a failure or heavy load. So if Amazon EBS fails and traffic is supposed to be redirected to another region, this does not work, because the load balancer itself also depends on EBS.
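For illustration only: the EBS dependency of one's own EC2 fleet can at least be made visible with a quick inventory. The sketch below, again a boto3 example with no assumptions beyond the standard EC2 API, counts instances with EBS-backed root volumes per availability zone. What it cannot show is precisely the point made above, namely the EBS dependency hidden inside managed services such as ELB and RDS.

# Minimal sketch: count EC2 instances with an EBS-backed root device per
# availability zone, to gauge the blast radius of an EBS problem.
import boto3
from collections import defaultdict

ec2 = boto3.client("ec2", region_name="us-east-1")

ebs_backed_by_az = defaultdict(list)
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            if instance.get("RootDeviceType") == "ebs":
                az = instance["Placement"]["AvailabilityZone"]
                ebs_backed_by_az[az].append(instance["InstanceId"])

for az, instance_ids in sorted(ebs_backed_by_az.items()):
    print(f"{az}: {len(instance_ids)} EBS-backed instances")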

I may be wrong. However, looking at past failures, the evidence indicates that Amazon EBS is the main source of errors in the Amazon cloud.

One may therefore be allowed to ask whether a market leader can afford to have the same component of its infrastructure, a component that must in principle be regarded as a single point of failure, blow up in its face again and again. And whether an infrastructure-as-a-service (IaaS) provider that has experienced more outages than any other provider in the market can, from this point of view, still be called a leader. Even though I frequently preach that users of a public IaaS must take responsibility for scalability and high availability themselves, the provider has a duty to ensure that the underlying infrastructure is reliable.

By Rene Buest

Rene Buest is a Gartner Analyst covering Infrastructure Services & Digital Operations. Prior to that he was Director of Technology Research at Arago, Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and a member of the worldwide Gigaom Research Analyst Network. Rene is considered the top cloud computing analyst in Germany and one of the top analysts in this area worldwide. In addition, he is one of the world's top cloud computing influencers and belongs to the top 100 cloud computing experts on Twitter and Google+. Since the mid-1990s he has focused on the strategic use of information technology in businesses and the impact of IT on our society, as well as on disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications such as Computerwoche, CIO Magazin, LANline and Silicon.de, and is cited in German and international media, including the New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore, Rene Buest is a speaker at and participant in expert panels. He is the founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as an M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.