
The Windows Azure Storage outage shows: Winner's crowns have no meaning

The global outage of Windows Azure Storage, caused by an expired SSL certificate, has once again made big waves on Twitter. It carries an ironic aftertaste because Windows Azure was only recently named "Leader in Cloud Storage" by Nasuni, standing out particularly for its performance, availability and low write-error rates.

Back to business

Of course, the test result was rightly celebrated on the Windows Azure team's blog (DE).

“In a study by the Nasuni Corporation, various cloud storage providers such as Microsoft, Amazon, HP, Rackspace and Google were compared. The result: Windows Azure took the winner's crown in almost all categories (including performance, availability and errors) and is clearly positioned as the “Leader in Cloud Storage”. Here is an infographic with the main findings; the entire report is here.”

However, the severe outage, which was felt around the world, brought the Azure team back down to earth with a bump. They should get back to the daily business and make sure that an EXPIRED SSL CERTIFICATE can never again lead to such a failure. The incident puts the test result into perspective and shows that winner's crowns have no meaning at all. It is astonishing that it is always avoidable errors that lead to such catastrophic outages. Another example is the leap-year problem of 2012.
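An expiring certificate is exactly the kind of thing even the simplest monitoring can catch. As a small illustration (my own sketch, not something from the Azure post), a few lines of Python report how many days a server certificate has left; the host name below is just a placeholder:

```python
# Sketch: report how many days remain until a server certificate expires.
# Run regularly against every public endpoint, a check like this would
# flag the certificate weeks before it becomes a problem.
import socket
import ssl
import time

def days_until_expiry(host, port=443):
    context = ssl.create_default_context()
    with socket.create_connection((host, port)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is a string like 'Feb 22 12:00:00 2014 GMT'
    expires = ssl.cert_time_to_seconds(cert['notAfter'])
    return int((expires - time.time()) // 86400)

# 'storage.example.com' is a placeholder host, not a real endpoint
print(days_until_expiry('storage.example.com'))
```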


Amazon Web Services suffered a 20-hour outage over Christmas

After a rather bumpy 2012 with several heavy outages, the cloud infrastructure of Amazon Web Services experienced problems again over Christmas. During a 20-hour outage several big customers were affected, including Netflix and Heroku. This time the main problem was Amazon's Elastic Load Balancer (ELB).

Region US-East-1 is a very big problem

This outage is the latest in a series of catastrophic failures in Amazon's US-East-1 region. It is the oldest and most popular region in Amazon's cloud computing infrastructure. Yet another outage precisely in US-East-1 raises new questions about the stability of this region and about what Amazon has actually learned and improved from past outages. Amazon customer awe.sm had recently criticized the Amazon cloud, and especially Amazon EBS (Amazon Elastic Block Store) and the services that depend on it.

Amazon Elastic Load Balancer (ELB)

Besides the Amazon Elastic Beanstalk API, the outage mainly affected the Amazon Elastic Load Balancer (ELB). Amazon ELB is one of the important services for building a scalable and highly available infrastructure in the Amazon cloud. With ELB, users can distribute load and capacity across different availability zones (Amazon's independent data centers) to maintain availability when problems occur in one data center.
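To make the cross-zone idea concrete, here is a minimal sketch using the classic boto library (boto 2.x); the load balancer name, zones and instance IDs are made-up placeholders, not details from the outage:

```python
# Sketch: create an ELB that spans two availability zones and register
# instances from both zones behind it, so traffic keeps flowing if one
# zone has problems. All names and IDs are illustrative placeholders.
import boto.ec2.elb

elb_conn = boto.ec2.elb.connect_to_region('us-east-1')

# Listen on port 80 and forward HTTP traffic to port 80 on the instances
lb = elb_conn.create_load_balancer(
    name='web-lb',
    zones=['us-east-1a', 'us-east-1b'],
    listeners=[(80, 80, 'http')]
)

# Register instances running in different availability zones
lb.register_instances(['i-0aaaaaaa', 'i-0bbbbbbb'])
```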

Nevertheless, both Amazon Elastic Beanstalk and Amazon ELB rely on Amazon EBS, which is known as the “error-prone backbone of the Amazon Web Services”.


Lessons learned: Amazon EBS is the error-prone backbone of AWS

In a blog article, awe.sm writes about its own experiences with Amazon Web Services. Besides the benefits the cloud infrastructure offers them and other startups, one can also conclude from the article that Amazon EBS is the single point of failure in Amazon's infrastructure.

Problems with Amazon EC2

awe.sm criticizes Amazon EC2's constraints regarding performance and reliability, which customers absolutely have to pay attention to and incorporate into their own planning. awe.sm sees the biggest problem in AWS's zone concept. Amazon Web Services consists of several regions distributed around the world. Within these regions Amazon divides its infrastructure into so-called "availability zones", which are independent data centers. awe.sm mentions three things it has learned from this concept so far.
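For readers unfamiliar with the concept, a quick boto 2.x sketch shows the hierarchy: you connect to a region and list its availability zones (the zone names and states come back from the API):

```python
# Sketch: inspect the availability zones of one region.
# Each region (e.g. us-east-1) contains several independent
# availability zones (e.g. us-east-1a, us-east-1b, ...).
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
for zone in conn.get_all_zones():
    print("%s is %s" % (zone.name, zone.state))
```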

Virtual hardware does not last as long as real hardware

awe.sm has been using AWS for about three years. Within this period, the maximum lifetime of a virtual machine was about 200 days. The probability that a machine enters the "retired" state after this period is very high. Furthermore, Amazon's "retirement process" is unpredictable. Sometimes you are notified ten days in advance that a virtual machine is going to be shut down; sometimes the retirement notification email arrives two hours after the machine has already failed. While it is relatively simple to start a new virtual machine, you must be aware that an automated deployment solution is needed early on.
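One way to take the surprise out of the retirement process is to poll the EC2 instance status API for scheduled events instead of relying on the notification e-mail alone. A small boto 2.x sketch of my own (not awe.sm's tooling):

```python
# Sketch: list scheduled events (such as instance retirement) for all
# instances in a region, so a pending shutdown is visible in monitoring.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

for status in conn.get_all_instance_status():
    for event in status.events or []:
        # event.code is e.g. 'instance-retirement' or 'system-reboot'
        print("%s %s (%s) before %s" % (
            status.id, event.code, event.description, event.not_before))
```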

Use more than one availability zone and plan redundancy across zones

awe.sm's experience is that an entire availability zone is more likely to fail than a single virtual machine. For planning failure scenarios this means that having a master and a slave in the same zone is as useless as having no slave at all: if the master fails, it is quite possibly because the whole availability zone is unavailable.
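In practice that means explicitly placing the master and the slave in different zones. A minimal boto 2.x sketch, with a made-up AMI ID and zones chosen purely for illustration:

```python
# Sketch: launch the master and the slave in different availability
# zones via the "placement" parameter. AMI ID, instance type and zones
# are placeholders, not a recommended configuration.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

master = conn.run_instances('ami-12345678', instance_type='m1.small',
                            placement='us-east-1a').instances[0]
slave = conn.run_instances('ami-12345678', instance_type='m1.small',
                           placement='us-east-1b').instances[0]

print("master: %s, slave: %s" % (master.id, slave.id))
```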

Use multiple regions

The US-EAST region is the best-known, oldest and cheapest of all AWS regions worldwide. However, it is also very error-prone; examples are the outages in April 2011, March 2012 and June 2012 (twice). awe.sm believes that this frequent region-wide instability comes down to the same cause: Amazon EBS.

Confidence in Amazon EBS is gone

AWS recommends storing all your data on Amazon Elastic Block Store (EBS) volumes. This makes sense: if a virtual machine goes down, the EBS volume can be attached to a new virtual machine without losing data. EBS volumes are also meant to hold snapshots and backups of databases or operating systems. However, awe.sm sees several challenges in using EBS.
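The basic workflow looks roughly like this in boto 2.x; the volume and instance IDs are placeholders (a sketch of general EBS handling, not awe.sm's setup):

```python
# Sketch: attach a surviving EBS volume to a replacement instance and
# take a point-in-time snapshot as a backup.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Re-attach the existing volume to a freshly started instance
conn.attach_volume('vol-11111111', 'i-0aaaaaaa', '/dev/sdf')

# Create a snapshot of the volume as a backup
conn.create_snapshot('vol-11111111', description='nightly backup')
```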

I/O rates of EBS volumes are bad

awe.sm's experience is that the I/O rates of EBS volumes are significantly worse than those of the local storage on the virtual host (ephemeral storage). Since EBS volumes are essentially network drives, their performance is simply not good. AWS now offers Provisioned IOPS to give EBS volumes higher performance, but awe.sm finds them too unattractive because of the price.
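For completeness, a Provisioned IOPS volume is requested by passing a volume type and an IOPS value when creating the volume; the figures below are arbitrary examples (boto 2.x sketch):

```python
# Sketch: create a 100 GB Provisioned IOPS ('io1') volume with a
# reserved rate of 1000 IOPS. Size, zone and IOPS are example values.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

piops_volume = conn.create_volume(100, 'us-east-1a',
                                  volume_type='io1', iops=1000)
print(piops_volume.id)
```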

EBS fails at the regional level, not per volume

awe.sm found that EBS shows only two kinds of behavior: either all EBS volumes work or none do! Two of the three major AWS outages were due to problems with Amazon EBS. If your disaster recovery is built on moving EBS volumes around but the downtime is caused by an EBS failure, you have a problem. awe.sm has struggled with exactly this problem many times.

The failure behavior of EBS on Ubuntu is very serious

Since EBS volumes are disguised as block devices, they cause problems in the Linux operating system, and awe.sm has had very bad experiences with this. A failing EBS volume can cause an entire virtual machine to lock up, leaving it inaccessible and affecting even operations that have no direct need for disk activity.

Many services of the Amazon Cloud rely on Amazon EBS

Because many other AWS services are built on EBS, they fail when EBS fails. These include, for example, the Elastic Load Balancer (ELB), the Relational Database Service (RDS) and Elastic Beanstalk. As awe.sm noted, EBS is nearly always at the core of major Amazon outages. If EBS fails and traffic is then supposed to be shifted to another region, this is not possible, because the load balancer itself runs on EBS. In addition, no new virtual machine can be started manually, because the AWS Management Console also runs on EBS.

Comment

Reading awe.sm's experiences, I get the impression that Amazon does not live the often-touted "building blocks" principle as much as it should. Even if the point is primarily to offer various cloud services that can be used independently of each other, why make the majority of these services depend on a single service (EBS) and thereby create a single point of failure?


Failures in the cloud are in human nature

If the cloud fails, it has to do with the people who are responsible for it. How could it be otherwise? After all, the systems do not program or configure themselves. And even if the saying goes "no cloud without automation", you really have to say "(no cloud without automation) without human skills". We see our human weaknesses in the cloud particularly when an outage occurs, and this year we have seen quite a few of them.

People make mistakes

And that is okay, as long as we learn from those mistakes. At one vendor or another, that was not the case this year. I will not name names here, because those concerned certainly know themselves that they need to do much better. But if an emergency generator fails twice and all the emergency plans fail as well, then something is wrong!
Admittedly, the outage I am writing about was not primarily a human error; against a hurricane and storm we are helpless. But the cascade of failures that happened during the storm is inexplicable.

Other errors that happened this year can be attributed directly to people. There were, for example, misconfigured network interfaces that flooded a vendor's network with data and thereby checkmated it. Or a provider recently changed its load balancer configuration, which led to a prolonged outage.

As I said, people make mistakes. So we should not always blame "the big bad cloud", but rather look behind the façade and keep in mind that the people sitting there are humans just like you and me.

Along these lines, let’s have a better 2013. 🙂