Categories
Comment

The Windows Azure Storage outage shows: winners' crowns mean nothing

The global outage of Windows Azure Storage, caused by an expired SSL certificate, has once again made big waves on Twitter. The ironic aftertaste comes mainly from the fact that Windows Azure was only recently named the “Leader in Cloud Storage” by Nasuni, standing out especially in performance, availability and write error rates.

Back to business

Naturally, the test result was duly celebrated on the Windows Azure team’s blog (DE).

“In a study by the Nasuni Corporation, various cloud storage providers such as Microsoft, Amazon, HP, Rackspace and Google were compared. The result: Windows Azure took the winner’s crown in almost all categories (including performance, availability, errors) and is clearly presented as the “Leader in Cloud Storage”. Here is an infographic with the main findings; the entire report is here.”

However, the severe outage, which was felt around the world, brought the Azure team back down to earth with a bump. They should rather return to the daily business and make sure that an EXPIRED SSL CERTIFICATE can never again lead to such a failure. The incident puts the test result into perspective and shows that winners' crowns mean nothing at all. What is really surprising is that it is always avoidable errors that lead to such catastrophic outages. Another example is the leap year bug of 2012.

Categories
News @en

Microsoft Windows Azure Storage experienced a global(!) outage due to an expired SSL certificate

Since Friday, February 22, 12:44 PM PST, Microsoft Windows Azure Storage has been suffering massive problems worldwide(!), caused by an expired SSL certificate. The outage has also affected services that depend on Azure Storage (see screenshot). Microsoft is currently working to restore all services, and first test deployments appear to be successful. More can be found on the Windows Azure Service Dashboard.


A pity, since Windows Azure Storage had only recently won a Nasuni performance comparison against Amazon S3.

Humor on

Humor off

Categories
Management @en

Amazon Web Services suffered a 20-hour outage over Christmas

After a rather bumpy 2012 with several heavy outages, the cloud infrastructure of Amazon Web Services again experienced problems over Christmas. During a 20-hour outage several big customers were affected, including Netflix and Heroku. This time the main problem was Amazon's Elastic Load Balancer (ELB).

Region US-East-1 is a very big problem

This outage is the latest in a series of catastrophic failures in Amazon's US-East-1 region. It is the oldest and most popular region in Amazon's cloud computing infrastructure. That this new outage occurred precisely in US-East-1 raises new questions about the stability of this region and about what Amazon has actually learned and improved from past outages. Amazon customer awe.sm had recently criticized the Amazon cloud, and especially Amazon EBS (Amazon Elastic Block Store) and the services that depend on it.

Amazon Elastic Load Balancer (ELB)

Besides the Amazon Elastic Beanstalk API, the outage mainly affected the Amazon Elastic Load Balancer (ELB). Amazon ELB is one of the important services when building a scalable and highly available infrastructure in the Amazon cloud. With ELB, users can distribute load and capacity across different availability zones (Amazon's independent data centers) to ensure availability when problems occur in one data center.
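To illustrate the idea (this is not taken from the article), here is a minimal sketch of how a classic load balancer spanning two availability zones can be created with today's boto3 SDK; the load balancer name and instance IDs are made-up examples.

    import boto3

    # Classic ELB client for the US-EAST-1 region (illustrative values only)
    elb = boto3.client("elb", region_name="us-east-1")

    # Create a load balancer that spans two availability zones,
    # so traffic can keep flowing if one zone has problems.
    elb.create_load_balancer(
        LoadBalancerName="example-web-lb",   # hypothetical name
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 80,
        }],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # Register backend instances running in both zones (hypothetical IDs).
    elb.register_instances_with_load_balancer(
        LoadBalancerName="example-web-lb",
        Instances=[{"InstanceId": "i-0aaa11111111aaaaa"},
                   {"InstanceId": "i-0bbb22222222bbbbb"}],
    )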

Nevertheless: both Amazon Elastic Beanstalk and Amazon ELB rely on Amazon EBS, which is known as the “error-prone backbone of the Amazon Web Services“.

Categories
Analysis

Lessons learned: Amazon EBS is the error-prone backbone of AWS

In a blog article, awe.sm writes about its own experiences with the Amazon Web Services. Besides the benefits the cloud infrastructure brings to awe.sm and other startups, one can also conclude from the article that Amazon EBS is the single point of failure in Amazon's infrastructure.

Problems with Amazon EC2

awe.sm criticizes Amazon EC2's constraints regarding performance and reliability, which customers absolutely have to pay attention to and should incorporate into their own planning. awe.sm sees the biggest problem in the AWS zone concept. The Amazon Web Services consist of several regions distributed worldwide. Within these regions, Amazon divides the infrastructure into so-called availability zones, which are independent data centers. awe.sm mentions three things they have learned from this concept so far.

Virtual hardware does not last as long as real hardware

awe.sm has been using AWS for about three years. Within this period, the maximum lifetime of a virtual machine was about 200 days. The probability that a machine enters the "retired" state after this period is very high. Furthermore, Amazon's "retirement process" is unpredictable. Sometimes you are notified ten days in advance that a virtual machine is going to be shut down; sometimes the retirement notification email arrives two hours after the machine has already failed. While it is relatively simple to start a new virtual machine, you must be aware that it is also necessary to adopt an automated deployment solution early on.
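As a small hedge against the unpredictable retirement process, one can at least poll the scheduled events that AWS publishes for each instance. A minimal sketch with today's boto3 SDK (purely illustrative, not from the awe.sm article):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Scheduled events such as 'instance-retirement' show up in the
    # instance status; polling them regularly gives early warning.
    status = ec2.describe_instance_status(IncludeAllInstances=True)

    for instance in status["InstanceStatuses"]:
        for event in instance.get("Events", []):
            print(instance["InstanceId"], event["Code"], event.get("NotBefore"))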

Use more than one availability zone and plan redundancy across zones

awe.sm's experience is that an entire availability zone is more likely to fail than a single virtual machine. For the planning of failure scenarios this means that having a master and a slave in the same zone is as useless as having no slave at all: if the master fails, it is quite possibly because the whole availability zone is unavailable.
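A minimal sketch of what this means in practice: when starting a database slave, pin it explicitly to a different availability zone than the master. AMI ID, key pair and zone names below are made-up examples.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The master runs in us-east-1a, so the slave is placed in us-east-1b.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
        KeyName="example-key",             # hypothetical key pair
        Placement={"AvailabilityZone": "us-east-1b"},
    )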

Use multiple regions

The US-EAST region is the best-known, as well as the oldest and cheapest, of all AWS regions worldwide. However, it is also very error-prone; examples occurred in April 2011, March 2012 and June 2012 (twice). awe.sm therefore believes that the frequent region-wide instability has the same root cause: Amazon EBS.

Confidence in Amazon EBS is gone

AWS recommends storing all your data on the Amazon Elastic Block Store (EBS). This makes sense: if a virtual machine goes down, the EBS volume can be attached to a new virtual machine without losing data. EBS volumes are also meant to hold snapshots and backups of databases or operating systems. However, awe.sm sees several challenges in using EBS.
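For illustration, re-attaching a surviving EBS volume to a replacement instance and taking a snapshot looks roughly like this with boto3; the volume, instance and device names are hypothetical.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Attach the existing data volume to a freshly started replacement instance.
    ec2.attach_volume(
        VolumeId="vol-0123456789abcdef0",
        InstanceId="i-0fedcba9876543210",
        Device="/dev/sdf",
    )

    # Create a point-in-time snapshot of the volume as a backup.
    ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="nightly backup of the database volume",
    )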

I/O rates of EBS volumes are bad

awe.sm's experience is that the I/O rates of EBS volumes are significantly worse than those of the local storage on the virtual host (ephemeral storage). Since EBS volumes are essentially network drives, their performance is simply not good. AWS now offers Provisioned IOPS to give EBS volumes higher performance, but for awe.sm they are too unattractive because of the price.
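For comparison, this is roughly how a Provisioned IOPS volume is requested with boto3; size, IOPS and zone are made-up example values, and the higher price of exactly this option is what awe.sm finds unattractive.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # A 100 GiB volume with 1,000 provisioned IOPS instead of a standard volume.
    ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=100,
        VolumeType="io1",
        Iops=1000,
    )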

EBS fails at the regional level, not per volume

awe.sm found only two kinds of behavior for EBS: either all EBS volumes work, or none do! Two of the three major AWS outages were due to problems with Amazon EBS. If your disaster recovery is built on moving EBS volumes around, but the downtime is caused by an EBS failure, you have a problem. awe.sm has struggled with exactly this problem many times.

The failure behavior of EBS on Ubuntu is very serious

Since EBS volumes are disguised as block devices, problems arise in the Linux operating system, and awe.sm has had very bad experiences with this. A failing EBS volume can cause the entire virtual machine to lock up, leaving it inaccessible and affecting even operations that do not directly require any disk activity.

Many services of the Amazon Cloud rely on Amazon EBS

Because many other AWS services are built on EBS, they fail when EBS fails. These include, for example, the Elastic Load Balancer (ELB), the Relational Database Service (RDS) and Elastic Beanstalk. As awe.sm notes, EBS is almost always at the core of major outages at Amazon. If EBS fails and traffic is then supposed to be routed to another region, this is not possible because the load balancer itself runs on EBS. In addition, no new virtual machine can be started manually, because the AWS Management Console also runs on EBS.

Comment

Reading awe.sm's experiences, I get the impression that Amazon does not live the often-touted "building blocks" principle as much as it should. Even if the primary point is to offer various cloud services that can be used independently of each other, why make the majority of these services depend on a single service (EBS) and thereby create a single point of failure?

Categories
Comment

Failures in the cloud lie in human nature

If the cloud fails, it has to do with the people responsible for it. How could it be otherwise? After all, the systems do not program or configure themselves. And even if the saying is "no cloud without automation", it should really be "no (cloud without automation) without human skills". We see our human weaknesses in the cloud above all when an outage occurs, and this year in particular we have seen a few of them.

People make mistakes

And that is okay, as long as we learn from those mistakes, which was not the case with one or the other vendor this year. I will not mention any names here, because they certainly know themselves that they need to do much better. But if an emergency generator fails twice and all emergency plans fail as well, something is wrong!
Admittedly, the failure I am writing about was not primarily a human error; against a hurricane and storm we are helpless. But the cascade of failures that happened during the storm is inexplicable.

Other errors that happened this year can be attributed directly to people. For example, there were misconfigured network interfaces that flooded a vendor's network with data and thereby checkmated it. Or a provider recently changed the configuration of its load balancer, which led to a prolonged outage.

As I said, people make mistakes. Therefore one should not always blame "the big bad cloud", but rather look behind the facade and keep in mind that the people sitting there are just humans like you and me.

Along these lines, let’s have a better 2013. 🙂

Categories
Analysis

What companies should learn from the Amazon EC2 problem in North Virginia!

On April 21 we experienced it again: even a highly available cloud like Amazon's is not protected against failures. Of course, critics now once again get the opportunity to argue against using the cloud. But to say it up front: providers like Hootsuite, Mobypicture, Foursquare and Reddit would not have had to be affected by the outage if they had relied on a fault-tolerant architecture.

For this reason it is incomprehensible that these companies point the finger at Amazon and say: "It's not our fault, Amazon is down!" It would be interesting to know who would have been held accountable if these providers had not used a public cloud service but had operated their own data center. One thing is clear: (such) problems can occur in any data center and would then have had significantly more serious consequences. And if the cloud had been used correctly, the problems at Amazon would not have become visible in this form either.

What happened?

Background
The Amazon cloud consists of several regions, distributed across several continents, each of which in turn contains several so-called availability zones. Availability zones are different locations within a region that are designed to operate in isolation and to be unaffected by failures in other availability zones.

By launching instances in separate regions, web applications can be designed to be geographically close to specific customers and to meet legal or other requirements.

Furthermore, applications are protected against the failure of a single location by running instances in separate availability zones.

How exactly the concept works can be read in "Das Konzept hinter den AWS Regionen und Verfügbarkeitszonen".
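As a purely illustrative sketch (not part of the original article), this is how regions and availability zones show up in the API when queried with today's boto3 SDK:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # List all regions available to the account ...
    for region in ec2.describe_regions()["Regions"]:
        print(region["RegionName"])

    # ... and the availability zones inside the current region.
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone["State"])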

The problem
At 1:41 AM PDT, the problem began with latencies and error rates within the EBS volumes and with connection problems to the EC2 instances in several availability zones of the US-EAST-1 region.

As a result, the websites and web applications running on the EC2 instances were no longer reachable.

Design For Failure

First of all, one thing must be clear: Amazon "only" provides virtual resources for building your own virtual infrastructure. Amazon is NOT responsible for the application and its functionality; it only provides the infrastructure on which the applications run.

“Everything fails, all the time” Werner Vogels, CTO Amazon.com

For this reason Amazon advises: "Design for failure!" and gives tips on how to implement it:

  • Avoid single points of failure
  • Assume everything fails, and design backwards
  • Goal: Applications should continue to function even if the underlying physical hardware fails or is removed or replaced.

And names ways to realize this design:

  • Use Elastic IP addresses for consistent and re-mappable routes
  • Use multiple Amazon EC2 Availability Zones (AZs)
  • Create multiple database slaves across AZs
  • Use Amazon Elastic Block Store (EBS) for persistent file systems

Design for failure! is in principle nothing new. Basically, the same should be observed with any other non-cloud provider and in your own data center. The developer should always be obliged to protect his application against hardware or other failures and to ensure its availability.
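As a small illustration of the first two tips from the list above, the following boto3 sketch re-maps an Elastic IP from a failed instance to a standby instance in another availability zone; all IDs are hypothetical and the snippet is only meant as an example, not as Amazon's prescribed implementation.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The Elastic IP keeps the public endpoint stable; on failure it is simply
    # re-associated with a standby instance running in a different zone.
    ec2.associate_address(
        AllocationId="eipalloc-0123456789abcdef0",  # hypothetical Elastic IP
        InstanceId="i-0bbb22222222bbbb2",           # standby in another AZ
        AllowReassociation=True,
    )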

Hootsuite and Mobypicture, for example, made the mistake of concentrating on a single AWS region instead of distributing their service across the entire Amazon cloud. Especially for Mobypicture, as a European provider based in the Netherlands, this is a little surprising. The German site of Foursquare, on the other hand, was reachable and ran stably, as did that of Reddit.

Don't put all your eggs in one basket

For every company that offers its service via the cloud, the rule is therefore: "Use a provider's entire cloud and do not rely on just a single provider!"

Distributing an application across several regions and availability zones of one provider drastically increases the application's availability. In addition, a multi-vendor strategy is absolutely necessary. Moreover, fallback scenarios must be developed right from the start in order to be prepared for a sudden outage.

The point is, for example, to build systems that automatically set up a mirrored infrastructure at another provider in case of failure. Even better would be to use several providers in parallel and distribute the services across multiple providers, for instance by running production instances at different providers in parallel in order to spread the risk.
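A minimal sketch of the multi-region part of this idea with boto3; spreading across different vendors would additionally require separate SDKs or a cross-provider library. Region names and AMI IDs below are made-up examples.

    import boto3

    # One AMI per region (AMI IDs are region-specific and purely hypothetical).
    DEPLOYMENTS = {
        "us-east-1": "ami-0123456789abcdef0",
        "eu-west-1": "ami-0fedcba9876543210",
    }

    # Run the same workload in two independent regions in parallel,
    # so a region-wide outage does not take the whole service down.
    for region, ami in DEPLOYMENTS.items():
        ec2 = boto3.client("ec2", region_name=region)
        ec2.run_instances(
            ImageId=ami,
            InstanceType="m1.small",
            MinCount=1,
            MaxCount=1,
        )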

Because: cloud computing does not solve all problems automatically…

For the future

Amazon must by no means be absolved of all blame; they advertise their own availability far too prominently for that. Nevertheless, cloud computing is about much more than just using a few virtual resources, and every cloud user is responsible for the availability of his own application. Thanks to the cloud, he has sufficient means and ways at his disposal to do so!