Amazon has significantly upgraded its cloud infrastructure for big data. With AWS Data Pipeline, a service (currently in beta) is now available that automatically moves and processes data across different systems. In addition, Amazon Redshift is a data warehouse in the cloud that is claimed to be up to ten times faster than previously available solutions.
AWS Data Pipeline
With AWS Data Pipeline, Amazon wants to improve access to the steadily growing amounts of data that live on distributed systems and in different formats. For example, the service can load text files from Amazon EC2, process them, and save the results to Amazon S3. The AWS Management Console acts as the central hub: here the pipelines, including their data sources, conditions, targets, and commands, are defined. Schedules determine when which job is processed. The AWS Data Pipeline thus determines from which system, and under which conditions, the data is loaded and processed, and where it is stored afterwards.
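To make the idea of a pipeline definition more concrete, here is a minimal sketch of how such a definition could be laid out, expressed as a Python dictionary in JSON-like form. All object names and field names (MyCopyActivity, MyS3Output, etc.) are hypothetical and only illustrate the concept of sources, activities, schedules, and targets; this is not a verbatim AWS Data Pipeline schema.

```python
# Illustrative sketch of a pipeline definition: source -> activity -> target,
# driven by a schedule. Names and fields are hypothetical.
import json

pipeline_definition = {
    "objects": [
        {   # Schedule: run the pipeline once per day.
            "id": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
        },
        {   # Source: a text file produced on an Amazon EC2 instance.
            "id": "MyInputData",
            "type": "DataNode",
            "path": "input/logs.txt",
        },
        {   # Activity: process the input and write it to the output.
            "id": "MyCopyActivity",
            "type": "CopyActivity",
            "input": "MyInputData",
            "output": "MyS3Output",
            "schedule": "DailySchedule",
        },
        {   # Target: an Amazon S3 location for the processed data.
            "id": "MyS3Output",
            "type": "DataNode",
            "path": "s3://my-bucket/processed/",
        },
    ]
}

print(json.dumps(pipeline_definition, indent=2))
```

The point of the structure is that each object references the others by id, so the service can work out the dependency graph and decide when which job has to run.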
The data processing itself can be conducted directly in the Amazon cloud on EC2 instances or in a company's own data center. For the latter, the open source tool Task Runner is used, which communicates with the AWS Data Pipeline. Task Runner must run on each system that processes data.
Amazon Redshift
Amazon's cloud data warehouse Amazon Redshift helps to analyze huge amounts of data within a short time frame. It can store up to 1.6 petabytes of data, which can be queried using SQL. The service is basically charged on a pay-as-you-use basis. However, customers who sign a three-year contract and fully utilize their virtual infrastructure pay from 1,000 USD per terabyte per year. Amazon compares this with numbers from IBM, where a data warehouse costs between 19,000 USD and 25,000 USD per terabyte per year.
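The price gap can be illustrated with a quick back-of-the-envelope calculation. The per-terabyte figures are the ones quoted above; the 10-terabyte warehouse size is just an example, and an actual bill would of course depend on the chosen reservation and usage.

```python
# Back-of-the-envelope comparison of the quoted prices per terabyte per year.
redshift_per_tb_year = 1_000        # USD, three-year reserved, fully utilized
ibm_low, ibm_high = 19_000, 25_000  # USD, range quoted for IBM

data_tb = 10  # example: a 10-terabyte warehouse

redshift_cost = data_tb * redshift_per_tb_year
ibm_cost_low = data_tb * ibm_low
ibm_cost_high = data_tb * ibm_high

print(f"Redshift: {redshift_cost:,} USD/year")
print(f"IBM:      {ibm_cost_low:,} - {ibm_cost_high:,} USD/year")
print(f"Factor:   {ibm_low // redshift_per_tb_year}x to {ibm_high // redshift_per_tb_year}x")
```

Even at the low end of IBM's quoted range, that is a factor of 19 per terabyte and year.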
The first Amazon Redshift beta users are Netflix, JPL, and Flipboard, which were able to run their queries 10 to 150 times faster compared to their current systems.
Amazon Redshift can be used either as a single-node cluster with one server and a maximum of 2 terabytes of storage, or as a multi-node cluster consisting of at least two compute nodes and one leader node. The leader node is responsible for connection management, parsing the queries, creating execution plans, and distributing the work to the individual compute nodes, where the main processing is done. Compute nodes are available as hs1.xlarge with 2 terabytes of storage and as hs1.8xlarge with 16 terabytes of storage. One cluster can contain a maximum of 32 hs1.xlarge or 100 hs1.8xlarge compute nodes, which results in a maximum storage capacity of 64 terabytes and 1.6 petabytes, respectively. All compute nodes are connected over a separate 10 gigabit/s backbone.
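The maximum capacities follow directly from the node counts and the per-node storage quoted above; a quick sanity check:

```python
# Verify the maximum cluster capacities from the per-node numbers above.
HS1_XLARGE_TB = 2    # terabytes of storage per hs1.xlarge node
HS1_8XLARGE_TB = 16  # terabytes of storage per hs1.8xlarge node

max_xlarge_nodes = 32
max_8xlarge_nodes = 100

xlarge_capacity_tb = max_xlarge_nodes * HS1_XLARGE_TB     # 32 * 2  = 64 TB
large_capacity_tb = max_8xlarge_nodes * HS1_8XLARGE_TB    # 100 * 16 = 1,600 TB

print(f"hs1.xlarge cluster:  {xlarge_capacity_tb} TB")
print(f"hs1.8xlarge cluster: {large_capacity_tb} TB = {large_capacity_tb / 1000} PB")
```

1,600 terabytes is the 1.6 petabytes mentioned as Redshift's upper limit.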
Despite the competition, Amazon keeps expanding its cloud services portfolio. As a result, one can sometimes get the impression that all the other IaaS providers are marking time, considering the innovative power of Amazon Web Services. I can only stress here once again: value-added services are the future of infrastructure-as-a-service, or put differently, don't compete against the Amazon Web Services with infrastructure alone.
If we take a look at the latest developments, we see a steadily increasing demand for solutions for processing large amounts of structured and unstructured data. Barack Obama's election campaign is just one use case that shows how important the possession of quality information will be in order to gain competitive advantages in the future. And even though many see Amazon Web Services "just" as a pure infrastructure-as-a-service provider (I do not), Amazon is, more than any other (IaaS) provider, in a position to play at the top in the battle for big data solutions, and not only because of the knowledge gained from operating Amazon.com.