Big Data on AWS is a deep dive into cloud-based big data solutions using Amazon Elastic MapReduce (EMR) and Amazon Redshift. In this session, you will learn how to create big data environments and apply best practices for designing them for security and cost-effectiveness. Demonstrations include processing log data with Amazon EMR and showing how easily a Redshift data warehouse can be provisioned.
Big Data on AWS - AWS Washington D.C. Symposium 2014
1. AWS Government, Education, and Nonprofits Symposium
Washington, DC | June 24, 2014 - June 26, 2014
AWS Big Data
Jon Einkauf
jeinkauf@amazon.com
2. Agenda
• Brief overview of AWS Big Data services
• Demo (Query logs in S3 using Amazon EMR)
• Q&A
3. Big Data
Technologies and techniques for working productively with data, at any scale.
4. Big data and AWS
Big data: Potentially massive datasets
Cloud computing: Virtually unlimited capacity
5. Big data and AWS
Big data: Iterative, experimental style of data manipulation and analysis
Cloud computing: Iterative, experimental style of infrastructure deployment/usage
6. Big data and AWS
Big data: Frequently not a steady-state workload; peaks and valleys
Cloud computing: At its most efficient with highly variable workloads
7. Big data and AWS
Big data: “Time to results” is critical; shared resources are a bottleneck
Cloud computing: Parallel compute projects give each workgroup more autonomy and faster results
8. Lower costs | Ease of use
9. Lower costs
Pay as you go
Only pay for what you use
No capital investment
10. Ease of use
Programmable
Integrate with existing tools
Low admin
Easy to configure
11. Use the right tools
Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce, AWS Data Pipeline
12. Amazon S3
• Highly scalable object store
• 99.999999999% durability
• Encryption
• Data lifecycle management
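For instance, a minimal sketch of encrypted storage and lifecycle management using the Python boto3 SDK (which post-dates this talk; the 2014-era boto library exposed equivalent calls). The bucket name, key, and lifecycle rule here are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Store an object with server-side encryption (SSE-S3).
s3.put_object(
    Bucket="my-log-bucket",                   # hypothetical bucket
    Key="logs/2014/06/24/access.log",
    Body=open("access.log", "rb"),
    ServerSideEncryption="AES256",
)

# Lifecycle management: archive old logs to Glacier, expire after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-old-logs",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]},
)
```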
13. Amazon Kinesis
• Real-time processing
• High throughput
• Elastic
• Integrates with EMR, S3, Redshift, DynamoDB
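A minimal producer sketch with boto3; the stream name and event shape are hypothetical:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Push one log event into the stream; records with the same partition
# key land on the same shard, preserving per-user ordering.
event = {"user": "u123", "action": "page_view", "ts": "2014-06-24T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream",      # hypothetical stream
    Data=json.dumps(event),
    PartitionKey=event["user"],
)
```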
14. Amazon DynamoDB
• NoSQL database
• Seamless scalability
• Low admin
• Single-digit millisecond latency
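A sketch of the key-value access pattern with boto3; the table name and item attributes are hypothetical:

```python
import boto3

table = boto3.resource("dynamodb").Table("GameScores")  # hypothetical table

# Key-value reads and writes return in single-digit milliseconds.
table.put_item(Item={"PlayerId": "u123", "Game": "alpha", "Score": 9001})
resp = table.get_item(Key={"PlayerId": "u123", "Game": "alpha"})
print(resp["Item"]["Score"])
```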
15. Amazon Redshift
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• Low cost ($1K/TB/year with a 3-year reservation)
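Provisioning a warehouse is one API call; a sketch with boto3, where the identifiers and credentials are hypothetical (dw2.large was an SSD-based node type of this era):

```python
import boto3

redshift = boto3.client("redshift")

# Provision a small two-node data warehouse cluster.
redshift.create_cluster(
    ClusterIdentifier="demo-dw",
    NodeType="dw2.large",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="Str0ngPassw0rd1",
)
```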
16. Amazon Elastic MapReduce (EMR)
• Managed Hadoop clusters
• MapReduce, Hive, Pig, Impala, HBase, Spark, Accumulo, etc.
• Integrates with S3, DynamoDB, Redshift, Data Pipeline, Kinesis
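A hedged sketch of launching a transient managed Hadoop cluster with boto3; the names, sizes, roles, and log bucket are hypothetical:

```python
import boto3

emr = boto3.client("emr")

# Launch a small cluster with Hive installed; it shuts down when idle.
resp = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    LogUri="s3://my-log-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(resp["JobFlowId"])
```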
17. AWS Data Pipeline
• Data-driven workflows
• Integrates with EMR, EC2, S3, Redshift, DynamoDB, SNS
• Process and move data between AWS and your own data center
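A sketch of the basic workflow lifecycle with boto3; the pipeline name is hypothetical, and a real definition would add data nodes (e.g. an S3 input) and activities (e.g. an EMR job):

```python
import boto3

dp = boto3.client("datapipeline")

# Create a pipeline shell, attach a (minimal) definition, activate it.
created = dp.create_pipeline(name="nightly-etl", uniqueId="nightly-etl-1")
pid = created["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pid,
    pipelineObjects=[{
        "id": "Default", "name": "Default",
        "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
    }],
)
dp.activate_pipeline(pipelineId=pid)
```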
18. Log Analysis Example
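The demo queries raw logs sitting in S3 from a Hive step on EMR. A hedged sketch of what that can look like; the cluster ID, bucket, script path, and table schema are all hypothetical:

```python
import boto3

emr = boto3.client("emr")

# The Hive script (stored in S3) might define an external table over the
# raw logs and aggregate them, e.g.:
#
#   CREATE EXTERNAL TABLE access_log (
#       ip STRING, ts STRING, request STRING, status INT, bytes INT)
#   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
#   LOCATION 's3://my-log-bucket/logs/';
#
#   SELECT status, COUNT(*) FROM access_log GROUP BY status;

# Submit the script as a step to a running cluster ("j-..." is a placeholder).
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "query-access-logs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-log-bucket/scripts/status_counts.q"],
        },
    }],
)
```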
20. Big Data on AWS
Brand new course on Big Data
aws.amazon.com/training/course-descriptions/bigdata
21. AWS Big Data Test Drives
APN Partner-provided labs
aws.amazon.com/testdrive/bigdata
22. AWS Training & Events
Webinars, Bootcamps, and Self-Paced Labs
https://aws.amazon.com/training
aws.amazon.com/events
23. Thank you!
jeinkauf@amazon.com
Editor's notes
From TBs to PBs, we have the capacity and scale to handle your largest big data workloads
When we think of big data, we think both of the proliferation of digital information and of the innovations that exploit or extract information from that data to drive sales, efficiency, better health, analysis, predictions, recommendations, and innovation
More specifically, we think cloud computing is a fundamental component of any big data strategy due to its inherent benefits
You can start and stop on demand, run big data workloads in parallel as you test out new ideas, allowing you to explore without commitments
With services such as Auto Scaling and elastic load balancing, you can dial up and down the amount of infrastructure you need for your variable or experimental workloads
The total time also includes waiting to get access to those IT resources; with the cloud you can be up and running in minutes, and in parallel
In a sense, the AWS cloud democratizes big data for everyone to use. It rests on two foundational benefits, lower costs and ease of use, and focusing on these key tenets directs how we innovate
Lack of constraints leads to new usage models
Gives control back to individual development teams
Fail-fast (and fail-cheap) opens up exploratory style
Many customers create 100s of Amazon EMR clusters per day
Classic burst-y workload perfect for the cloud
Big data / HPC clusters themselves are parallelized resources
Can you build a faster on-premises cluster? Yes, but…
Usually a shared/contended resource; in the cloud, each user/workgroup gets their own cluster
Cloud is often the fastest platform based on “MTTJC” (Mean Time To Job Completion)
We provide all of our services with a self-service API. We also provide managed services so you don't have to do the back-end administration, and you can configure your infrastructure with code, scripts, or point-and-click from our console, all while maintaining compatibility with your current tools.
While I won't be able to go over all of our big data services, I would like to spend some time introducing several key big data services that are designed for high availability and durability, offered as managed services where we provision the infrastructure on your behalf, so you can get significant big data storage and analytics with a few clicks or API calls.
Fundamental storage at internet scale: it can store any number of objects from 1 byte to 5 TB in size.
It is engineered for 11 9's of durability, replicating your data at least three times across three distinct physical data centers we call Availability Zones.
We have customers such as Dropbox, Spotify, and Pinterest storing billions of objects: photos, videos, songs, or any other type of file.
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources.
For instance, instead of having to process log files in batch, you can stream log events into Kinesis and then have workers built with the Kinesis Client Library read from the stream, process the information, and drive a real-time dashboard.
Later today, Adi Krishnan, the product manager for Amazon Kinesis, will give a deep dive into the service
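A simplified polling consumer sketched with the raw boto3 API rather than the Kinesis Client Library (which adds checkpointing and load balancing across shards); the stream name is hypothetical:

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Read from the first shard of the stream and feed a dashboard.
shard = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
it = kinesis.get_shard_iterator(
    StreamName="clickstream", ShardId=shard,
    ShardIteratorType="LATEST")["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=it, Limit=100)
    for record in out["Records"]:
        print(record["Data"])   # update counters, push to a dashboard, etc.
    it = out["NextShardIterator"]
    time.sleep(1)               # stay under per-shard read limits
```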
DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.
Runs on solid-state drives for high-speed performance at scale. You can provision reads and writes to a table without having to worry about the administration of scaling or sharding; it is all done behind the scenes for you.
For instance, in real-time bidding, three rounds of bidding over which ad to place on a website happen in less than 200 milliseconds while the page loads, so single-digit millisecond latency is needed to determine what ad to place and what price to bid for that impression.
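Provisioning reads and writes is a single API call; a sketch with boto3, where the table name and capacities are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Dial throughput up for a traffic spike; partitioning and resharding
# happen behind the scenes.
dynamodb.update_table(
    TableName="GameScores",
    ProvisionedThroughput={
        "ReadCapacityUnits": 500,
        "WriteCapacityUnits": 200,
    },
)
```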
Provision a petabyte-scale cluster to handle complex SQL queries in just a few minutes.
You can get either an HDD-based cluster or the recently introduced SSD-based cluster, which is smaller in total size but offers higher performance per GB.
This data warehouse solution costs about a tenth of what traditional solutions of comparable size cost.
Redshift can drive business intelligence tools such as Jaspersoft or MicroStrategy because it supports standard SQL and connects using ODBC or JDBC drivers.
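Because Redshift speaks the PostgreSQL wire protocol, any standard driver works; a sketch using the psycopg2 driver, with a hypothetical endpoint, credentials, and table:

```python
import psycopg2

conn = psycopg2.connect(
    host="demo-dw.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="analytics",
    user="admin",
    password="Str0ngPassw0rd1",
)
cur = conn.cursor()
cur.execute("SELECT status, COUNT(*) FROM access_log GROUP BY status;")
for status, n in cur.fetchall():
    print(status, n)
conn.close()
```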
When you think of big data these days, Hadoop is always an integral part. When you take the benefits of the cloud along with the computational paradigm of MapReduce, you get Elastic MapReduce. Customers have launched millions of clusters to run big data workloads on Amazon Elastic MapReduce.
A key tool in the toolbox for 'Big Data' challenges. Makes possible analytics processes that were previously not feasible. Cost-effective when leveraged with the EC2 Spot market.
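For the Spot-market point, a hedged sketch of launching core nodes as Spot instances with boto3; instance types, counts, and the bid price are hypothetical:

```python
import boto3

emr = boto3.client("emr")

# Master on demand for stability; core nodes bid on the Spot market.
emr.run_job_flow(
    Name="spot-batch",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "BidPrice": "0.10", "InstanceType": "m5.xlarge",
             "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```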
Speaker Notes:
We have just released "Big Data on AWS", a new technical training course for individuals who are responsible for implementing big data environments, namely Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects. This course is designed to teach technical end users how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also cover how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.
Upcoming classes include:
April 22 – Redwood City, CA
May 6 – Sao Paulo, Brazil
May 20 – Luxembourg
May 21 – Rio de Janeiro, Brazil
June 3 – New York, NY; Redwood City, CA; and Columbia, MD
June 4 – Porto Alegre, Brazil
Audience
Individuals responsible for implementing big data environments: Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects
Objectives
Understand the architecture of an Amazon EMR cluster
Choose appropriate AWS data storage options for use with Amazon EMR
Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
Use common programming frameworks for Amazon EMR including Hive, Pig, and Streaming
Work with Amazon Redshift and Spark/Shark to implement big data solutions
Leverage big data visualization software
Choose appropriate security and cost management options for Amazon EMR
Understand the benefits of using Amazon Kinesis for big data
Prerequisites
Basic familiarity with big data technologies, including Apache Hadoop and HDFS
Knowledge of big data technologies such as Pig, Hive, and MapReduce helpful, but not required
Working knowledge of core AWS services and public cloud implementation
AWS Essentials course completion or equivalent experience
Basic understanding of data warehousing, relational database systems, and database design
Format
Instructor-Led & Hands-on Labs
Duration
3 days
Details
aws.amazon.com/training/course-descriptions/bigdata/
Microstrategy
Splunk
QlikView
EMR
Pig
MongoDB
Oracle BI, OBIEE 11g
SAP Hana
Yellowfin BI
AWS is here to help
Thank you very much for your time today; that concludes this presentation.