Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.
This webinar will show you examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
What you'll learn:
• How to quickly and easily launch your first Hadoop cluster in a few steps, shown in a live demonstration.
• Examples of real-world applications and customer successes in production.
• Best practices for maximizing the benefits of using MapR with AWS.
3. Webinar Overview
Submit Your Questions using the Q&A tool.
A copy of today’s presentation will be made available on:
AWS SlideShare Channel @ http://www.slideshare.net/AmazonWebServices/
AWS Webinar Channel on YouTube @ http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw
4. Introducing
Jonathan Fritz
Sr. Product Manager
Amazon Web Services
Steve Wooledge
VP, Product Marketing
MapR Technologies
Bruce Penn
Principal Sales Engineer
MapR Technologies
5. What We’ll Cover
• Elastic MapReduce (EMR): Hadoop in the cloud
– Elastic clusters tailored for your workflows
– Best container to run Hadoop in the AWS Ecosystem
• Introduction to MapR’s Hadoop Platform
– Defining features
– Increased performance
• Case Studies: MapR + Elastic MapReduce
• Q&A
6. Hadoop in the Cloud
Using MapR and Amazon Elastic MapReduce to unlock Big Data
Jonathan Fritz, Sr. Product Manager, Amazon Web Services
Steve Wooledge, VP, Product Marketing, MapR Technologies
7. Agenda
• Elastic MapReduce (EMR): Hadoop in the cloud
– Elastic clusters tailored for your workflows
– Best container to run Hadoop in the AWS Ecosystem
• Introduction to MapR’s Hadoop Platform
– Defining features
– Increased performance
• Case Studies: MapR + Elastic MapReduce
• Q+A
8. The Three V’s: the drivers behind Big Data
Variety
Velocity
Volume
• YouTube users upload 48 hours of new video every minute of the day
• Twitter sees roughly 175 million tweets every day
• Facebook analyzes 30+ petabytes of user-generated data
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide
• 2.7 zettabytes of data exist in the digital universe today
• Data production will be 44 times greater in 2020 than in 2009
9. Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
10. Challenges with managing Hadoop
On-Premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
Cloud
• Hard to tightly integrate with AWS storage services
• Independently manage and monitor clusters
12. Why Amazon Elastic MapReduce?
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple AWS datastores
• Unique features and ecosystem support
21. Choose your instance types
Try out different configurations to find your optimal architecture.
CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
Memory: m1.large, m2.2xlarge, m2.4xlarge
Disk: hs1.8xlarge
22. Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.
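As a rough illustration of the transient versus long-running choice, the sketch below builds two request dictionaries shaped like the `Instances` section of an EMR `RunJobFlow` call. Nothing is sent to AWS. The cluster names, instance type, and node count are hypothetical values chosen for the example, and the `cluster_request` helper is not from the webinar.

```python
def cluster_request(name, keep_alive):
    """Return a dict shaped like an EMR RunJobFlow request (built locally, never sent)."""
    return {
        "Name": name,  # hypothetical cluster name
        "Instances": {
            "MasterInstanceType": "m1.large",  # assumed instance type
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 3,  # assumed cluster size
            # False: the cluster terminates once its steps finish (transient).
            # True: the cluster stays up and waits for more work (long running).
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
    }


transient = cluster_request("nightly-report", keep_alive=False)
long_running = cluster_request("adhoc-queries", keep_alive=True)
```

A transient cluster only accrues charges while its steps run, which is what makes short-term use pay-for-what-you-need.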
26. Resizable clusters
Easy to add and remove compute capacity on your cluster.
Match compute demands with cluster sizing.
[Diagram label: 10 hours]
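The sizing idea on this slide can be sketched as simple arithmetic. The example assumes a job whose work divides evenly across nodes (perfect linear scaling, which real Hadoop jobs only approximate) and uses a hypothetical workload of 100 node-hours. Neither figure comes from the webinar.

```python
def hours_to_finish(work_node_hours, nodes):
    """Wall-clock hours for a job, assuming work splits evenly across nodes."""
    return work_node_hours / nodes


def node_hours_used(work_node_hours, nodes):
    """Total node-hours consumed: wall-clock hours multiplied by node count."""
    return hours_to_finish(work_node_hours, nodes) * nodes


WORK = 100  # hypothetical job size in node-hours

small_cluster_hours = hours_to_finish(WORK, nodes=10)
large_cluster_hours = hours_to_finish(WORK, nodes=50)
```

Under this assumption the larger cluster finishes sooner while consuming the same number of node-hours, so resizing trades time against cluster size rather than against total compute.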
27. Use Spot and Reserved Instances.
Minimize costs by supplementing on-demand pricing.
28. Easy to use Spot Instances
Name-your-price supercomputing to minimize costs.
Spot for task nodes: up to 90% off EC2 on-demand pricing.
On-demand for core nodes: standard EC2 pricing for on-demand capacity.
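One way to express the split described on this slide is as a list of instance groups shaped like the `InstanceGroups` field of an EMR `RunJobFlow` request. This is a minimal sketch that only builds the data structure locally. The `instance_groups` helper, the group names, the instance type, the counts, and the bid price are all hypothetical.

```python
def instance_groups(core_count, task_count, instance_type, spot_bid_usd):
    """Build instance groups: on-demand master and core nodes, Spot task nodes."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes store HDFS data, so they stay on on-demand capacity.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes only run tasks, so losing them to a Spot interruption
        # costs compute capacity but not stored data.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": str(spot_bid_usd),  # the API expects the bid as a string
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


groups = instance_groups(core_count=4, task_count=10,
                         instance_type="m1.large", spot_bid_usd=0.10)
```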
29. 24/7 clusters on Reserved Instances
Minimize cost for consistent capacity.
Reserved Instances for long running clusters.
Up to 65% off on-demand pricing.
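To make the quoted discounts concrete, the sketch below applies the "up to 90%" Spot and "up to 65%" Reserved figures from the slides to a hypothetical on-demand rate. The rate of $0.50 per node-hour and the node counts are assumptions for illustration, and the discounts are best-case ceilings rather than guaranteed prices.

```python
def hourly_cost(nodes, on_demand_rate, discount=0.0):
    """Cluster cost per hour for a node count, base rate, and fractional discount."""
    return nodes * on_demand_rate * (1.0 - discount)


RATE = 0.50  # hypothetical on-demand price in USD per node-hour

core_on_demand = hourly_cost(4, RATE)                  # core nodes, no discount
task_on_demand = hourly_cost(10, RATE)                 # task nodes at full price
task_spot = hourly_cost(10, RATE, discount=0.90)       # best-case Spot discount
core_reserved = hourly_cost(4, RATE, discount=0.65)    # best-case Reserved discount

spot_savings = task_on_demand - task_spot
reserved_savings = core_on_demand - core_reserved
```

Actual savings depend on the Spot market price at the time and on the Reserved Instance terms chosen, so these numbers are upper bounds on the benefit.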
30. Your data, your choice.
Easy to integrate Elastic MapReduce with your datastores.
32. Using Amazon S3 and On-Cluster Storage
Data Sources
Transient EMR cluster for batch map/reduce jobs for daily reports
Long running EMR cluster holding data on the cluster in a NoSQL database
Weekly Report
Ad-hoc Query
Data aggregated and stored in Amazon S3
33. Use Amazon EMR with Redshift and S3
Data Sources
Daily data aggregated in Amazon S3
Amazon EMR cluster used to process data
Processed data loaded into Amazon Redshift data warehouse
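The final load step in this pipeline is commonly done with Redshift's `COPY` command reading from S3. The sketch below only assembles such a statement as a string. The `redshift_copy_sql` helper, the table name, the bucket path, and the IAM role ARN are hypothetical placeholders, and the helper does no quoting or validation, so it is for illustration only.

```python
def redshift_copy_sql(table, s3_prefix, iam_role_arn, delimiter="|"):
    """Assemble a Redshift COPY statement that loads delimited files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"DELIMITER '{delimiter}';"
    )


# All identifiers below are made-up placeholders.
sql = redshift_copy_sql(
    table="daily_report",
    s3_prefix="s3://example-bucket/emr-output/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

In this arrangement the EMR cluster writes its processed output to the S3 prefix, and the statement is then run against the Redshift cluster by whatever SQL client is in use.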