Amazon Elastic MapReduce:
Deep Dive and Best Practices
Parviz Deyhim
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
What is EMR?
• Hadoop-as-a-service
• MapReduce engine
• Integrated with tools
• Massively parallel
• Integrated with AWS services
• Cost-effective AWS wrapper
[Diagrams: Amazon EMR with HDFS at the core, progressively integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline for data management and analytics languages.]
Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance sizes that match your workload
Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs
and node types
Amazon EMR Introduction
• Integration with Spot market
• 70-80% discount
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Design Patterns
Pattern #1: Transient vs. Alive Clusters
Pattern #2: Core Nodes and Task Nodes
Pattern #3: Amazon S3 as HDFS
Pattern #4: Amazon S3 & HDFS
Pattern #5: Elastic Clusters
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Input and output data persist on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
   • Cluster goes away when job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow
When to use Transient clusters?
If (Data Load Time + Processing Time) * Number Of Jobs < 24 hours
    Use Transient Clusters
Else
    Use Alive Clusters
When to use Transient clusters?
(20 min data load + 1 hour processing time) * 10 jobs ≈ 13 hours < 24 hours → use Transient Clusters
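The 24-hour rule above can be sketched as a quick calculation (a minimal illustration; the function name and hour-based inputs are my own, not from the deck):

```python
def cluster_type(load_hours, processing_hours, number_of_jobs):
    """Apply the 24-hour rule: total daily cluster time decides the pattern."""
    total_hours = (load_hours + processing_hours) * number_of_jobs
    return "transient" if total_hours < 24 else "alive"

cluster_type(20 / 60, 1.0, 10)  # ~13 hours -> "transient"
cluster_type(20 / 60, 1.0, 20)  # ~27 hours -> "alive"
```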
Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence models:
  • Amazon S3
  • Amazon S3 copied to HDFS
  • HDFS, with Amazon S3 as backup
Alive Clusters
• Always keep data safe on Amazon S3 even if you're using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
  • Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
Benefits of Alive Clusters
• Ability to share data between multiple jobs
[Diagram: with transient clusters, two EMR clusters exchange data through Amazon S3; with long-running clusters, jobs share data directly on the cluster's HDFS.]
Benefits of Alive Clusters
• Cost effective for repetitive jobs
[Diagram: jobs arriving throughout the day reuse the same long-running EMR cluster instead of launching a new cluster per job.]
When to use Alive clusters?
If (Data Load Time + Processing Time) * Number Of Jobs > 24 hours
    Use Alive Clusters
Else
    Use Transient Clusters
When to use Alive clusters?
(20 min data load + 1 hour processing time) * 20 jobs ≈ 27 hours > 24 hours → use Alive Clusters
Pattern #2: Core & Task nodes
Core Nodes
• Run TaskTrackers (compute)
• Run DataNodes (HDFS)
[Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS.]
Core Nodes
• You can add core nodes: more HDFS space, more CPU/memory
• You can't remove core nodes, because they hold HDFS data
[Diagrams: core instance group growing as nodes are added; node removal blocked by HDFS.]
Amazon EMR Task Nodes
• Run TaskTrackers only
• No HDFS; read from core nodes' HDFS
[Diagram: task instance group alongside the master and core instance groups.]
Amazon EMR Task Nodes
• You can add task nodes: more CPU power, more memory
• You can remove task nodes at any time (no HDFS data to lose)
[Diagrams: task instance group growing and shrinking while the core instance group's HDFS stays intact.]
Task Node Use Case #1
• Speed up job processing using the Spot market
  • Run task nodes on the Spot market
  • Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster
Task Node Use Case #2
• When you need extra horsepower for a short amount of time
• Example: need to pull a large amount of data from Amazon S3
Example:
[Diagrams: HS1 core nodes with 48 TB of HDFS each; m1.xlarge Spot task nodes are added to load data from Amazon S3 in parallel, then removed after the data load completes.]
Pattern #3: Amazon S3 as HDFS
Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: EMR cluster (core and task instance groups) reading and writing directly to Amazon S3.]
Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster. HUGE benefit!
• Use Amazon S3 as your durable storage: 11 9s of durability
Benefits: Amazon S3 as HDFS
• No need to scale HDFS
  • Capacity
  • Replication for durability
• Amazon S3 scales with your data, both in IOPS and storage
Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
  • Hard to do with HDFS
[Diagram: two EMR clusters sharing the same data through Amazon S3.]
Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
  • Amazon S3 server-side encryption
  • Amazon S3 lifecycle policies
  • Amazon S3 versioning to protect against corruption
• Build elastic clusters
  • Add nodes to read from Amazon S3
  • Remove nodes with data safe on Amazon S3
What About Data Locality?
• Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory-bound, data locality doesn't make a difference
Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
– If you’re processing the same dataset more than once

• Disk I/O intensive workloads
Pattern #4: Amazon S3 & HDFS
Amazon S3 & HDFS
1. Data persists on Amazon S3
2. Launch Amazon EMR and copy data to HDFS with S3DistCp
3. Start processing data on HDFS
Benefits: Amazon S3 & HDFS
• Better pattern for I/O-intensive workloads
• The Amazon S3 benefits discussed previously still apply:
  • Durability
  • Scalability
  • Cost
  • Features: lifecycle policies, security
Pattern #5: Elastic Clusters
Amazon EMR Elastic Cluster (manual)
1. Start cluster with a certain number of nodes
2. Monitor your cluster with Amazon CloudWatch metrics:
   • Map Tasks Running
   • Map Tasks Remaining
   • Cluster Idle?
   • Avg. Jobs Failed
3. Increase the number of nodes as you need more capacity by manually calling the API
Amazon EMR Elastic Cluster (automatic)
1. Start your cluster with a certain number of nodes
2. Monitor cluster capacity with Amazon CloudWatch metrics:
   • Map Tasks Running
   • Map Tasks Remaining
   • Cluster Idle?
   • Avg. Jobs Failed
3. Get an HTTP Amazon SNS notification to a simple app deployed on Elastic Beanstalk
4. Your app calls the API to add nodes to your cluster
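A sketch of the sizing logic such an app might run when the notification arrives. The function, its thresholds, and the placeholder cluster/instance-group IDs are assumptions, not part of the deck; the resize API call is shown only as a comment:

```python
import math

def target_task_nodes(map_tasks_remaining, mappers_per_node,
                      current_nodes, max_nodes=20):
    """Size the task instance group so queued map tasks fit in one wave."""
    needed = math.ceil(map_tasks_remaining / mappers_per_node)
    return min(max_nodes, max(current_nodes, needed))

# The app could then resize the task group, e.g. with boto3 (not run here):
# import boto3
# boto3.client("emr").modify_instance_groups(
#     ClusterId="j-XXXXXXXXXXXXX",
#     InstanceGroups=[{"InstanceGroupId": "ig-XXXXXXXX",
#                      "InstanceCount": target_task_nodes(320, 8, 10)}])
```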
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Amazon EMR Nodes and Size
• Use m1.small, m1.large, or c1.medium for functional testing

• Use m1.xlarge and larger nodes for production workloads
Amazon EMR Nodes and Size
• Use CC2 for memory- and CPU-intensive jobs

• Use CC2/c1.xlarge for CPU-intensive jobs

• HS1 instances for HDFS workloads
Amazon EMR Nodes and Size
• Hi1 and HS1 instances for disk I/O-intensive workloads
• CC2 instances are more cost effective than m2.4xlarge
• Prefer a smaller cluster of larger nodes over a larger cluster of smaller nodes
Holy Grail Question
How many nodes do I need?
• Depends on how much data you have

• And how fast you'd like your data to be processed
Introduction to Hadoop Splits
Before we get into Amazon EMR capacity planning, we need to understand how Hadoop splits data.
Introduction to Hadoop Splits
• Data gets broken up into splits (64 MB or 128 MB)
[Diagram: a data file divided into 128 MB splits.]
Introduction to Hadoop Splits
• Splits get packaged into mappers
• Mappers get assigned to nodes for processing
Introduction to Hadoop Splits
• More data = more splits = more mappers
• More mappers than cluster mapper capacity = mappers wait in a queue = processing delay
• More nodes = smaller queue = faster processing
Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into multiple splits.
Example: a 128 MB file = 2 splits at a 64 MB split size
Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compression (e.g., BZIP2): same logic as uncompressed files. A 128 MB BZIP2 file = 2 splits at a 64 MB split size.
Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compression (e.g., GZIP): the entire file is a single split, regardless of size (a 128 MB GZ file = 1 split).

If data files have unsplittable compression:
# of splits = number of files
Example: 10 GZ files = 10 mappers
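The split-counting rules above can be expressed directly (a sketch; `count_splits` and its parameter names are illustrative, not a Hadoop API):

```python
import math

def count_splits(file_sizes_mb, split_size_mb=64, splittable=True):
    """Splittable data: ceil(size / split size) per file.
    Unsplittable compression (e.g., GZIP): one split per file."""
    if not splittable:
        return len(file_sizes_mb)
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)

count_splits([128])                         # 2 splits at a 64 MB split size
count_splits([128] * 10, splittable=False)  # 10 GZ files -> 10 mappers
```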
Cluster Sizing Calculation

Just tell me how many nodes I
need for my job!!
Cluster Sizing Calculation

1. Estimate the number of mappers your job
requires.
Cluster Sizing Calculation

2. Pick an instance and note down the number of
mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
Cluster Sizing Calculation

3. We need to pick some sample data files to run
a test workload. The number of sample files
should be the same number from step #2.
Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers * Time To Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job
requires

150
2. Pick an instance and note down the number of
mappers it can run in parallel

m1.xlarge with 8 mapper capacity
per instance
Example: Cluster Sizing Calculation
3. We need to pick some sample data files to run a
test workload. The number of sample files should
be the same number from step #2.

8 files selected for our sample test
Example: Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.

3 min to process 8 files
Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers For Your Job * Time To Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time)
= (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge instances
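The sizing formula and the worked example translate directly into code (a sketch; the function and argument names are mine):

```python
def estimate_nodes(total_mappers, sample_time_min,
                   mappers_per_instance, desired_time_min):
    """Estimated nodes = (total mappers * sample processing time)
    / (per-instance mapper capacity * desired processing time)."""
    return (total_mappers * sample_time_min) / (mappers_per_instance * desired_time_min)

estimate_nodes(150, 3, 8, 5)  # 11.25 -> about 11 m1.xlarge instances
```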
File Size Best Practices
• Avoid small files at all costs (anything smaller than 100 MB)
• Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
File Size Best Practices
Mappers take 2 seconds to spawn and be ready for processing.

10 TB in 100 MB files = 100,000 mappers * 2 sec ≈ 55 hours of mapper JVM setup time
10 TB in 1000 MB files = 10,000 mappers * 2 sec ≈ 5.5 hours of mapper JVM setup time
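The setup-overhead arithmetic above, for checking your own file sizes (a sketch; assumes decimal units, 1 TB = 1,000,000 MB):

```python
def mapper_setup_hours(total_tb, file_size_mb, spawn_sec=2.0):
    """Total JVM spawn overhead across all mappers, in hours."""
    mappers = total_tb * 1_000_000 / file_size_mb
    return mappers * spawn_sec / 3600

mapper_setup_hours(10, 100)   # 100,000 mappers -> ~55.6 hours
mapper_setup_hours(10, 1000)  # 10,000 mappers  -> ~5.6 hours
```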
File Size on Amazon S3: Best Practices
• What's the best Amazon S3 file size for Hadoop? About 1-2 GB. Why?
• The life of a mapper should not be less than 60 seconds
• A single mapper can get 10-15 MB/s of throughput to Amazon S3

60 sec * 15 MB/s ≈ 1 GB
Holy Grail Question
What if I have small file issues?
Dealing with Small Files
• Use S3DistCP to combine smaller files together

• S3DistCP takes a pattern and target file to
combine smaller input files to larger ones
Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
--targetSize,128'
Compressions
• Compress as much as you can
• Compress Amazon S3 input data files

– Reduces cost
– Speed up Amazon S3->mapper data transfer
time
Compressions
• Always Compress Data Files On Amazon S3
• Reduces Storage Cost
• Reduces Bandwidth Between Amazon S3
and Amazon EMR
• Speeds Up Your Job
Compressions
• Compress Mappers and Reducer Output
• Reduces Disk IO
Compressions
• Compression Types:
– Some are fast BUT offer less space reduction
– Some are space efficient BUT Slower
– Some are splittable and some are not
Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
GZIP      | 13%               | 21 MB/s        | 118 MB/s
LZO       | 20%               | 135 MB/s       | 410 MB/s
Snappy    | 22%               | 172 MB/s       | 409 MB/s
Compressions
• If You Are Time Sensitive, Faster Compressions
Are A Better Choice

• If You Have Large Amount Of Data, Use Space
Efficient Compressions

• If You Don’t Care, Pick GZIP
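To make the tradeoff concrete: the time to encode 1 GB at each codec's throughput from the table above (a sketch; the helper name is mine):

```python
def encode_seconds(size_mb, encode_mb_per_s):
    """Seconds to compress `size_mb` of data at a codec's encoding throughput."""
    return size_mb / encode_mb_per_s

round(encode_seconds(1024, 21))   # GZIP:   ~49 s, but smallest output (13%)
round(encode_seconds(1024, 172))  # Snappy:  ~6 s, larger output (22%)
```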
Change Compression Type
• You May Decide To Change Compression Type
• Use S3DistCP to change the compression types of
your files
• Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--outputCodec,lzo'
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Architecting for cost
• AWS pricing models:
  – On-Demand: pay-as-you-go
  – Spot: marketplace; bid for instances and get a discount
  – Reserved Instances: upfront payment (for 1 or 3 years) in exchange for a lower overall monthly cost
Reserved Instances use cases
• For alive and long-running clusters
• For ad-hoc but predictable workloads, use medium-utilization Reserved Instances
• For unpredictable workloads, use Spot or On-Demand pricing
Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
Adv. Optimizations (Stage 1)
• The best optimization is to structure your data
(i.e., smart data partitioning)

• Efficient data structuring = limit the amount of data being processed by Hadoop = faster jobs
Adv. Optimizations (Stage 1)
• Hadoop is a batch processing framework
• Data processing time = an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit:
  – Twitter Storm
  – Spark
  – Amazon Redshift, etc.
Adv. Optimizations (Stage 1)
• Amazon EMR team has done a great deal of
optimization already

• For smaller clusters, Amazon EMR configuration
optimization won’t buy you much
– Remember you’re paying for the full hour
cost of an instance
Adv. Optimizations (Stage 1)

Best Optimization??
Adv. Optimizations (Stage 1)

Add more nodes
Adv. Optimizations (Stage 2)
• Monitor your cluster
using Ganglia

• Amazon EMR has a Ganglia bootstrap action
Adv. Optimizations (Stage 2)
• Monitor and look for bottlenecks
– Memory
– CPU
– Disk IO
– Network IO
Adv. Optimizations
Run Job → Find Bottlenecks (Ganglia: CPU, Disk, Memory) → Address Bottleneck (fine tune, change algorithm)
• Most important metric to watch for if using
Amazon S3 for storage

• Goal: Drive as much network IO as possible
from a single instance
Network IO
• Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps
• Optimize to get more out of your instance throughput
  – Add more mappers?
Network IO
• If you’re using Amazon S3 with Amazon
EMR, monitor Ganglia and watch network
throughput.

• Your goal is to maximize your NIC
throughput by having enough mappers
per node
Network IO, Example
Low network utilization
Increase number of mappers if possible to drive
more traffic
CPU
• Watch for CPU utilization of your clusters

• If >50% idle, increase # of mapper/reducer
per instance
– Reduces the number of nodes and reduces
cost
Example: Adv. Optimizations (Stage 2)
What potential optimization do you see in this graph?
40% CPU idle. Maybe add more mappers?
Disk IO
• Limit the amount of disk IO
• Can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor the cluster and pay attention to the HDFS bytes written metric
• One thing to pay attention to is mapper/reducer disk spill
Disk IO
• Each mapper has an in-memory buffer
• When the buffer gets full, data spills to disk
Disk IO
• If you see mapper/reducer excessive spill to disk,
increase buffer memory per mapper

• Excessive spill when ratio of
“MAPPER_SPILLED_RECORDS” and
“MAPPER_OUTPUT_RECORDS” is more than 1
Disk IO
Example:

MAPPER_SPILLED_RECORDS = 221200123
MAPPER_OUTPUT_RECORDS = 101200123
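The excessive-spill check on the counters above (a sketch; the function name is mine):

```python
def excessive_spill(spilled_records, output_records):
    """Ratio > 1 means records were spilled to disk more than once on average."""
    return spilled_records / output_records > 1

excessive_spill(221_200_123, 101_200_123)  # True: ratio ~2.19 -> raise io.sort.mb
```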
Disk IO
• Increase mapper buffer memory by increasing "io.sort.mb":

<property><name>io.sort.mb</name><value>200</value></property>

• The same logic applies to reducers
Disk IO
• Monitor disk IO using Ganglia

• Look out for disk IO wait
Remember!
Run Job → Find Bottlenecks (Ganglia: CPU, Disk, Memory) → Address Bottleneck (fine tune, change algorithm)
Please give us your feedback on this presentation: BDT404
As a thank you, we will select prize winners daily for completed surveys!

How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
EMR Training
EMR TrainingEMR Training
EMR Training
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data Workloads
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
AWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explained
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Dernier (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013

  • 1. Amazon Elastic MapReduce: Deep Dive and Best Practices Parviz Deyhim November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 3. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 4. Hadoop-as-a-service Map-Reduce engine Integrated with tools What is EMR? Massively parallel Integrated to AWS services Cost effective AWS wrapper
  • 7. Data management Analytics languages HDFS Amazon EMR Amazon S3 Amazon DynamoDB
  • 8. Data management Analytics languages HDFS Amazon EMR Amazon S3 Amazon DynamoDB Amazon RDS
  • 9. Data management Analytics languages HDFS Amazon EMR Amazon Redshift Amazon RDS AWS Data Pipeline Amazon S3 Amazon DynamoDB
  • 10. Amazon EMR Introduction • Launch clusters of any size in a matter of minutes • Use variety of different instance sizes that match your workload
  • 11. Amazon EMR Introduction • Don’t get stuck with hardware • Don’t deal with capacity planning • Run multiple clusters with different sizes, specs and node types
  • 12. Amazon EMR Introduction • Integration with Spot market • 70-80% discount
  • 13. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 14. Amazon EMR Design Patterns Pattern #1: Transient vs. Alive Clusters Pattern #2: Core Nodes and Task Nodes Pattern #3: Amazon S3 as HDFS Pattern #4: Amazon S3 & HDFS Pattern #5: Elastic Clusters
  • 15. Pattern #1: Transient vs. Alive Clusters
  • 16. Pattern #1: Transient Clusters • Cluster lives for the duration of the job • Shut down the cluster when the job is done • Data persists on Amazon S3 • Input & output data on Amazon S3
  • 17. Benefits of Transient Clusters 1. Control your cost 2. Minimum maintenance • Cluster goes away when job is done 3. Practice cloud architecture • Pay for what you use • Data processing as a workflow
  • 18. When to use a Transient cluster? If (Data Load Time + Processing Time) * Number Of Jobs < 24 hours: Use Transient Clusters; Else: Use Alive Clusters
  • 19. When to use a Transient cluster? (20 min data load + 1 hour processing time) * 10 jobs ≈ 13 hours < 24 hours → Use Transient Clusters
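  The transient-vs-alive rule of thumb above can be captured in a small helper. This is an illustrative sketch, not part of EMR; the 24-hour threshold is the one from the slides.

  ```python
  def recommend_cluster_type(load_hours, processing_hours, jobs_per_day):
      """Rule of thumb from the slides: if the total daily workload
      (load + processing time, summed over all jobs) is under 24 hours,
      a transient cluster is cheaper; otherwise keep the cluster alive."""
      total_hours = (load_hours + processing_hours) * jobs_per_day
      return "transient" if total_hours < 24 else "alive"

  # (20 min load + 1 h processing) * 10 jobs ~= 13 h  -> transient
  print(recommend_cluster_type(20 / 60, 1.0, 10))
  # (20 min load + 1 h processing) * 20 jobs ~= 27 h  -> alive
  print(recommend_cluster_type(20 / 60, 1.0, 20))
  ```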
  • 20. Alive Clusters • Very similar to traditional Hadoop deployments • Cluster stays around after the job is done • Data persistence model: • Amazon S3 • Amazon S3 Copy To HDFS • HDFS and Amazon S3 as backup
  • 21. Alive Clusters • Always keep data safe on Amazon S3 even if you’re using HDFS for primary storage • Get in the habit of shutting down your cluster and start a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow managements such as AWS Data Pipeline
  • 22. Benefits of Alive Clusters • Ability to share data between multiple jobs Transient cluster Long running clusters EMR EMR Amazon S3 Amazon S3 EMR HDFS HDFS
  • 23. Benefit of Alive Clusters • Cost effective for repetitive jobs pm EMR pm pm EMR EMR EMR EMR pm
  • 24. When to use an Alive cluster? If (Data Load Time + Processing Time) * Number Of Jobs > 24 hours: Use Alive Clusters; Else: Use Transient Clusters
  • 25. When to use an Alive cluster? (20 min data load + 1 hour processing time) * 20 jobs ≈ 27 hours > 24 hours → Use Alive Clusters
  • 26. Pattern #2: Core & Task nodes
  • 27. Core Nodes Amazon EMR cluster Master instance group Run TaskTrackers (Compute) Run DataNode (HDFS) HDFS HDFS Core instance group
  • 28. Core Nodes Amazon EMR cluster Master instance group Can add core nodes HDFS HDFS Core instance group
  • 29. Core Nodes Amazon EMR cluster Can add core nodes Master instance group More HDFS space More CPU/mem HDFS HDFS Core instance group HDFS
  • 30. Core Nodes Amazon EMR cluster Can’t remove core nodes because of HDFS Master instance group HDFS HDFS Core instance group HDFS
  • 31. Amazon EMR Task Nodes Amazon EMR cluster Run TaskTrackers Master instance group No HDFS Reads from core node HDFS HDFS HDFS Core instance group Task instance group
  • 32. Amazon EMR Task Nodes Amazon EMR cluster Can add task nodes Master instance group HDFS HDFS Core instance group Task instance group
  • 33. Amazon EMR Task Nodes Amazon EMR cluster More CPU power Master instance group More memory HDFS HDFS Core instance group Task instance group
  • 34. Amazon EMR Task Nodes You can remove task nodes Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group
  • 35. Amazon EMR Task Nodes You can remove task nodes Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group
  • 36. Tasknode Use-Case #1 • Speed up job processing using Spot market • Run task nodes on Spot market • Get discount on hourly price • Nodes can come and go without interruption to your cluster
  • 37. Tasknode Use-Case #2 • When you need extra horse power for a short amount of time • Example: Need to pull large amount of data from Amazon S3
  • 39. Add Spot task nodes to load data from Amazon S3 Example: HS1 48TB HDFS m1.xl m1.xl m1.xl HS1 48TB HDFS m1.xl Amazon S3 m1.xl m1.xl
  • 40. Example: HS1 48TB HDFS Remove after data load from Amazon S3 m1.xl m1.xl m1.xl HS1 48TB HDFS m1.xl m1.xl m1.xl Amazon S3
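  The add-then-remove flow in this example can be scripted against the EMR API. The commands below are a hedged sketch using the AWS CLI; the cluster id, instance-group id, and Spot bid price are placeholders, and the same AddInstanceGroups / ModifyInstanceGroups calls are available from any AWS SDK.

  ```shell
  # Add a group of Spot task nodes to speed up the S3 data load
  aws emr add-instance-groups \
      --cluster-id j-XXXXXXXXXXXXX \
      --instance-groups InstanceGroupType=TASK,InstanceType=m1.xlarge,InstanceCount=4,BidPrice=0.10,Name=spot-loaders

  # Once the load is done, shrink the task group back to zero nodes
  aws emr modify-instance-groups \
      --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=0
  ```

  Because task nodes hold no HDFS blocks, shrinking the group to zero loses no data.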
  • 41. Pattern #3: Amazon S3 as HDFS
  • 42. Amazon S3 as HDFS Amazon EMR cluster • Use Amazon S3 as your permanent data store • HDFS for temporary storage of data between jobs • No additional step to copy data to HDFS HDFS HDFS Core instance group Task instance group Amazon S3
  • 43. Benefits: Amazon S3 as HDFS • Ability to shut down your cluster HUGE Benefit!! • Use Amazon S3 as your durable storage 11 9s of durability
  • 44. Benefits: Amazon S3 as HDFS • No need to scale HDFS • Capacity • Replication for durability • Amazon S3 scales with your data • Both in IOPs and data storage
  • 45. Benefits: Amazon S3 as HDFS • Ability to share data between multiple clusters • Hard to do with HDFS EMR EMR Amazon S3
  • 46. Benefits: Amazon S3 as HDFS • Take advantage of Amazon S3 features • Amazon S3 ServerSideEncryption • Amazon S3 LifeCyclePolicy • Amazon S3 versioning to protect against corruption • Build elastic clusters • Add nodes to read from Amazon S3 • Remove nodes with data safe on Amazon S3
  • 47. What About Data Locality? • Run your job in the same region as your Amazon S3 bucket • Amazon EMR nodes have high-speed connectivity to Amazon S3 • If your job is CPU/memory-bound, data locality doesn't make a difference
  • 48. Anti-Pattern: Amazon S3 as HDFS • Iterative workloads – If you’re processing the same dataset more than once • Disk I/O intensive workloads
  • 49. Pattern #4: Amazon S3 & HDFS
  • 50. Amazon S3 & HDFS 1. Data persists on Amazon S3
  • 51. 2. Launch Amazon EMR and copy data to HDFS with S3DistCp Amazon S3 & HDFS
  • 52. 3. Start processing data on HDFS S3DistCp Amazon S3 & HDFS
  • 53. Benefits: Amazon S3 & HDFS • Better pattern for I/O-intensive workloads • Amazon S3 benefits discussed previously applies • Durability • Scalability • Cost • Features: lifecycle policy, security
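  Step 2 of this pattern (copying from Amazon S3 into HDFS) is typically an S3DistCp step. Below is a hedged sketch of the invocation; the jar path is the one used on EMR AMIs of this era and may differ on newer releases, and the bucket names and regex are placeholders.

  ```shell
  # Bulk-copy input data from S3 into HDFS before an I/O-intensive job,
  # grouping small files into ~128 MB targets so each mapper gets a
  # full-sized split
  hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3://my-bucket/input/ \
      --dest hdfs:///data/input/ \
      --groupBy '.*/(part-.*)\.gz' \
      --targetSize 128
  ```

  Note that `--targetSize` only takes effect when files are combined via `--groupBy`.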
  • 55. Amazon EMR Elastic Cluster (m) 1. Start cluster with certain number of nodes
  • 56. Amazon EMR Elastic Cluster (m) 2. Monitor your cluster with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
  • 57. Amazon EMR Elastic Cluster (m) 3. Increase the number of nodes as you need more capacity by manually calling the API
  • 58. Amazon EMR Elastic Cluster (a) 1. Start your cluster with certain number of nodes
  • 59. Amazon EMR Elastic Cluster (a) 2. Monitor cluster capacity with Amazon CloudWatch metrics • Map Tasks Running • Map Tasks Remaining • Cluster Idle? • Avg. Jobs Failed
  • 60. Amazon EMR Elastic Cluster (a) 3. Get HTTP Amazon SNS notification to a simple app deployed on Elastic Beanstalk
  • 61. Amazon EMR Elastic Cluster (a) 4. Your app calls the API to add nodes to your cluster API
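  The monitor-and-resize loop in steps 1-4 can be sketched in Python. The scaling rule below (provision one extra wave for the queued map tasks, fed by the "Map Tasks Remaining" metric from the slide) is an illustrative assumption, not an EMR feature; the commented boto3 call shows where the real resize API would fit but requires a live cluster and credentials.

  ```python
  def extra_nodes_needed(maps_remaining, maps_running, mappers_per_node=8):
      """How many task nodes to add so the queued map tasks could run
      in a single additional wave. Illustrative policy only."""
      backlog = max(maps_remaining - maps_running, 0)
      return -(-backlog // mappers_per_node)  # ceiling division

  # e.g. 120 tasks queued, 40 running, m1.xlarge = 8 slots -> add 10 nodes
  print(extra_nodes_needed(120, 40))

  # The resize itself would call the EMR API, e.g. with boto3:
  #   emr = boto3.client("emr")
  #   emr.modify_instance_groups(InstanceGroups=[{
  #       "InstanceGroupId": "ig-XXXXXXXXXXXXX",
  #       "InstanceCount": current_count + extra_nodes_needed(120, 40)}])
  ```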
  • 62. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
  • 63. Amazon EMR Nodes and Size • Use m1.small, m1.large, c1.medium for functional testing • Use m1.xlarge and larger nodes for production workloads
  • 64. Amazon EMR Nodes and Size • Use CC2 for memory- and CPU-intensive jobs • Use CC2/C1.xlarge for CPU-intensive jobs • HS1 instances for HDFS workloads
  • 65. Amazon EMR Nodes and Size • HI1 and HS1 instances for disk I/O-intensive workloads • CC2 instances are more cost-effective than M2.4xlarge • Prefer a smaller cluster of larger nodes over a larger cluster of smaller nodes
  • 66. Holy Grail Question How many nodes do I need?
  • 67. Introduction to Hadoop Splits • Depends on how much data you have • And how fast you like your data to be processed
  • 68. Introduction to Hadoop Splits Before we understand Amazon EMR capacity planning, we need to understand Hadoop’s inner working of splits
  • 69. Introduction to Hadoop Splits • Data gets broken up into splits (64 MB or 128 MB) Data 128 MB Splits
  • 70. Introduction to Hadoop Splits • Splits get packaged into mappers Data Splits Mappers
  • 71. Introduction to Hadoop Splits • Mappers get assigned to Mappers nodes for processing Instances
  • 72. Introduction to Hadoop Splits • More data = More splits = More mappers
  • 73. Introduction to Hadoop Splits • More data = More splits = More mappers Queue
  • 74. Introduction to Hadoop Splits • Data mappers > cluster mapper capacity = mappers wait for capacity = processing delay Queue
  • 75. Introduction to Hadoop Splits • More nodes = reduced queue size = faster processing Queue
  • 76. Calculating the Number of Splits for Your Job Uncompressed files: Hadoop splits a single file into multiple splits. Example: 128 MB = 2 splits based on a 64 MB split size
  • 77. Calculating the Number of Splits for Your Job Compressed files: 1. Splittable compressions: same logic as uncompressed files 64MB BZIP 128MB BZIP
  • 78. Calculating the Number of Splits for Your Job Compressed files: 2. Unsplittable compressions: the entire file is a single split. 128MB GZ 128MB GZ
  • 79. Calculating the Number of Splits for Your Job Number of splits If data files have unsplittable compression # of splits = number of files Example: 10 GZ files = 10 mappers
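  The split-counting rules above can be expressed directly. This is a sketch: 64 MB is the classic default split size, and gzip stands in for any unsplittable codec.

  ```python
  import math

  def count_splits(file_size_mb, codec=None, split_size_mb=64):
      """Estimate the input splits (and hence mappers) for one file."""
      if codec == "gzip":  # unsplittable: the whole file is one split
          return 1
      # uncompressed, or a splittable codec such as bzip2
      return math.ceil(file_size_mb / split_size_mb)

  print(count_splits(128))           # uncompressed 128 MB -> 2 splits
  print(count_splits(128, "bzip2"))  # splittable: same logic -> 2
  print(count_splits(128, "gzip"))   # unsplittable -> 1
  # 10 GZ files -> 10 mappers
  print(sum(count_splits(128, "gzip") for _ in range(10)))
  ```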
  • 80. Cluster Sizing Calculation Just tell me how many nodes I need for my job!!
  • 81. Cluster Sizing Calculation 1. Estimate the number of mappers your job requires.
  • 82. Cluster Sizing Calculation 2. Pick an instance and note down the number of mappers it can run in parallel M1.xlarge = 8 mappers in parallel
  • 83. Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
  • 84. Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
• 85. Cluster Sizing Calculation Estimated Number of Nodes = (Total Mappers * Time to Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
  • 86. Example: Cluster Sizing Calculation 1. Estimate the number of mappers your job requires 150 2. Pick an instance and note down the number of mappers it can run in parallel m1.xlarge with 8 mapper capacity per instance
  • 87. Example: Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 88. Example: Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
• 89. Cluster Sizing Calculation Estimated number of nodes = (Total Mappers for Your Job * Time to Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time) = (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
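The sizing formula above is easy to script. A minimal sketch (the function name is ours, not an EMR tool) using the example numbers from the slides:

```python
# Cluster-sizing formula from the slides; names are illustrative.
def estimated_nodes(total_mappers, sample_minutes,
                    mappers_per_node, desired_minutes):
    """(Total Mappers * Sample Time) / (Mapper Capacity * Desired Time)."""
    return (total_mappers * sample_minutes) / (mappers_per_node * desired_minutes)

# 150 mappers total; one m1.xlarge (8 mapper slots) processed the
# 8 sample files in 3 min; we want the whole job done in 5 min:
n = estimated_nodes(150, 3, 8, 5)
print(n)  # 11.25 -> roughly 11 m1.xlarge core nodes
```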
  • 90. File Size Best Practices • Avoid small files at all costs • Anything smaller than 100MB • Each mapper is a single JVM • CPU time required to spawn JVMs/mappers
• 91. File Size Best Practices Mappers take 2 sec to spawn and be ready for processing 10 TB in 100 MB files = 100,000 mappers * 2 sec ≈ 55 hours of mapper CPU setup time
• 92. File Size Best Practices Mappers take 2 sec to spawn and be ready for processing 10 TB in 1,000 MB files = 10,000 mappers * 2 sec ≈ 5.5 hours of mapper CPU setup time
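The back-of-envelope arithmetic behind these two slides can be checked with a few lines (illustrative names; assumes one mapper per unsplittable file and the 2-second spawn cost quoted above):

```python
# JVM spawn overhead estimate from the slides: 2 s start-up per mapper,
# one mapper per (unsplittable) file. Names are illustrative.
def spawn_overhead_hours(total_data_mb, file_size_mb, spawn_sec=2):
    mappers = total_data_mb / file_size_mb
    return mappers * spawn_sec / 3600

# 10 TB (~10,000,000 MB) of input:
print(round(spawn_overhead_hours(10_000_000, 100), 1))   # 55.6 h with 100 MB files
print(round(spawn_overhead_hours(10_000_000, 1000), 1))  # 5.6 h with 1,000 MB files
```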
  • 93. File Size on Amazon S3: Best Practices • What’s the best Amazon S3 file size for Hadoop? About 1-2GB • Why?
• 94. File Size on Amazon S3: Best Practices • Life of a mapper should not be less than 60 sec • A single mapper can get 10-15 MB/s throughput to Amazon S3 60 sec * 15 MB/s = 900 MB ≈ 1 GB
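The rule of thumb above is just the product of the two numbers (variable names are ours):

```python
# Target S3 object size: a mapper should live at least 60 s, and a single
# mapper pulls roughly 10-15 MB/s from S3. Names are illustrative.
min_mapper_sec = 60
s3_throughput_mb_s = 15

target_mb = min_mapper_sec * s3_throughput_mb_s
print(target_mb)  # 900 MB, i.e. roughly 1 GB per file
```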
  • 95. Holy Grail Question What if I have small file issues?
  • 96. Dealing with Small Files • Use S3DistCP to combine smaller files together • S3DistCP takes a pattern and target file to combine smaller input files to larger ones
• 97. Dealing with Small Files Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf, --dest,hdfs:///local, --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*, --targetSize,128'
• 98. Compressions • Compress as much as you can • Compress Amazon S3 input data files – Reduces cost – Speeds up Amazon S3 -> mapper data transfer time
  • 99. Compressions • Always Compress Data Files On Amazon S3 • Reduces Storage Cost • Reduces Bandwidth Between Amazon S3 and Amazon EMR • Speeds Up Your Job
  • 100. Compressions • Compress Mappers and Reducer Output • Reduces Disk IO
• 101. Compressions • Compression types: – Some are fast but offer less space reduction – Some are space efficient but slower – Some are splittable and some are not
Algorithm  % Space Remaining  Encoding Speed  Decoding Speed
GZIP       13%                21 MB/s         118 MB/s
LZO        20%                135 MB/s        410 MB/s
Snappy     22%                172 MB/s        409 MB/s
• 102. Compressions • If you are time sensitive, faster compressions are a better choice • If you have a large amount of data, use space-efficient compressions • If you don't care, pick GZIP
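To make the storage side of the tradeoff concrete, here is a small sketch (illustrative names; the "% space remaining" figures come from the table two slides back) of what each codec leaves on S3 for 10 TB of raw input:

```python
# Storage left on S3 for 10 TB of input under each codec,
# using the "% space remaining" figures from the table above.
space_remaining = {"gzip": 0.13, "lzo": 0.20, "snappy": 0.22}

raw_tb = 10
for codec, frac in space_remaining.items():
    print(f"{codec}: {raw_tb} TB -> {raw_tb * frac:.1f} TB on S3")
# gzip stores the least (1.3 TB) but encodes/decodes slowest;
# snappy stores the most (2.2 TB) but is the fastest of the three.
```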
• 103. Change Compression Type • You may decide to change compression type • Use S3DistCP to change the compression type of your files • Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf, --dest,hdfs:///local, --outputCodec,lzo'
• 104. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
• 105. Architecting for cost • AWS pricing models: – On-demand: pay-as-you-go model – Spot: marketplace; bid for instances and get a discount – Reserved Instances: upfront payment (for 1 or 3 years) for a reduction in overall monthly payment
• 106. Reserved Instances use-case For long-running, always-on clusters
  • 107. Reserved Instances use-case For ad-hoc and unpredictable workloads, use medium utilization
  • 108. Reserved Instances use-case For unpredictable workloads, use Spot or on-demand pricing
• 109. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Controlling Cost with Amazon EMR Advanced Optimizations
• 110. Adv. Optimizations (Stage 1) • The best optimization is to structure your data (i.e., smart data partitioning) • Efficient data structuring = limiting the amount of data Hadoop has to process = faster jobs
  • 111. Adv. Optimizations (Stage 1) • Hadoop is a batch processing framework • Data processing time = an hour to days • Not a great use-case for shorter jobs • Other frameworks may be a better fit – Twitter Storm – Spark – Amazon Redshift, etc.
  • 112. Adv. Optimizations (Stage 1) • Amazon EMR team has done a great deal of optimization already • For smaller clusters, Amazon EMR configuration optimization won’t buy you much – Remember you’re paying for the full hour cost of an instance
  • 113. Adv. Optimizations (Stage 1) Best Optimization??
  • 114. Adv. Optimizations (Stage 1) Add more nodes
  • 115. Adv. Optimizations (Stage 2) • Monitor your cluster using Ganglia • Amazon EMR has Ganglia bootstrap action
  • 116. Adv. Optimizations (Stage 2) • Monitor and look for bottlenecks – Memory – CPU – Disk IO – Network IO
  • 120. Network IO • Most important metric to watch for if using Amazon S3 for storage • Goal: Drive as much network IO as possible from a single instance
• 121. Network IO • Larger instances can drive > 600 Mbps • Cluster Compute instances can drive 1-2 Gbps • Optimize to get more out of your instance throughput – Add more mappers?
  • 122. Network IO • If you’re using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput. • Your goal is to maximize your NIC throughput by having enough mappers per node
• 123. Network IO, Example Low network utilization? Increase the number of mappers, if possible, to drive more traffic
  • 124. CPU • Watch for CPU utilization of your clusters • If >50% idle, increase # of mapper/reducer per instance – Reduces the number of nodes and reduces cost
  • 125. Example Adv. Optimizations (Stage 2) What potential optimization do you see in this graph?
  • 126. Example Adv. Optimizations (Stage 2) 40% CPU idle. Maybe add more mappers?
• 127. Disk IO • Limit the amount of disk IO • Can increase mapper/reducer memory • Compress data anywhere you can • Monitor the cluster and pay attention to the HDFS bytes written metric • One place to pay attention to is mapper/reducer disk spill
  • 128. Disk IO • Mapper has in memory buffer mapper mapper memory buffer
  • 129. Disk IO • When memory gets full, data spills to disk mapper mapper memory buffer data spills to disk
• 131. Disk IO • If you see mappers/reducers excessively spilling to disk, increase buffer memory per mapper • Spill is excessive when the ratio of MAPPER_SPILLED_RECORDS to MAPPER_OUTPUT_RECORDS is more than 1
  • 132. Disk IO Example: MAPPER_SPILLED_RECORDS = 221200123 MAPPER_OUTPUT_RECORDS = 101200123
• 133. Disk IO • Increase mapper buffer memory by increasing "io.sort.mb" <property><name>io.sort.mb</name><value>200</value></property> • The same logic applies to reducers
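The spill check from the previous slides can be sketched in a few lines (illustrative only, not a Hadoop API; counter values are the ones from the example above):

```python
# Spill-ratio check from the slides: a spilled/output ratio > 1
# suggests raising the mapper sort buffer (io.sort.mb).
mapper_spilled_records = 221200123
mapper_output_records = 101200123

ratio = mapper_spilled_records / mapper_output_records
print(f"spill ratio: {ratio:.2f}")  # ~2.19 -> excessive spill
if ratio > 1:
    print("excessive spill: consider raising io.sort.mb (e.g., to 200)")
```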
  • 134. Disk IO • Monitor disk IO using Ganglia • Look out for disk IO wait
  • 137. Please give us your feedback on this presentation BDT404 As a thank you, we will select prize winners daily for completed surveys!