© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Pathak, AWS
Scott Donaldson, FINRA
Clayton Kovar, FINRA
October 2015
Amazon EMR Deep Dive & Best Practices
BDT305
What to expect from the session
• Update on the latest Amazon EMR release
• Information on advanced capabilities of Amazon EMR
• Tips for lowering your Amazon EMR costs
• Deep dive into how FINRA uses Amazon EMR and
Amazon S3 as their multi-petabyte data warehouse
Rahul Pathak, Sr. Mgr. Amazon EMR
(@rahulpathak)
October 2015
Amazon EMR
Deep Dive & Best Practices
Amazon EMR
• Managed clusters for Hadoop, Spark, Presto, or any other applications in the Apache/Hadoop stack
• Integrated with the AWS platform via EMRFS – connectors for Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and AWS KMS
• Secure, with support for AWS IAM roles, KMS, S3 client-side encryption, Hadoop transparent encryption, and Amazon VPC; HIPAA-eligible
• Built-in support for resizing clusters, and integration with the Amazon EC2 Spot market to help lower costs
New Features
EMR Release 4.1
• Hadoop KMS with transparent HDFS encryption support
• Spark 1.5, Zeppelin 0.6
• Presto 0.119, Airpal
• Hive, Oozie, Hue 3.7.1
• Simple APIs for launch and configuration
Intelligent Resize
• Incrementally scale up based on available capacity
• Wait for work to complete before resizing down
• Can scale core nodes and HDFS as well as task nodes
Leverage Amazon S3 with
EMR File System (EMRFS)
Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR
clusters with no data loss
• Point multiple Amazon EMR clusters at
the same data in Amazon S3
• Easily evolve your analytic
infrastructure as technology evolves
EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/';
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/';
Consistent view and fast listing using the optional EMRFS metadata layer
• EMRFS metadata is stored in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Number of objects    Without consistent view    With consistent view
1,000,000            147.72                     29.70
100,000              12.70                      3.69

*Tested using a single-node cluster with an m3.xlarge instance.
EMRFS client-side encryption
• EMRFS enabled for Amazon S3 client-side encryption
• Amazon S3 encryption clients
• Key vendor (AWS KMS or your custom key vendor)
• Amazon S3 (client-side encrypted objects)
HDFS is still there if you need it
• Iterative workloads
• If you’re processing the same dataset more than once
• Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from
HDFS for processing
Optimizations for storage
File formats
Row oriented
• Text files
• Sequence files
• Writable object
• Avro data files
• Described by schema
Columnar format
• Optimized Row Columnar (ORC)
• Parquet
[Figure: a logical table shown in row-oriented and column-oriented layouts]
Factors to consider
Processing and query tools
• Hive, Impala and Presto
Evolution of schema
• Avro for schema and Parquet for storage
File format “splittability”
• Avoid JSON/XML Files. Use them as records
Encryption requirements
File sizes
Avoid small files
• Anything smaller than 100MB
Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
Fewer files, matching closely to block size
• fewer calls to S3
• fewer network/HDFS requests
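To make the small-file cost concrete, here is a minimal arithmetic sketch. It assumes roughly one map task per file, which holds when files are smaller than the split size and are not combined by the input format; the 1 TB dataset size is a hypothetical figure chosen for illustration.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def file_count(dataset_bytes: int, file_bytes: int) -> int:
    """Number of files needed to hold dataset_bytes at a given file size."""
    return math.ceil(dataset_bytes / file_bytes)


# Hypothetical 1 TB dataset stored as 1 MB files versus 128 MB files.
small = file_count(1 * TB, 1 * MB)
large = file_count(1 * TB, 128 * MB)

print(f"1 MB files:   {small:,}")
print(f"128 MB files: {large:,}")
print(f"Reduction in files (and, under the assumption above, map tasks): {small // large}x")
```

Every file also costs at least one S3 request to open, so the same ratio applies to request counts.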
Dealing with small files
Reduce HDFS block size, e.g. 1MB (default is 128MB)
• --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
Better: Use S3DistCp to combine smaller files together
• S3DistCp takes a pattern and target path to combine smaller
input files to larger ones
• Supply a target size and compression codec
Compression
Always compress data files on Amazon S3
• Reduces network traffic between Amazon S3 and Amazon EMR
• Speeds up your job
Compress mapper and reducer output
Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2
Choosing the right compression
• Time sensitive, faster compressions are a better choice
• Large amount of data, use space efficient compressions
• Combined Workload, use gzip
Algorithm        Splittable?   Compression ratio   Compress + decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
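The ratio versus speed trade-off in the table can be checked on your own data. The sketch below uses only Python's standard library, so it covers gzip and bzip2 but not LZO or Snappy, which need third-party packages. The sample record is made up for illustration, and results on real data will differ.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Round-trip data through a codec; report compression ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    unpacked = decompress(packed)
    elapsed = time.perf_counter() - start
    if unpacked != data:
        raise ValueError(f"{name} round trip did not reproduce the input")
    return {"codec": name, "ratio": len(data) / len(packed), "seconds": elapsed}


# Hypothetical, highly repetitive sample; substitute a slice of a real data file.
sample = b"2015-10-08,NEW_ORDER,symbol=ABC,firm=12345,qty=100\n" * 20000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for r in results:
    print(f"{r['codec']:6s} ratio={r['ratio']:.1f} time={r['seconds']:.4f}s")
```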
Cost saving tips for Amazon EMR
Use S3 as your persistent data store – query it using
Presto, Hive, Spark, etc.
Only pay for compute when you need it
Use Amazon EC2 Spot instances to save >80%
Use Amazon EC2 Reserved instances for steady
workloads
Scott Donaldson, Senior Director
Clayton Kovar, Principal Architect
EMR & Interactive Analytics
EMR is ubiquitous in our architecture
[Architecture diagram. Components shown: users and data providers; auto-scaled EC2 tiers for the analytics apps, data ingestion services, data catalog & lineage services, and cluster management & workflow services; EMR clusters for normalization ETL, optimization ETL, batch analytics, ad hoc query, and query; data marts (Amazon Redshift); shared metastore (RDS); reference data (RDS); shared data services; S3 for the source of truth and for query-optimized data.]
It starts with the data
S3 is your durable system of record
Separate your compute and storage
Shut down your cluster when not in use
Share data among multiple clusters
Fault tolerance and disaster recovery
Use EMRFS for consistent view
Partition your data for performance
Optimize for your query use cases and access patterns
Larger files >256MB are more efficient
Compact small files into >100MB
FINRA Data Manager orchestrates data between storage and compute clusters
Unified catalog
Manage EMR clusters
Track usage & lineage
Job orchestration
File formats & compression
Text format for archival copies on S3 & Amazon Glacier
Select compression algorithm for best fit
We wanted high compression for archive copy
Select a row or columnar format for performance
Sequence or AVRO
ORC, Parquet, RC File
Columnar Benefits:
Predicate pushdown
Skip unwanted columns
Serve multiple query engines: Hive, Presto, Spark
Avoid bloated formats with repetitive markup (e.g. XML)
Our partition and query strategy
[Diagram: partition and query strategy. Labels: "Data received as:", "Users query by:", Symbol Group 1 … Symbol Group 100, Symbol & Firm Query, Symbol Only Query, Firm Only Query]
On-time data (Processing Date = Event Date): 99.97% of all records are on time
Late data: all late records are scanned for all queries
Example hive table creation
create external table if not exists NEW_ORDERS (…)
partitioned by (EVENT_DT DATE, HASH_PRTN_NB SMALLINT)
stored as orc
location 's3://reinvent/new_orders/'
tblproperties ("orc.compress"="SNAPPY");
alter table NEW_ORDERS add if not exists
partition (event_dt='2015-10-08', hash_prtn_nb=0) location
's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
…
partition (event_dt='2015-10-08', hash_prtn_nb=1000) location
's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=1000/'
;
Each record’s hash partition number is calculated by
((pmod(hash(symbol), 100) * 10) + pmod(firm, 10))
Made hive on EMR/S3 competitive
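The hash partition formula above can be sketched in Python as follows. This is an illustration only: `symbol_hash` is a stand-in for the integer that Hive's `hash(symbol)` would return, and the function names are hypothetical. The formula yields values from 0 to 999, so partition 1000, which appears in the ALTER TABLE statement, lies outside the range it can produce.

```python
def pmod(a: int, n: int) -> int:
    """Positive modulo, mirroring Hive's pmod(); Python's % is already non-negative for n > 0."""
    return a % n


def hash_prtn_nb(symbol_hash: int, firm: int) -> int:
    """Partition number: 100 symbol groups times 10 firm buckets (last digit of the firm id)."""
    return pmod(symbol_hash, 100) * 10 + pmod(firm, 10)


# A negative hash still maps into a valid symbol group.
print(hash_prtn_nb(-7, 12345))
# Smallest and largest values the formula can produce.
print(hash_prtn_nb(0, 0), hash_prtn_nb(99, 9))
```

Because the firm bucket occupies the last digit, a firm-only query needs one partition out of every ten, and a symbol-only query needs ten consecutive partitions.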
Partitions are great, but beware…
select … from NEW_ORDERS where
EVENT_DT between '2015-10-06' and '2015-10-09'
and FIRM = 12345
and (pmod(HASH_PRTN_NB, 10) = pmod(12345, 10)
or HASH_PRTN_NB = 1000) -- 1000 is always read
Using PMOD around hash_prtn_nb prevents Hive from issuing a targeted query against the metastore, so millions of partitions are returned for pruning
Optimized query with enumeration
select … from NEW_ORDERS where
EVENT_DT >= '2015-10-06' and EVENT_DT <= '2015-10-09'
and FIRM = 12345
and (HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15
… or HASH_PRTN_NB = 985 or HASH_PRTN_NB = 995
or HASH_PRTN_NB = 1000) -- 1000 is always read
Using an IN clause was insufficient to avoid the pruning issue
Explicitly enumerating all partitions vastly improved query planning time
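Writing out 101 OR terms by hand is error-prone, so the enumeration is a natural candidate for generation in the client application. The following is a minimal sketch with hypothetical function names, not FINRA's actual code. It assumes the partition layout shown in the deck: the last digit of the partition number is the firm bucket, and partition 1000 is always read.

```python
def partitions_for_firm(firm: int, always_read: int = 1000) -> list[int]:
    """All hash partitions that can hold rows for a firm, plus the always-read partition."""
    firm_bucket = firm % 10
    return [group * 10 + firm_bucket for group in range(100)] + [always_read]


def partition_predicate(firm: int) -> str:
    """Build the explicit OR list that lets the metastore prune partitions directly."""
    terms = [f"HASH_PRTN_NB = {p}" for p in partitions_for_firm(firm)]
    return "(" + " or ".join(terms) + ")"


parts = partitions_for_firm(12345)
print(parts[:3], "...", parts[-3:])
print(partition_predicate(12345)[:60] + " ...")
```

For firm 12345 this produces 5, 15, … 985, 995, and 1000, matching the query above.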
Data security
Encryption is required for all data, both at rest and in transit
S3 server-side encryption was evaluated and determined to be suitable for purpose
Encrypt ephemeral storage on Master, Core, and Task nodes
Use a custom bootstrap action with LUKS with a random, memory only key
Task nodes don’t have HDFS but Mapper and Reducer temporary files need to also be encrypted
Lose the server, lose the data – Remember S3 is our source of truth
Use security groups to ensure only the client applications connect to the Master node
Hive authentication/authorization was not necessary for our usage scenarios
Evaluating transparent encryption (Hadoop 2.6+) in HDFS
Selection of the fittest
HDFS was cost prohibitive for our use cases
Need 30 D2.8XL’s just to store two of our tables: ~$1.5M/yr on HDFS vs ~$120K/yr on S3
Need 90 D2.8XL’s to store all queryable data: ~$4.5M/yr on HDFS vs. $360K/yr on S3
Data locality is desirable but not practical for our scale
EMR & S3 with partitioned data is a great fit
Tuned queries & data structures on S3 take ~2X as long as they would on HDFS under perfect locality conditions
Localize data into HDFS on Core nodes using S3DistCp if making 3 or more passes
Consider tiered storage
External tables in Hive can have a blend of some partitions in HDFS and others in S3
Introduces operational complexity for partition maintenance
Doesn’t play well with shared metastore for multiple clusters
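The HDFS versus S3 cost figures quoted above are internally consistent, as a quick check shows. The sketch below uses only the numbers from the slide; the variable names are illustrative, and the per-node cost is derived rather than quoted.

```python
# Figures quoted on the slide (USD per year).
hdfs_nodes_two_tables = 30
hdfs_cost_two_tables = 1_500_000
s3_cost_two_tables = 120_000

hdfs_nodes_all_data = 90
s3_cost_all_data = 360_000

# Derived: implied yearly cost of one D2.8XL node, and the cost of 90 such nodes.
cost_per_node = hdfs_cost_two_tables / hdfs_nodes_two_tables
hdfs_cost_all_data = cost_per_node * hdfs_nodes_all_data

print(f"Implied cost per node-year: ${cost_per_node:,.0f}")
print(f"HDFS, all queryable data:   ${hdfs_cost_all_data:,.0f}")
print(f"HDFS/S3 ratio, two tables:  {hdfs_cost_two_tables / s3_cost_two_tables:.1f}x")
print(f"HDFS/S3 ratio, all data:    {hdfs_cost_all_data / s3_cost_all_data:.1f}x")
```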
Darwin rules: Adaptation
Take advantage of new instance types
Find the right instance type(s) for your workload
Prefer a smaller cluster of larger nodes: e.g. 4XL
With millions of partitions, more memory is needed for the Master node (HS2)
Use CLI based scripts rather than console → Infrastructure is code
Node type      Before           After
Master         1 x R3.4XL       1 x R3.2XL
Core           40 x M3.2XL      10 x C3.4XL
Task (peak)    100 x M3.2XL     35 x C3.4XL
Beat the incumbent
Right size your cluster
Transient use cases: ETL and batch analytics
Size cluster to complete within ten minutes of an hour boundary to optimize $$
Use Spot when you have flexible SLA to save $$
Use On Demand or Reserved to meet SLA at predictable cost
Always On use case: Interactive analytics
Size Core based on HDFS needs (statistics, logging, etc)
Reserve Master and Core nodes
Resize # of Task nodes as demand changes
Use Spot on Task nodes to save $$
Keep a ratio of Core to Task of 1:5 to avoid bottlenecks
Consider bidding Spot above the On Demand price to ensure greater stability
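The advice to finish within ten minutes of an hour boundary follows from hourly billing. The sketch below assumes each node is billed per started hour and that runtime scales linearly with node count; both are simplifying assumptions. The 1,000 node-minute workload is a hypothetical figure.

```python
import math


def billed_node_hours(nodes: int, runtime_minutes: float) -> int:
    """Node-hours billed when every started hour is charged in full (assumed billing model)."""
    return nodes * math.ceil(runtime_minutes / 60)


def runtime_minutes(total_node_minutes: float, nodes: int) -> float:
    """Runtime under the simplifying assumption that work divides evenly across nodes."""
    return total_node_minutes / nodes


work = 1000  # hypothetical workload, in node-minutes

for nodes in (10, 15, 20, 25):
    minutes = runtime_minutes(work, nodes)
    print(f"{nodes:2d} nodes: {minutes:5.1f} min, {billed_node_hours(nodes, minutes)} node-hours billed")
```

Under these assumptions, 20 nodes finishing in 50 minutes cost the same as 10 nodes running for 100 minutes, while 15 nodes cost more than either because they spill just past the first hour.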
One metastore to rule them all
Consider creating a shared hive metastore service
Fault tolerance & DR with Multi-AZ RDS
Offload metastore hydration of tables and partitions
Transient clusters initialize faster
Millions of partitions/day can take >7 min/day per table
Avoid duplicative effort by separate development teams
Separate metastores are needed for Hive 0.13.1, Hive 1.0 and Presto
However, you can locate them all on a single RDS instance
Utilize FINRA Data Management services to orchestrate metastore updates
Register new tables and partitions as the data arrives via notifications
Monitor, learn, and optimize
Utilize workload management: Fair Scheduler
Refactor your code as necessary to remove bottlenecks
Optimize transient clusters, size to execute workload 10 minutes from an hour boundary
Set hive.mapred.reduce.tasks.speculative.execution = FALSE when writing to external tables in S3 via MapReduce
Use broadcast joins when joining small tables (SET hive.auto.convert.join=true)
EMR Step API works for simple job queuing; use Oozie for more complex jobs
The impact
Removed obstacles
“Before, data analysis of this magnitude required intervention from the technology team.”
Lowered the cost of curiosity
“Analysts are able to quickly obtain a full picture of what happens to an order over time,
helping to inform decision making as to whether a rule violation has occurred.”
Elasticity allows us to process years of data in days as opposed to months and save money by using the Spot market
Separately optimize batch and interactive workloads without compromise
Increased team delivery velocity
Recap
Use Amazon S3 as your durable system of record
Use transient clusters as much as possible
Resize clusters and use Spot to manage capacity, performance, & cost more efficiently
Move to new instance families to take advantage of performance
Monitor to determine when to resize or change instance types
Share a persistent Hive metastore in RDS among multiple EMR clusters
Be prepared to switch your query engine or execution framework in the future
Budget time to experiment at scale with new tools & engines that weren’t possible before
Related sessions
BDT208 - A Technical Introduction to Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
BDT303 - Running Spark & Presto on the Netflix Big Data Platform
Thursday, Oct 8, 11:00 AM - 12:00 PM, Palazzo F
BDT309 - Best Practices for Apache Spark on Amazon EMR
Thursday, Oct 8, 5:30 PM - 6:30 PM, Palazzo F
BDT314 - Big Data/Analytics on Amazon EMR & Amazon Redshift
Thursday, Oct 8, 1:30 PM - 2:30 PM, Palazzo F
Remember to complete
your evaluations!
Thank you!

Contenu connexe

Tendances

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationVolodymyr Rovetskiy
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Amazon Web Services Korea
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3Mark Cohen
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon RedshiftAmazon Web Services
 
AWS EMR Cost optimization
AWS EMR Cost optimizationAWS EMR Cost optimization
AWS EMR Cost optimizationSANG WON PARK
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)Amazon Web Services
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineAmazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 

Tendances (20)

(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
Introduction of AWS KMS
Introduction of AWS KMSIntroduction of AWS KMS
Introduction of AWS KMS
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
 
Introduction to Amazon S3
Introduction to Amazon S3Introduction to Amazon S3
Introduction to Amazon S3
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
 
Intro to AWS: Database Services
Intro to AWS: Database ServicesIntro to AWS: Database Services
Intro to AWS: Database Services
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
 
AWS EMR Cost optimization
AWS EMR Cost optimizationAWS EMR Cost optimization
AWS EMR Cost optimization
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 

En vedette

男女共同ペアプログラミング勉強会関西の紹介
男女共同ペアプログラミング勉強会関西の紹介男女共同ペアプログラミング勉強会関西の紹介
男女共同ペアプログラミング勉強会関西の紹介takepu
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon KinesisAmazon Web Services
 
なんたらアジャイルその前に
なんたらアジャイルその前になんたらアジャイルその前に
なんたらアジャイルその前にTakaesu Makoto
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例Amazon Web Services Japan
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Minero Aoki
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB Amazon Web Services Japan
 

En vedette (20)

男女共同ペアプログラミング勉強会関西の紹介
男女共同ペアプログラミング勉強会関西の紹介男女共同ペアプログラミング勉強会関西の紹介
男女共同ペアプログラミング勉強会関西の紹介
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon Kinesis
 
リスク駆動開発
リスク駆動開発リスク駆動開発
リスク駆動開発
 
なんたらアジャイルその前に
なんたらアジャイルその前になんたらアジャイルその前に
なんたらアジャイルその前に
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
 
20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
20170726 black belt_stepfunctions
20170726 black belt_stepfunctions20170726 black belt_stepfunctions
20170726 black belt_stepfunctions
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon Connect
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS Shield
 
20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWS
 
AWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 SnowballAWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 Snowball
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon Aurora
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-Ray
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLift
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
 

Similaire à (BDT305) Amazon EMR Deep Dive and Best Practices

Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...Amazon Web Services
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석Amazon Web Services Korea
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...
STG330_Case Study How Experian Leverages Amazon EC2, EBS, and S3 with Clouder...Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...Amazon Web Services
 

(BDT305) Amazon EMR Deep Dive and Best Practices

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Pathak, AWS Scott Donaldson, FINRA Clayton Kovar, FINRA October 2015 Amazon EMR Deep Dive & Best Practices BDT305
  • 2. What to expect from the session • Update on the latest Amazon EMR release • Information on advanced capabilities of Amazon EMR • Tips for lowering your Amazon EMR costs • Deep dive into how FINRA uses Amazon EMR and Amazon S3 as their multi-petabyte data warehouse
  • 3. Rahul Pathak, Sr. Mgr. Amazon EMR (@rahulpathak) October 2015 Amazon EMR Deep Dive & Best Practices
  • 4. Amazon EMR • Managed clusters for Hadoop, Spark, Presto, or any other applications in the Apache/Hadoop stack • Integrated with the AWS platform via EMRFS – connectors for Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and AWS KMS • Secure with support for AWS IAM roles, KMS, S3 client-side encryption, Hadoop transparent encryption, Amazon VPC, and HIPAA-eligible • Built in support for resizing clusters and integrated with the Amazon EC2 spot market to help lower costs
  • 5. New Features EMR Release 4.1 • Hadoop KMS with transparent HDFS encryption support • Spark 1.5, Zeppelin 0.6 • Presto 0.119, Airpal • Hive, Oozie, Hue 3.7.1 • Simple APIs for launch and configuration Intelligent Resize • Incrementally scale up based on available capacity • Wait for work to complete before resizing down • Can scale core nodes and HDFS as well as task nodes
  • 6. Leverage Amazon S3 with EMR File System (EMRFS)
  • 7. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at the same data in Amazon S3 • Easily evolve your analytic infrastructure as technology evolves EMR EMR Amazon S3
  • 8. EMRFS makes it easier to use Amazon S3 • Read-after-write consistency • Very fast list operations • Error handling options • Support for Amazon S3 encryption • Transparent to applications: s3:// Amazon S3
  • 9. Going from HDFS to Amazon S3
        CREATE EXTERNAL TABLE serde_regex(
          host STRING,
          referer STRING,
          agent STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        LOCATION 'samples/pig-apache/input/'
  • 10. Going from HDFS to Amazon S3
        CREATE EXTERNAL TABLE serde_regex(
          host STRING,
          referer STRING,
          agent STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
  • 11. Consistent view and fast listing using the optional EMRFS metadata layer
      • Amazon S3, with EMRFS metadata in Amazon DynamoDB
      • List and read-after-write consistency
      • Faster list operations
      Number of objects | Without consistent view | With consistent view
      1,000,000         | 147.72                  | 29.70
      100,000           | 12.70                   | 3.69
      *Tested using a single node cluster with an m3.xlarge instance.
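As a quick check of what the listing numbers above imply, the following sketch computes the ratio between the two columns. The slide does not state the unit of the timings, so the code treats them as plain numbers; the variable and function names are illustrative only.

```python
# Listing times copied from the slide's table (unit not stated on the slide).
# Each value is (without consistent view, with consistent view).
listing_times = {
    1_000_000: (147.72, 29.70),
    100_000: (12.70, 3.69),
}


def speedup(without_cv, with_cv):
    """Return how many times faster listing is with consistent view enabled."""
    return without_cv / with_cv


speedups = {n: round(speedup(*times), 2) for n, times in listing_times.items()}

for n, factor in speedups.items():
    print(f"{n:>9,} objects: about {factor}x faster with consistent view")
```

On these two data points the advantage grows with the number of objects, though two measurements on a single m3.xlarge node are not enough to generalize.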
  • 12. EMRFS client-side encryption
      Diagram labels: Amazon S3 (client-side encrypted objects); Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; Key vendor (AWS KMS or your custom key vendor)
  • 13. HDFS is still there if you need it • Iterative workloads • If you’re processing the same dataset more than once • Consider using Spark & RDDs for this too • Disk I/O intensive workloads • Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
  • 15. File formats
      • Row oriented
          • Text files
          • Sequence files: Writable object
          • Avro data files: described by schema
      • Columnar format
          • Optimized Row Columnar (ORC)
          • Parquet
      Figure: a logical table shown in row-oriented and column-oriented layouts
  • 16. Factors to consider
      • Processing and query tools
          • Hive, Impala and Presto
      • Evolution of schema
          • Avro for schema and Presto for storage
      • File format “splittability”
          • Avoid JSON/XML files. Use them as records
      • Encryption requirements
  • 17. File sizes
      • Avoid small files
          • Anything smaller than 100MB
      • Each mapper is a single JVM
          • CPU time is required to spawn JVMs/mappers
      • Fewer files, matching closely to block size
          • Fewer calls to S3
          • Fewer network/HDFS requests
  • 18. Dealing with small files
      • Reduce HDFS block size, e.g. 1MB (default is 128MB)
          • --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
      • Better: Use S3DistCp to combine smaller files together
          • S3DistCp takes a pattern and target path to combine smaller input files into larger ones
          • Supply a target size and compression codec
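To make the idea of combining small files toward a target size concrete, here is a minimal planning sketch. It is not S3DistCp's algorithm and does not touch S3 or HDFS; `plan_compaction`, the file names, and the 128MB default are assumptions chosen only to illustrate the grouping step.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_bytes=128 * MB):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than the target ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    # Largest files first, so each batch fills up quickly.
    for name, size in sorted(file_sizes, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: ten 32MB part files.
files = [(f"part-{i:05d}", 32 * MB) for i in range(10)]
batches = plan_compaction(files)
for number, batch in enumerate(batches):
    print(f"output file {number}: {len(batch)} inputs")
```

Each batch would become one larger output file, so a job over the compacted data starts fewer mappers and makes fewer S3 requests.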
  • 19. Compression
      • Always compress data files on Amazon S3
          • Reduces network traffic between Amazon S3 and Amazon EMR
          • Speeds up your job
      • Compress mapper and reducer output
      • Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2
  • 20. Choosing the right compression
      • Time sensitive: faster compressions are a better choice
      • Large amount of data: use space-efficient compressions
      • Combined workload: use gzip
      Algorithm      | Splittable? | Compression ratio | Compress + decompress speed
      Gzip (DEFLATE) | No          | High              | Medium
      bzip2          | Yes         | Very high         | Slow
      LZO            | Yes         | Low               | Fast
      Snappy         | No          | Low               | Very fast
  • 21. Cost saving tips for Amazon EMR
      • Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.
      • Only pay for compute when you need it
      • Use Amazon EC2 Spot instances to save >80%
      • Use Amazon EC2 Reserved instances for steady workloads
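The ratio-versus-speed trade-off in the table can be measured on your own data before picking a codec. The sketch below covers only gzip and bzip2, because those ship with the Python standard library; LZO and Snappy need third-party packages and are left out. The sample record is made up and highly repetitive, so the ratios it produces say nothing about real datasets.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, payload):
    """Compress and decompress payload once; report ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(payload)
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    assert restored == payload  # round trip must be lossless
    return {"codec": name, "ratio": len(payload) / len(packed), "seconds": elapsed}


# Hypothetical sample: one CSV-like record repeated many times.
sample = b"2015-10-08,AAPL,12345,BUY,100,110.25\n" * 50_000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for r in results:
    print(f"{r['codec']:>5}: ratio {r['ratio']:.1f}, {r['seconds']:.3f}s round trip")
```

Running the same function over a representative slice of production data gives a more useful basis for the choice than the coarse High/Low labels.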
  • 22. Scott Donaldson, Senior Director Clayton Kovar, Principal Architect EMR & Interactive Analytics
  • 23.
  • 24. EMR is ubiquitous in our architecture
      Diagram labels: Users; Data Providers; Data Marts (Amazon Redshift); Query Cluster (EMR), shown twice; Adhoc Query Cluster (EMR); Batch Analytic Clusters (EMR); Normalization ETL Clusters (EMR); Optimization ETL Clusters (EMR); Auto Scaled EC2 Analytics App, shown twice; Auto Scaled EC2 Data Ingestion Services; Auto Scaled EC2 Data Catalog & Lineage Services; Auto Scaled EC2 Cluster Mgt & Workflow Services; Shared Data Services; Shared Metastore (RDS); Reference Data (RDS); Query Optimized (S3); Source of Truth (S3)
  • 25. It starts with the data
      • S3 is your durable system of record
          • Separate your compute and storage
          • Shut down your cluster when not in use
          • Share data among multiple clusters
          • Fault tolerance and disaster recovery
          • Use EMRFS for consistent view
      • Partition your data for performance
          • Optimize for your query use cases and access patterns
          • Larger files >256MB are more efficient
          • Compact small files into >100MB
      • FINRA Data Manager orchestrates data between storage and compute clusters
          • Unified catalog
          • Manage EMR clusters
          • Track usage & lineage
          • Job orchestration
  • 26. File formats & compression
      • Text format for archival copies on S3 & Amazon Glacier
          • Select compression algorithm for best fit
          • We wanted high compression for archive copy
      • Select a row or columnar format for performance
          • Sequence or Avro
          • ORC, Parquet, RC File
      • Columnar benefits:
          • Predicate pushdown
          • Skip unwanted columns
          • Serve multiple query engines: Hive, Presto, Spark
      • Avoid bloated formats with repetitive markup (e.g. XML)
  • 27. Our partition and query strategy
      Diagram labels:
      • Data received as:
      • Users query by:
      • Symbol Group 1, Symbol Group 2, Symbol Group 3, …, Symbol Group 100
      • Symbol & Firm Query; Symbol Only Query; Firm Only Query
      • Late Data: all late records scanned for all queries
      • On Time Data (Processing Date = Event Date): 99.97% of all records are on time
  • 28. Example Hive table creation
        create external table if not exists NEW_ORDERS (…)
        partitioned by (EVENT_DT DATE, HASH_PRTN_NB SMALLINT)
        stored as orc
        location 's3://reinvent/new_orders/'
        tblproperties ("orc.compress"="SNAPPY");

        alter table NEW_ORDERS add if not exists
        partition (event_dt='2015-10-08', hash_prtn_nb=0)
          location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
        …
        partition (event_dt='2015-10-08', hash_prtn_nb=1000)
          location 's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=1000/';

      Each record’s hash partition number is calculated by ((pmod(hash(symbol), 100) * 10) + pmod(firm, 10))
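The partition formula above can be sketched outside Hive to see how records spread across partitions 0 to 999. The stand-in `java_style_hash` below is an assumption: it mimics Java's 32-bit string hash and is not guaranteed to produce the same values as Hive's `hash()` UDF, so the partition numbers it yields are illustrative only. The symbol and firm values are made up.

```python
def java_style_hash(text):
    """Stand-in for Hive's hash(): a 32-bit signed, Java-style string hash.

    Assumption for illustration; real Hive values may differ.
    """
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def pmod(a, n):
    """Positive modulo, matching the behavior of Hive's pmod for n > 0."""
    return ((a % n) + n) % n


def hash_partition_number(symbol, firm):
    """((pmod(hash(symbol), 100) * 10) + pmod(firm, 10)) from the slide."""
    return pmod(java_style_hash(symbol), 100) * 10 + pmod(firm, 10)


partition = hash_partition_number("AAPL", 12345)
print(partition)
```

The tens-and-above digits encode one of 100 symbol groups and the last digit encodes one of 10 firm buckets, which is why a firm-only query can be narrowed to every partition number ending in the same digit.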
  • 29. Made Hive on EMR/S3 competitive
  • 30. Partitions are great, but beware…
        select … from NEW_ORDERS
        where EVENT_DT between '2015-10-06' and '2015-10-09'
          and FIRM = 12345
          and (pmod(HASH_PRTN_NB, 10) = pmod(12345, 10)
               or HASH_PRTN_NB = 1000) -- 1000 is always read
      Using PMOD around the hash_prtn_nb prevents Hive from using a targeted query on the metastore, resulting in millions of partitions returned for pruning
  • 31. Optimized query with enumeration
        select … from NEW_ORDERS
        where EVENT_DT >= '2015-10-06' and EVENT_DT <= '2015-10-09'
          and FIRM = 12345
          and (HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15 …
               or HASH_PRTN_NB = 985 or HASH_PRTN_NB = 995
               or HASH_PRTN_NB = 1000) -- 1000 is always read
      • Using an IN clause was insufficient to avoid the pruning issue
      • Explicitly enumerating all partitions vastly improved query planning time
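Writing out a hundred OR terms by hand is error prone, so the enumerated predicate is a natural candidate for generation. This is a minimal sketch, assuming the layout described in the slides (100 symbol groups, 10 firm buckets, and partition 1000 always read); the function names are hypothetical and not part of any FINRA tooling.

```python
def partitions_for_firm(firm, symbol_groups=100, firm_buckets=10, always_read=(1000,)):
    """List every HASH_PRTN_NB value that can hold rows for the given firm."""
    bucket = firm % firm_buckets
    parts = [group * firm_buckets + bucket for group in range(symbol_groups)]
    return parts + list(always_read)


def enumerated_predicate(firm, column="HASH_PRTN_NB"):
    """Build the explicit OR chain, avoiding both pmod() and an IN clause."""
    terms = [f"{column} = {p}" for p in partitions_for_firm(firm)]
    return "(" + " or ".join(terms) + ")"


parts = partitions_for_firm(12345)
predicate = enumerated_predicate(12345)
print(len(parts), "partitions")
print(predicate[:60], "...")
```

For firm 12345 this yields the values 5, 15, and so on up to 995, followed by 1000, matching the shape of the query on the slide.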
  • 32. Data security
      • Required to encrypt all data, both at rest and in transit
      • S3 server-side encryption was evaluated and determined to be suitable for purpose
      • Encrypt ephemeral storage on Master, Core, and Task nodes
          • Use a custom bootstrap action with LUKS with a random, memory-only key
          • Task nodes don’t have HDFS, but Mapper and Reducer temporary files also need to be encrypted
          • Lose the server, lose the data: remember S3 is our source of truth
      • Use security groups to ensure only the client applications connect to the Master node
      • Hive authentication/authorization was not necessary for our usage scenarios
      • Evaluating transparent encryption (Hadoop 2.6+) in HDFS
  • 33. Selection of the fittest
      • HDFS was cost prohibitive for our use cases
          • Need 30 D2.8XLs just to store two of our tables: ~$1.5M/yr on HDFS vs. ~$120K/yr on S3
          • Need 90 D2.8XLs to store all queryable data: ~$4.5M/yr on HDFS vs. $360K/yr on S3
      • Data locality is desirable but not practical for our scale
      • EMR & S3 with partitioned data is a great fit
          • Tuned queries & data structures on S3 take ~2X the time they would on HDFS under perfect locality conditions
          • Localize data into HDFS on Core nodes using S3DistCp if making 3 or more passes
      • Consider tiered storage
          • External tables in Hive can have a blend of some partitions in HDFS and others in S3
          • Introduces operational complexity for partition maintenance
          • Doesn’t play well with a shared metastore for multiple clusters
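The cost figures on this slide are internally consistent, which a short piece of arithmetic makes visible. The sketch below uses only the rounded numbers quoted on the slide; the dictionary keys and function names are made up for the illustration, and the per-node figure is derived rather than taken from a price list.

```python
# Rounded figures as quoted on the slide (USD per year).
TWO_TABLES = {"nodes": 30, "hdfs_usd": 1_500_000, "s3_usd": 120_000}
ALL_QUERYABLE = {"nodes": 90, "hdfs_usd": 4_500_000, "s3_usd": 360_000}


def hdfs_cost_per_node(scenario: dict) -> float:
    """Implied yearly cost of one D2.8XL node in the HDFS option."""
    return scenario["hdfs_usd"] / scenario["nodes"]


def hdfs_to_s3_ratio(scenario: dict) -> float:
    """How many times more the HDFS option costs than S3."""
    return scenario["hdfs_usd"] / scenario["s3_usd"]


if __name__ == "__main__":
    for name, scenario in [("two tables", TWO_TABLES), ("all data", ALL_QUERYABLE)]:
        print(name, hdfs_cost_per_node(scenario), hdfs_to_s3_ratio(scenario))
```

Both scenarios imply roughly $50K per node per year and an HDFS cost about 12.5 times the S3 cost, so the gap scales linearly with data volume.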
  • 34. Darwin rules: Adaptation
      • Take advantage of new instance types
      • Find the right instance type(s) for your workload
          • Prefer a smaller cluster of larger nodes, e.g. 4XL
          • With millions of partitions, more memory is needed for the Master node (HS2)
      • Use CLI-based scripts rather than the console → infrastructure is code
      Node type      Before          After
      Master         1 - R3.4XL      1 - R3.2XL
      Core           40 - M3.2XL     10 - C3.4XL
      Task (peak)    100 - M3.2XL    35 - C3.4XL
  • 36. Right size your cluster
      • Transient use cases: ETL and batch analytics
          • Size the cluster to complete within ten minutes of an hour boundary to optimize $$
          • Use Spot when you have a flexible SLA to save $$
          • Use On Demand or Reserved to meet SLA at predictable cost
      • Always On use case: interactive analytics
          • Size Core based on HDFS needs (statistics, logging, etc.)
          • Reserve Master and Core nodes
          • Resize # of Task nodes as demand changes
          • Use Spot on Task nodes to save $$
          • Keep a ratio of Core to Task of 1:5 to avoid bottlenecks
          • Consider bidding Spot above the On Demand price to ensure greater stability
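Two of the sizing rules on this slide reduce to simple checks, sketched below. The sketch assumes that instances are billed per hour with partial hours rounded up, and it reads "within ten minutes of an hour boundary" as finishing in the last ten minutes of a billed hour. Both are interpretations, and all function names are hypothetical.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged, assuming per-hour billing that rounds partial hours up."""
    return max(1, math.ceil(runtime_minutes / 60))


def finishes_near_hour_boundary(runtime_minutes: float, window_minutes: float = 10) -> bool:
    """True if the paid-for but unused time is at most window_minutes."""
    unused = billed_hours(runtime_minutes) * 60 - runtime_minutes
    return unused <= window_minutes


def max_task_nodes(core_nodes: int, tasks_per_core: int = 5) -> int:
    """Upper bound on Task nodes under the slide's 1:5 Core-to-Task ratio."""
    return core_nodes * tasks_per_core


if __name__ == "__main__":
    print(finishes_near_hour_boundary(52))  # True: 8 unused minutes
    print(finishes_near_hour_boundary(65))  # False: 55 unused minutes
    print(max_task_nodes(10))               # 50
```

Under these assumptions, a job that runs 65 minutes pays for two hours, so either shrinking the cluster to fill the second hour or growing it to finish inside the first would use the paid time better. A cluster with 10 Core nodes and a peak of 35 Task nodes stays within the 1:5 bound.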
  • 37. One metastore to rule them all
      • Consider creating a shared Hive metastore service
          • Fault tolerance & DR with Multi-AZ RDS
      • Offload metastore hydration of tables and partitions
          • Transient clusters initialize faster
          • Millions of partitions/day can take >7 min/day per table
      • Avoid duplicative effort by separate development teams
      • Separate metastores are needed for Hive 0.13.1, Hive 1.0, and Presto
          • However, you can locate them all on a single RDS instance
      • Utilize FINRA Data Management services to orchestrate metastore updates
          • Register new tables and partitions as the data arrives via notifications
  • 38. Monitor, learn, and optimize
      • Utilize workload management: Fair Scheduler
      • Refactor your code as necessary to remove bottlenecks
      • Optimize transient clusters: size them to execute the workload within 10 minutes of an hour boundary
      • Set hive.mapred.reduce.tasks.speculative.execution = FALSE when writing to external tables in S3 via MapReduce
      • Use broadcast joins when joining small tables (SET hive.auto.convert.join=true)
      • EMR Step API works for simple job queuing; use Oozie for more complex jobs
  • 39. The impact
      • Removed obstacles: “Before data analysis of this magnitude required intervention from the technology team.”
      • Lowered the cost of curiosity: “Analysts are able to quickly obtain a full picture of what happens to an order over time, helping to inform decision making as to whether a rule violation has occurred.”
      • Elasticity allows us to process years of data in days as opposed to months, and to save money by using the Spot market
      • Separately optimize batch and interactive workloads without compromise
      • Increased teams’ delivery velocity
  • 40. Recap
      • Use Amazon S3 as your durable system of record
      • Use transient clusters as much as possible
      • Resize clusters and use the Spot market to manage capacity, performance, & cost more efficiently
      • Move to new instance families to take advantage of performance improvements
      • Monitor to determine when to resize or change instance types
      • Share a persistent Hive metastore in RDS among multiple EMR clusters
      • Be prepared to switch your query engine or execution framework in the future
      • Budget time to experiment with new tools & engines at scale that weren’t possible before
  • 41. Related sessions
      • BDT208 - A Technical Introduction to Amazon Elastic MapReduce: Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
      • BDT303 - Running Spark & Presto on the Netflix Big Data Platform: Thursday, Oct 8, 11:00 AM - 12:00 PM, Palazzo F
      • BDT309 - Best Practices for Apache Spark on Amazon EMR: Thursday, Oct 8, 5:30 PM - 6:30 PM, Palazzo F
      • BDT314 - Big Data/Analytics on Amazon EMR & Amazon Redshift: Thursday, Oct 8, 1:30 PM - 2:30 PM, Palazzo F