Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
2. What to expect from the session
• Update on the latest Amazon EMR release
• Information on advanced capabilities of Amazon EMR
• Tips for lowering your Amazon EMR costs
• Deep dive into how FINRA uses Amazon EMR and Amazon S3 as their multi-petabyte data warehouse
4. Amazon EMR
• Managed clusters for Hadoop, Spark, Presto, or any other applications in the Apache/Hadoop stack
• Integrated with the AWS platform via EMRFS – connectors for Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and AWS KMS
• Secure, with support for AWS IAM roles, KMS, S3 client-side encryption, Hadoop transparent encryption, and Amazon VPC; HIPAA-eligible
• Built-in support for resizing clusters, integrated with the Amazon EC2 Spot market to help lower costs
5. New Features
EMR Release 4.1
• Hadoop KMS with transparent HDFS encryption support
• Spark 1.5, Zeppelin 0.6
• Presto 0.119, Airpal
• Hive, Oozie, Hue 3.7.1
• Simple APIs for launch and configuration
Intelligent Resize
• Incrementally scale up based on available capacity
• Wait for work to complete before resizing down
• Can scale core nodes and HDFS as well as task nodes
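For illustration only (not from the original deck), a minimal AWS CLI sketch of the resize operation described above; the instance group ID and target count are placeholders, and EMR waits for running work to finish before removing nodes on scale-down.
# Shrink or grow an instance group on a running cluster (IDs are placeholders)
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=10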
7. Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at the same data in Amazon S3
• Easily evolve your analytic infrastructure as technology evolves
8. EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
9. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "…"
)
LOCATION 'samples/pig-apache/input/'
10. Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "…"
)
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
11. Consistent view and fast listing using the optional EMRFS metadata layer
• EMRFS metadata stored in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations
Number of objects    Without consistent view    With consistent view
1,000,000            147.72                     29.70
100,000              12.70                      3.69
*Tested using a single-node cluster with an m3.xlarge instance.
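Consistent view is turned on when the cluster is launched; a hedged AWS CLI sketch follows, with release label, instance settings, and retry values chosen only for illustration.
# Launch a cluster with EMRFS consistent view enabled (values are illustrative)
aws emr create-cluster \
  --release-label emr-4.1.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30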
13. HDFS is still there if you need it
• Iterative workloads
• If you’re processing the same dataset more than once
• Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing (see the example below)
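A minimal sketch of that pattern, assuming placeholder bucket and HDFS paths: stage the input into HDFS with S3DistCp, run the iterative job locally, then copy results back to S3.
# Stage an S3 prefix into HDFS before an iterative or I/O-intensive job
s3-dist-cp --src s3://mybucket/input/ --dest hdfs:///working/input/
# ... run the iterative job against hdfs:///working/ ...
s3-dist-cp --src hdfs:///working/output/ --dest s3://mybucket/output/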
15. File formats
Row oriented
• Text files
• Sequence files
• Writable object
• Avro data files
• Described by schema
Columnar format
• Optimized Row Columnar (ORC)
• Parquet
[Diagram: a logical table laid out row-oriented vs. column-oriented on disk]
16. Factors to consider
Processing and query tools
• Hive, Impala and Presto
Evolution of schema
• Avro for schema evolution and Parquet for storage
File format “splittability”
• Avoid JSON/XML Files. Use them as records
Encryption requirements
17. File sizes
Avoid small files
• Anything smaller than 100MB
Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
Fewer files, matching closely to block size
• fewer calls to S3
• fewer network/HDFS requests
18. Dealing with small files
Reduce HDFS block size, e.g., 1 MB (default is 128 MB)
• --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
Better: use S3DistCp to combine smaller files together (see the example below)
• S3DistCp takes a pattern and target path to combine smaller input files into larger ones
• Supply a target size and compression codec
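A hedged sketch of that S3DistCp invocation; the bucket, grouping regex, target size, and codec are illustrative placeholders, not values from the original deck.
# Combine many small log files into ~128 MB gzip files on S3
s3-dist-cp \
  --src s3://mybucket/logs/ \
  --dest s3://mybucket/logs-combined/ \
  --groupBy '.*/(\w+)/.*\.log' \
  --targetSize 128 \
  --outputCodec gz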
19. Compression
Always compress data files on Amazon S3
• Reduces network traffic between Amazon S3 and Amazon EMR
• Speeds up your job
Compress mapper and reducer output (example settings below)
Amazon EMR compresses inter-node traffic with LZO on Hadoop 1, and Snappy on Hadoop 2
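A hedged HiveQL sketch of session settings that implement the advice above; the codec choices are illustrative, not prescribed by the deck.
-- Compress intermediate map output with Snappy and final job output with Gzip
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;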
20. Choosing the right compression
• Time sensitive? Faster compression is a better choice
• Large amount of data? Use space-efficient compression
• Combined workload? Use gzip
Algorithm        Splittable?   Compression ratio   Compress + decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
21. Cost saving tips for Amazon EMR
Use S3 as your persistent data store and query it using Presto, Hive, Spark, etc.
Only pay for compute when you need it
Use Amazon EC2 Spot Instances to save >80%
Use Amazon EC2 Reserved Instances for steady workloads
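A hedged AWS CLI sketch of the Spot tip: On-Demand Master and Core nodes with a Spot Task group. Instance types, counts, and the bid price are illustrative placeholders.
# Launch a cluster whose Task nodes run on the Spot market
aws emr create-cluster \
  --release-label emr-4.1.0 \
  --applications Name=Hive Name=Spark \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
    InstanceGroupType=TASK,InstanceCount=8,InstanceType=m3.xlarge,BidPrice=0.30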
24. EMR is Ubiquitous in our architecture
[Architecture diagram: data providers feed auto-scaled EC2 data ingestion services; normalization and optimization ETL clusters (EMR) write to the source of truth and a query-optimized store, both in S3; batch analytic clusters, query clusters, and an ad hoc query cluster (all EMR), plus data marts (Amazon Redshift), serve auto-scaled EC2 analytics apps for users; shared data services include a shared metastore (RDS), reference data (RDS), data catalog & lineage services, and cluster management & workflow services on auto-scaled EC2.]
25. It starts with the data
S3 is your durable system of record
Separate your compute and storage
Shut down your cluster when not in use
Share data among multiple clusters
Fault tolerance and disaster recovery
Use EMRFS for consistent view
Partition your data for performance
Optimize for your query use cases and access patterns
Larger files (>256 MB) are more efficient
Compact small files into files >100 MB
FINRA Data Manager orchestrates data between storage and compute clusters
Unified catalog
Manage EMR clusters
Track usage & lineage
Job orchestration
26. File formats & compression
Text format for archival copies on S3 & Amazon Glacier
Select compression algorithm for best fit
We wanted high compression for archive copy
Select a row or columnar format for performance
Sequence or Avro
ORC, Parquet, RCFile
Columnar Benefits:
Predicate pushdown
Skip unwanted columns
Serve multiple query engines: Hive, Presto, Spark
Avoid bloated formats with repetitive markup (e.g. XML)
27. Our partition and query strategy
[Diagram: data is received partitioned by processing date and hash-partitioned into 100 symbol groups, plus a separate late-data partition; 99.97% of all records are on time (processing date = event date); users query by symbol & firm, symbol only, or firm only; all late records are scanned for all queries.]
28. Example Hive table creation
create external table if not exists NEW_ORDERS (…)
partitioned by (EVENT_DT DATE, HASH_PRTN_NB SMALLINT)
stored as orc
location 's3://reinvent/new_orders/'
tblproperties ("orc.compress"="SNAPPY");
alter table NEW_ORDERS add if not exists
partition (event_dt='2015-10-08', hash_prtn_nb=0) location
's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=0/'
…
partition (event_dt='2015-10-08', hash_prtn_nb=1000) location
's3://reinvent/new_orders/event_dt=2015-10-08/hash_prtn_nb=1000/'
;
Each record's hash partition number is calculated by
((pmod(hash(symbol), 100) * 10) + pmod(firm, 10))
Partitions 0-999 hold on-time records; partition 1000 holds late-arriving data and is always read.
30. Partitions are great, but beware…
select … from NEW_ORDERS where
EVENT_DT between '2015-10-06' and '2015-10-09'
and FIRM = 12345
and (pmod(HASH_PRTN_NB, 10) = pmod(12345, 10)
or HASH_PRTN_NB = 1000) -- 1000 is always read
Wrapping HASH_PRTN_NB in PMOD prevents Hive from issuing a targeted query against the metastore, so millions of partitions are returned for pruning
31. Optimized query with enumeration
select … from NEW_ORDERS where
EVENT_DT >= '2015-10-06' and EVENT_DT <= '2015-10-09'
and FIRM = 12345
and (HASH_PRTN_NB = 5 or HASH_PRTN_NB = 15
… or HASH_PRTN_NB = 985 or HASH_PRTN_NB = 995
or HASH_PRTN_NB = 1000) -- 1000 is always read
Using an IN clause was insufficient to avoid the pruning issue
Explicitly enumerating all partitions vastly improved query planning time; for FIRM = 12345, pmod(12345, 10) = 5, so the enumeration covers partitions 5, 15, …, 995, plus the always-read late-data partition 1000
32. Data security
Required to encrypt all data both at rest and in transit
S3 server-side encryption was evaluated and determined to be suitable for purpose
Encrypt ephemeral storage on Master, Core, and Task nodes
Use a custom bootstrap action with LUKS and a random, memory-only key (see the sketch below)
Task nodes don't have HDFS, but mapper and reducer temporary files also need to be encrypted
Lose the server, lose the data: remember, S3 is our source of truth
Use security groups to ensure only the client applications connect to the Master node
Hive authentication/authorization was not necessary for our usage scenarios
Evaluating transparent encryption (Hadoop 2.6+) in HDFS
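A hedged sketch of such a bootstrap action, not FINRA's actual script: it encrypts an ephemeral volume with LUKS using a random key that exists only in memory, so the on-disk data is unrecoverable once the instance is gone. The device name and mount point are illustrative placeholders.
#!/bin/bash
# Encrypt an assumed ephemeral device with a random, memory-only LUKS key
set -e
DEVICE=/dev/xvdb                  # placeholder ephemeral device
NAME=encrypted_ephemeral
MOUNT_POINT=/mnt/encrypted
sudo yum install -y cryptsetup-luks
sudo umount "$DEVICE" 2>/dev/null || true
# Random key generated in memory and never written to disk
KEY=$(head -c 64 /dev/urandom | base64 -w 0)
echo -n "$KEY" | sudo cryptsetup luksFormat --batch-mode "$DEVICE" --key-file=-
echo -n "$KEY" | sudo cryptsetup luksOpen "$DEVICE" "$NAME" --key-file=-
sudo mkfs.ext4 -q "/dev/mapper/$NAME"
sudo mkdir -p "$MOUNT_POINT"
sudo mount "/dev/mapper/$NAME" "$MOUNT_POINT"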
33. Selection of the fittest
HDFS was cost prohibitive for our use cases
Need 30 D2.8XLs just to store two of our tables: ~$1.5M/yr on HDFS vs. ~$120K/yr on S3
Need 90 D2.8XLs to store all queryable data: ~$4.5M/yr on HDFS vs. ~$360K/yr on S3
Data locality is desirable but not practical for our scale
EMR & S3 with partitioned data is a great fit
Tuned queries & data structures on S3 take ~2x as long as they would on HDFS under perfect locality conditions
Localize data into HDFS on Core nodes using S3DistCp if making 3 or more passes
Consider tiered storage
External tables in Hive can have a blend of some partitions in HDFS and others in S3
Introduces operational complexity for partition maintenance
Doesn't play well with a shared metastore for multiple clusters
34. Darwin rules: Adaptation
Take advantage of new instance types
Find the right instance type(s) for your workload
Prefer a smaller cluster of larger nodes, e.g., 4XL
With millions of partitions, more memory is needed for the Master node (HiveServer2)
Use CLI-based scripts rather than the console → infrastructure is code
Node type     Before          After
Master        1 - R3.4XL      1 - R3.2XL
Core          40 - M3.2XL     10 - C3.4XL
Task (peak)   100 - M3.2XL    35 - C3.4XL
36. Right size your cluster
Transient use cases: ETL and batch analytics
Size the cluster to complete within ten minutes of an hour boundary to optimize $$
Use Spot when you have a flexible SLA to save $$
Use On-Demand or Reserved Instances to meet SLAs at predictable cost
Always-on use case: interactive analytics
Size Core based on HDFS needs (statistics, logging, etc.)
Reserve Master and Core nodes
Resize the number of Task nodes as demand changes (see the example below)
Use Spot on Task nodes to save $$
Keep a Core-to-Task ratio of 1:5 to avoid bottlenecks
Consider bidding Spot above the On-Demand price to ensure greater stability
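A hedged AWS CLI sketch of adding a Spot Task instance group to a running, always-on cluster as interactive demand grows; the cluster ID, instance type, count, and bid price are placeholders.
# Add a Spot Task group to an existing cluster (values are illustrative)
aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceCount=20,InstanceGroupType=TASK,InstanceType=c3.4xlarge,BidPrice=1.00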
37. One metastore to rule them all
Consider creating a shared hive metastore service
Fault tolerance & DR with Multi-AZ RDS
Offload metastore hydration of tables and partitions
Transient clusters initialize faster
Millions of partitions/day can take >7 min/day per table
Avoid duplicative effort by separate development teams
Separate metastores are needed for Hive 0.13.1, Hive 1.0 and Presto
However, you can locate them all on a single RDS instance
Utilize FINRA Data Management services to orchestrate metastore updates
Register new tables and partitions as the data arrives via notifications
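A hedged sketch of pointing an EMR 4.x cluster at a shared external Hive metastore in Amazon RDS via the hive-site configuration classification; the RDS endpoint, database name, and credentials are placeholders, not FINRA's actual configuration.
# Launch a cluster whose Hive metastore lives in a shared RDS MySQL database
aws emr create-cluster \
  --release-label emr-4.1.0 \
  --applications Name=Hive \
  --use-default-roles \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations '[{
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://shared-metastore.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive",
      "javax.jdo.option.ConnectionPassword": "hive-password"
    }
  }]'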
38. Monitor, learn, and optimize
Utilize workload management: Fair Scheduler
Refactor your code as necessary to remove bottlenecks
Optimize transient clusters: size them to finish the workload within 10 minutes of an hour boundary
Set hive.mapred.reduce.tasks.speculative.execution = false when writing to external tables in S3 via MapReduce
Use broadcast joins when joining small tables (SET hive.auto.convert.join=true)
The EMR Step API works for simple job queuing; use Oozie for more complex jobs
39. The impact
Removed obstacles
“Before, data analysis of this magnitude required intervention from the technology team.”
Lowered the cost of curiosity
“Analysts are able to quickly obtain a full picture of what happens to an order over time, helping to inform decision making as to whether a rule violation has occurred.”
Elasticity allows us to process years of data in days as opposed to months, and to save money by using the Spot market
Separately optimize batch and interactive workloads without compromise
Increased teams' delivery velocity
40. Recap
Use Amazon S3 as your durable system of record
Use transient clusters as much as possible
Resize clusters and use the Spot market to more efficiently manage capacity, performance, & cost
Move to new instance families to take advantage of performance
Monitor to determine when to resize or change instance types
Share a persistent Hive metastore in RDS among multiple EMR clusters
Be prepared to switch your query engine or execution framework in the future
Budget time to experiment with new tools & engines at scale that weren't possible before
41. Related sessions
BDT208 - A Technical Introduction to Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B
BDT303 - Running Spark & Presto on the Netflix Big Data Platform
Thursday, Oct 8, 11:00 AM - 12:00 PM, Palazzo F
BDT309 - Best Practices for Apache Spark on Amazon EMR
Thursday, Oct 8, 5:30 PM - 6:30 PM, Palazzo F
BDT314 - Big Data/Analytics on Amazon EMR & Amazon Redshift
Thursday, Oct 8, 1:30 PM - 2:30 PM, Palazzo F