Design patterns and best practices for data analytics with amazon emr (ABD305)

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Design Patterns and Best Practices
for Data Analytics with Amazon EMR
J o n a t h a n F r i t z , P r i n c i p a l P r o d u c t M a n a g e r – A m a z o n E M R
A n y a B i d a , S e n i o r M e m b e r o f T e c h n i c a l S t a f f - S a l e s f o r c e
A B D 3 0 5

Overview
- Intro and architectures
- Using Amazon EC2 Spot and Auto Scaling
- Security overview
- Ad-hoc and advanced workflows
- Apache Spark and Amazon EMR at Salesforce

Easy to use
Launch a cluster in minutes
What is Amazon EMR?
Low cost
Pay per-second
Open-source variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy to enable options
Flexible
Full customization and control

Open-source applications
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop
HBase/Phoenix
Presto
Streaming
Flink
Zeppelin - Notebooks
MXNet
Hue - SQL
Ganglia - Monitoring
Livy – Job Server
Connectorsto
AWSservices
Amazon EMR
service

Use the AWS Glue Data Catalog
• Support for Spark, Hive
and Presto
• Auto-generate schema
and partitions
• Managed table updates

HBase for random access at massive scale

Real-time and batch processing

New – Deep learning with GPU instances
Use P3 instances
with GPUs

Lower your costs
T i p s t o

Transient or long-running clusters
• Amazon Linux AMI with preinstalled customizations for faster cluster creation
• Auto Scaling to minimize costs for long-running clusters

EC2 Spot and instance fleets
• EMR will select optimal EC2 AZ
• Provision across instance types
• Switch to on-demand

Use Auto Scaling
Scaling options
Threshold
CloudWatch or custom metric

Auto Scaling
• EMR scales-in at YARN task completion
• Selectively removes nodes with no running tasks
• yarn.resourcemanager.decommissioning.timeout
• Default timeout is one hour
• Spark scale-in contributions
• Spark specific blacklisting of tasks
• Unregistering cached data and shuffle blocks
• Advanced error handling

Secure your cluster
T i p s t o

Encryption
• Spark
• Tez
• MapReduce
• Presto
• HBase
• Hive
• Pig

Authentication LDAP
HiveServer2
Presto Coordinator
Spark Thrift Server
Hue Server
Zeppelin Server
AWS credentials
EMR Step (EMR API)
EC2 key pair
SSH as “hadoop”

New – Authentication with Kerberos
Microsoft
Active Directory
KDC
Users
YARN RMDoAs
Service principals for
all cluster nodes
Master Node

Authorization
• Storage-based
• EMRFS/S3
• HDFS
• HiveServer2 and Presto (SQL-based)
• HBase
• YARN queues
• Fine-grained access control by cluster tag (IAM)
• Apache Ranger on edge node (using CloudFormation)

New – EMRFS fine-grained authorization
Context
User: aduser
Group: analyst
IAM role: analytics_prod
Can map IAM roles to user, group, or S3 prefix
Context
User: aduser2
Group: dev
IAM role: analytics_dev

Security configuration demo

Submit workflows
T i p s t o

Use Livy as an ad-hoc Spark job server
Custom
Application

Oozie and Airflow for DAGs of jobs

Select Amazon EMR customers

Apache Spark and Amazon EMR at Salesforce
abida@salesforce.com @anyabida1
Anya Bida, Senior Member of Technical Staff at Salesforce

Fastest Growing Top 5
Enterprise Software Company
$5.4BFY15
$4.1BFY14
$3.1BFY13
$6.7BFY16
$2.3BFY12
$1.7BFY11
$2.56BFY18Q2 revenue
$8.4BFY17 revenue
2009 • 2010 • 2011
2012 • 2013 • 2014
2015 • 2016 • 2017
September
2016
2011 • 2012 • 2013
2014 • 2015 • 2016 • 2017
The world’s most
innovative companies
“Innovator of
the Decade”

Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use AWS Identity and Access Management (IAM) roles
Isolate Environments

Complete ML Pipeline
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time

Selecting a tool
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Supports each step of our ML pipeline
Scales for small & large jobs
Good ML Libraries
Active user base
Ability to deploy production ready code

We wanted Spark…now how to deploy it?
EC2
• Support for batch / streaming
• Integrates with our tooling
• Spin up / down clusters
• Larger / smaller clusters
• Support for different versions of Hadoop, Spark
• Storage & compute options

EC2
• Storage & Compute options
Need: Management

Need: Management
EC2
• Storage & Compute options

Provision EMR
Simplest approach
Input bucket
Monitoring
aws emr create-cluster
--applications Name=Hadoop
Name=Spark Name=Ganglia
--tags
--ec2-attributes
--release-label
--log-uri
…
--name
--instance-groups
--region

Provision EMR
Simplest approach
Input bucket
Monitoring
--applications Name=Hadoop
Name=Spark Name=Ganglia
# tag for cost analysis by project
--tags
--ec2-attributes
--release-label
# send logs to S3
--log-uri
…
# use naming conventions for service
discovery
# <region>-<project>-<version>-<env>
--name
# CORE nodes used for writes to HDFS
# TASK nodes used for compute - try spot
instances here for starters
--instance-groups
--region

Simplest approach
Input bucket
EMR cluster
Output bucket
Logs
bucket

Spark Primer
Apache Spark
Driver Program
SparkContext
Node
Executor Cache
TaskTask
Cluster Manager
Node
Executor Cache
TaskTask

Spark on Amazon EMR
Apache Spark
Master Node Core Node
Cluster Manager
ResourceManager
NameNode
NodeManager
Datanode
Task Node
NodeManagerNodeManager

Spark on Amazon EMR
Apache Spark
Master Node Core Node
Executor Cache
TaskTask
Cluster Manager
ResourceManager
NameNode
Task Node
Executor Cache
TaskTask
Driver Program
SparkContext
Executor Cache
TaskTask
NodeManager
Datanode
NodeManagerNodeManager

Properties related to Dynamic Allocation
Property Value
Spark.dynamicAllocation.enabled true
Spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 5
spark.dynamicAllocation.maxExecutors 17
spark.dynamicAllocation.initalExecutors 0
sparkdynamicAllocation.executorIdleTime 60s
spark.dynamicAllocation.schedulerBacklogTimeout 5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
Optional

Where do I find Metrics?
Ganglia
windowing, dashboarding
Cloudwatch
YARNMemoryAvailablePercentage
Logs?

Anatomy of a Spark Job
High Performance Spark, Karau & Warren, O’Reilly
Spark Context/Spark Session Object
Actions (e.g., collect, saveAsTextFile)
Wide transformations; (sort, groupByKey)
Computation to evaluate one partition
(combine narrow transforms)
Spark
Application
Job
Stage Stage
Task Task

Simplest approach
Input bucket
EMR cluster
Output bucket
Reads from S3
• Jar files too!
Write Intermediate files
• MEM or Disk?
• Local? HDFS? Amazon S3?
Writes to S3
• Data available after cluster is
terminated

Cache Persist Checkpoint Local Checkpoint
local mem cache MEM MEM MEM
local disk DISK DISK
HDFS / S3 Specify dir
If exec is decommed, are
writes available?
No No Yes No
If job finishes are writes
available?
No No Yes No
Preserve lineage graph? Yes Yes No No
RDD Re-use

Cache Persist Checkpoint Local Checkpoint
local mem cache MEM MEM MEM
local disk DISK DISK
HDFS / S3 Specify dir
If exec is decommed, are
writes available?
No No Yes No
If job finishes are writes
available?
No No Yes No
Preserve lineage graph? Yes Yes No No
RDD Re-use
Persist to improve speed, Checkpoint to improve fault tolerance

Overview
Our Goal
Getting started with EMR
Spark primer
Use IAM Roles
Isolate Environments

https://light.co/camera

Understand resource allocation
Understanding Memory Management in Spark For Fun And Profit Shivnath Babu (Duke University, Unravel Data Systems)
Mayuresh Kunjir (Duke University)
Node Memory
Container Memory
8Gb
Node Memory
Container
Memory
8Gb

Can my 8Gb container launch on this cluster?
Node
Memory
Node
Memory
Node
Memory
4Gb
used
8Gb
total
8Gb

Can my 8Gb container launch on this cluster?
Scale-out Rule: Num Containers Pending
Node
Memory
Node
Memory
Node
Memory
4Gb
used
8Gb
total
8Gb

Connectivity viewer
https://www.linkedin.com/in/vaibhavt/
Vaibhav Tandon

Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• User has an IAM Role

Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• Job has an IAM Role
IAM

Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• IAM Roles determine read /
write access to data
IAM
Out
Logs
IAM
In

Use IAM roles
Every user, service, & job should have specific, auditable permissions.
New: EMRFS fine-grained access control!!
Cluster Manager
Scheduler
IAM
IAM Roles
• IAM Roles determine read /
write access to data
IAM
Out
Logs
IAM
…
--service-role
In

Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
Out
Logs
In

Cluster Manager
Scheduler
virtual private cloud Out
Logs
In

Cluster Manager
Scheduler
VPC subnet
virtual private cloud
VPC subnet
Out
Logs
In

Cluster Manager
Scheduler
VPC subnet
security group
security group
security group
VPC subnet
Out
Logs
In

Cluster Manager
Scheduler
VPC subnet
security group
security group
security group
VPC subnet
Out
Logs
In
…
--ec2-attributes '{
"KeyName":"",
"InstanceProfile":"",
"ServiceAccessSecurityGroup":"",
"SubnetId":"”,
"EmrManagedSlaveSecurityGroup":"",
"EmrManagedMasterSecurityGroup":""}'}'

Cluster Manager
Scheduler
In
Out
Logs
VPC subnet
security group
IAM
security group
security group
VPC subnet
IAM
Environment

VPC subnet
security group
security group
security group
VPC subnet
IAM
Environment
Dev Staging
Canary Prod

VPC subnet
security group
security group
security group
VPC subnet
IAM
Environment
Dev Staging
Canary Prod
Automation
• Use Cloudformation or
Terraform
• Upgrades use the same
provisioning script + DNS
Upsert

VPC subnet
security group
security group
security group
VPC subnet
IAM
Environment
Availability
Zone
region
Dev Staging
Canary Prod

VPC subnet
security group
security group
security group
VPC subnet
IAM
Environment
Availability
Zone
region
Dev Staging
Canary Prod
Availability
Zone
region
Dev Staging
Canary Prod
Availability
Zone
Availability
Zone

Did we just automate ourselves
out of our jobs?
Nope. Now we have time to take on new projects and grow…

THANK YOU!
J o n a t h a n F r i t z – j o n f r i t z @ a m a z o n . c o m
A n y a B i d a – a b i d a @ s a l e s f o r c e . c o m
a w s . a m a z o n . c o m / e m r
a w s . a m a z o n . c o m / b l o g s / b i g - d a t a
a w s . a m a z o n . c o m / b l o g s / a i

Extra slides
`spark.history.fs.cleaner.enabled=true`

Design patterns and best practices for data analytics with amazon emr (ABD305)

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Design patterns and best practices for data analytics with amazon emr (ABD305)

Similaire à Design patterns and best practices for data analytics with amazon emr (ABD305) (20)

Plus de Amazon Web Services

Plus de Amazon Web Services (20)

Dernier

Dernier (20)

Design patterns and best practices for data analytics with amazon emr (ABD305)