Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS Summit Tel Aviv 2019
1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Best practices for Running Spark jobs on
Amazon EMR with Spot Instances
Ran Sheinberg
Specialist SA – EC2 Spot
Amazon Web Services
Eyal Lanxner
Chief Technology Officer
Feedvisor
DAT303
Daniel Haviv
Specialist SA - Analytics
Amazon Web Services
Anatoli Atamanov
VP Operations & IT
Feedvisor
2. Agenda
• Amazon EC2 Spot Instances
• Amazon EMR recap
• Spark best practices
• EMR Instance Fleets with Spot Instances
• Customer story - Feedvisor
4. Amazon EC2 purchase options
5. EC2 Spot pools - instance type flexibility
Each instance family
Each instance size
Each Availability Zone (61)
In every region (20)
Is a separate Spot pool
Example: C4 Spot prices per size and Availability Zone, next to the On-Demand price
Size   1b      1c      1a      On-Demand
8XL    $0.27   $0.29   $0.50   $1.76
4XL    $0.30   $0.16   $0.21   $0.88
2XL    $0.07   $0.08   $0.08   $0.44
XL     $0.05   $0.04   $0.04   $0.22
L      $0.01   $0.04   $0.01   $0.11
Instance families shown: R5, M4, C5, I3, M5d, R4, D2, C4, R5d
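The "separate Spot pool" idea on this slide can be sketched as a simple cross product. This is an illustration only: the `SpotPools` object and its `Pool` type are hypothetical names, and the size and zone lists simply mirror the C4 example from the slide.

```scala
// Sketch: treat every (family, size, Availability Zone) combination as its own Spot pool.
// SpotPools and Pool are hypothetical names used only for this illustration.
object SpotPools {
  final case class Pool(instanceType: String, zone: String)

  // Cross product of families, sizes, and zones.
  def pools(families: Seq[String], sizes: Seq[String], zones: Seq[String]): Seq[Pool] =
    for {
      family <- families
      size   <- sizes
      zone   <- zones
    } yield Pool(s"$family.$size", zone)

  def main(args: Array[String]): Unit = {
    // One family, five sizes, three zones, as in the C4 example.
    val c4 = pools(
      families = Seq("c4"),
      sizes    = Seq("large", "xlarge", "2xlarge", "4xlarge", "8xlarge"),
      zones    = Seq("1a", "1b", "1c")
    )
    println(s"${c4.size} separate pools for one family in three zones")
    c4.take(3).foreach(println)
  }
}
```

Each additional family or size a workload can accept multiplies the number of pools it can draw capacity from, which is the reason for building instance-type agnostic workloads.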
6. Running applications at extreme scale
single HPC cluster of 1 million vCPUs
7. Amazon EC2 Spot integrations
• Auto Scaling
• AWS Batch
• Amazon EMR
• AWS Data Pipeline
• Amazon Elastic Container Service
• AWS CloudFormation
• Amazon Elastic Container Service for Kubernetes
• AWS Thinkbox Deadline
8. Spot is easy
• No Bidding
• Minimal interruptions <5%
• Low, Predictable Prices
10. Pricing Model
New smooth pricing
November 2017
11. Spot Instance Advisor
https://aws.amazon.com/ec2/spot/instance-advisor/
12. Main takeaways for Spot Instances
• Build instance-type agnostic workloads
• No bidding, no price spikes
• New instance families generally have higher interruption rates – Spot Instance Advisor
• Architect for fault-tolerance to be Spot ready
14. Amazon EMR: Enterprise-grade Hadoop & Spark
Scale to any size
• Scale compute (EMR) & storage (S3) independently
• Store and process any amount of data, from PBs to EBs
• Provision one, hundreds, or thousands of nodes
• Auto-scaling
Data Lake on AWS
Amazon EMR
15. Enterprise-grade Hadoop & Spark
Highly available and durable
• S3 is designed to deliver 99.999999999% durability
• EMR monitors your cluster, replacing poorly performing and failed nodes and restarting services
• Monitor your clusters using Amazon CloudWatch
• Built-in console to view job history & browse logs
• EMR has on-cluster HDFS for data persistence
16. Enterprise-grade Hadoop & Spark
Highly secure
• Encryption of data at rest and in-transit
• ML-powered security with Amazon Macie
• Network isolation using Amazon VPC
• Access and permissions control with IAM policies
• Log and audit activity with AWS CloudTrail
• Microsoft AD integration with Kerberos support
17. Amazon EMR node types
Master node: The node that manages the cluster. The master node tracks the status of tasks and monitors the health of the cluster.
Core nodes: Nodes that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.
Task nodes: Nodes that only run tasks and do not store data in HDFS. Task nodes are optional.
Diagram labels: Amazon EMR cluster; Master instance fleet; Core instance fleet (HDFS); Task instance fleet
18. Parallelization
[Figure: two plots of the number of parallelized nodes over time, one with a job running time of 10 hours and one with a job running time of 1 hour]
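A rough way to read the parallelization comparison is in node-hours. The sketch below is only an illustration: the node counts are hypothetical (the slide does not give them), and it assumes the job scales linearly with the number of nodes.

```scala
// Sketch: compare total node-hours for a long job on few nodes
// and a short job on many nodes. Node counts are hypothetical.
object NodeHours {
  // Total node-hours consumed by a cluster of `nodes` running for `hours`.
  def nodeHours(nodes: Int, hours: Double): Double = nodes * hours

  def main(args: Array[String]): Unit = {
    val slow = nodeHours(nodes = 10, hours = 10.0)  // assumed: 10 nodes for 10 hours
    val fast = nodeHours(nodes = 100, hours = 1.0)  // assumed: 100 nodes for 1 hour
    println(s"10-hour job: $slow node-hours, 1-hour job: $fast node-hours")
  }
}
```

Under the linear-scaling assumption both runs consume the same number of node-hours, so the shorter run finishes sooner for a similar compute bill.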
21. Breaking the monolith
23. Reducing shuffle
10x longer
24. Reducing shuffle | Explode Group
user_id   visits_array
2121123   ["28/01/2018", "29/01/2018", "01/01/2019"]
2323434   ["01/11/2017", "01/12/2017"]
9959594   ["01/01/2017", "02/01/2017", "03/01/2017", "04/01/2017", "05/01/2017", "06/01/2017"]
25. Reducing shuffle | Explode Group
# user_id visits_array
1 2121123 28/01/2018
2 2121123 29/01/2018
3 2121123 01/01/2019
4 2323434 01/11/2017
5 2323434 01/12/2017
6 9959594 01/01/2017
7 9959594 02/01/2017
8 9959594 03/01/2017
9 9959594 04/01/2017
10 9959594 05/01/2017
11 9959594 06/01/2017
26. Reducing shuffle | Explode Group
// Count the visits inside each row with a UDF instead of exploding the array and grouping
val countVisitsUDF = (array: Seq[String]) => {
  array.length
}
spark.udf.register("countVisits", countVisitsUDF)
spark.sql("""SELECT user_id, countVisits(arr)
             FROM tab""").show
27. Reducing shuffle | Explode Group
// Count array elements with the built-in aggregate higher-order function (Spark 2.4+)
// A multi-line SQL string needs triple quotes in Scala
spark.sql("""SELECT user_id,
                    sum(aggregate(arr, 0, (acc, x) -> acc + 1)) summary
             FROM tab
             GROUP BY user_id""").show
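The difference between the two approaches can be sketched without a Spark cluster. The example below uses plain Scala collections as a stand-in for DataFrames, with the sample rows from the earlier table; `ExplodeGroup`, `Visits`, and both function names are hypothetical, and the per-row version assumes one row per user_id as in that table.

```scala
// Sketch with plain Scala collections standing in for Spark DataFrames.
// All names here are hypothetical and used only for illustration.
object ExplodeGroup {
  final case class Visits(userId: Long, visitsArray: Seq[String])

  val rows: Seq[Visits] = Seq(
    Visits(2121123L, Seq("28/01/2018", "29/01/2018", "01/01/2019")),
    Visits(2323434L, Seq("01/11/2017", "01/12/2017")),
    Visits(9959594L, Seq("01/01/2017", "02/01/2017", "03/01/2017",
                         "04/01/2017", "05/01/2017", "06/01/2017"))
  )

  // Explode then group: one record per visit, then regroup by user.
  // In Spark the regrouping step is what triggers a shuffle.
  def countByExplodeGroup(data: Seq[Visits]): Map[Long, Int] =
    data
      .flatMap(v => v.visitsArray.map(visit => (v.userId, visit)))
      .groupBy(_._1)
      .map { case (userId, pairs) => userId -> pairs.size }

  // Per-row count: work inside each array, no regrouping needed.
  def countPerRow(data: Seq[Visits]): Map[Long, Int] =
    data.map(v => v.userId -> v.visitsArray.length).toMap

  def main(args: Array[String]): Unit = {
    println(countByExplodeGroup(rows))
    println(countPerRow(rows))
  }
}
```

Both functions return the same counts for this data; the per-row version never creates the intermediate one-record-per-visit collection.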
28. Reducing shuffle | Explode Group
30. Sizing Executors
32. Sizing Executors Example
spark-submit --executor-cores 15 --executor-memory 90G
Cores   Memory (GB)
15      90
2       12
3       18
4       24
5       30
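Every row in the table works out to 6 GB of memory per core. A minimal sketch of that arithmetic follows; the `ExecutorSizing` object, its types, and the two node sizes are hypothetical, and the fit calculation ignores operating system and YARN overhead.

```scala
// Sketch: keep the memory-per-core ratio while shrinking the executor.
// Node sizes are hypothetical, not real EC2 instance specifications.
object ExecutorSizing {
  final case class Executor(cores: Int, memoryGb: Int) {
    def gbPerCore: Double = memoryGb.toDouble / cores
  }
  final case class Node(name: String, cores: Int, memoryGb: Int)

  // Shrink an executor to `cores` cores while keeping its memory-per-core ratio.
  def scaled(base: Executor, cores: Int): Executor =
    Executor(cores, math.round(base.gbPerCore * cores).toInt)

  // How many such executors fit on one node (ignores OS and YARN overhead).
  def executorsPerNode(e: Executor, n: Node): Int =
    math.min(n.cores / e.cores, n.memoryGb / e.memoryGb)

  def main(args: Array[String]): Unit = {
    val base  = Executor(cores = 15, memoryGb = 90) // values from the slide
    val nodes = Seq(Node("hypothetical-8-core", 8, 48),
                    Node("hypothetical-16-core", 16, 96))
    for (e <- base +: (2 to 5).map(scaled(base, _)); n <- nodes)
      println(s"${e.cores} cores / ${e.memoryGb} GB on ${n.name}: " +
        s"${executorsPerNode(e, n)} executor(s)")
  }
}
```

With these assumed node sizes the 15-core executor fits only on the larger node, while the smaller executors fit on both, which gives the cluster more instance types to choose from.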
34. EMR Instance Fleets
• Specify target capacity as a mix of instance types and families (up to 5)
• Amazon EMR will attempt to fulfill capacity from the most suitable pools
• Amazon EMR automatically replaces interrupted or failed instances with one of the instance types that you specified
35. EMR Instance Fleets: Choosing instances
36. Instance Fleets: Spot Block
• Spot Instances with a specified uninterrupted duration (1-6 hours)
• Ideal for jobs that take a known time to complete and must meet an SLA
• Lower discount than regular Spot Instances
38. Running Spark Jobs on Amazon
EMR With Spot Instances
Eyal Lanxner
CTO and Co-Founder
Anatoli Atamanov
VP Operations & IT
40. Problem Complexity (Example):
Pricing Optimization
Considerations / Speed
Cost structure
Marketplace fees
Product attributes & rating
Product goals & constraints
Competing listings & products
Competitive pricing
Marketplace ranking
Orders & sales
…
43. New EMR Architecture
Apache Airflow
Transient dedicated clusters:
MASTER x1 (On-Demand)
• m4.xlarge
CORE x10 (Spot Instances, EMR Instance Fleets)
• m4.4xlarge
• r4.4xlarge
• r3.4xlarge
S3 Datalake
Job 1 Job 2 Job N
46. Thank You!
Get in touch with us at info@feedvisor.com
We are hiring!
Apply on https://feedvisor.com/about/careers/tel-aviv/
eyal.lanxner@feedvisor.com
anatoli.atamanov@feedvisor.com
47. Thank you!
Ran Sheinberg
Specialist SA – EC2 Spot
Amazon Web Services
Eyal Lanxner
Chief Technology Officer
Feedvisor
Daniel Haviv
Specialist SA - Analytics
Amazon Web Services
Anatoli Atamanov
VP Operations & IT
Feedvisor
Please complete the survey: http://bit.ly/2SAOf2t
Blog post: http://bit.ly/EMRSparkSpot