Apache Flink vs Apache Spark - Reproducible experiments on cloud.


Blog posts:
http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
http://blog.ashansa.org/2016/02/stream-processing-is-becoming-crucial.html
Batch processing: https://github.com/karamel-lab/batch-processing-comparison
Stream processing: https://github.com/karamel-lab/stream-processing-comparison

1. Reproducible distributed experiments on cloud: Spark vs Flink
Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

2. “Reproducing experiments with minimal effort”

3. Spark and Flink
▷ Batch processing vs. stream processing
▷ Micro-batching vs. natural data flow
▷ Good fit for scalable deployment in the cloud

4. Motivation
▷ Validate performance claims
▷ Remove deployment overhead
▷ Design reproducible experiments

5. Karamel: “Framework for reproducible distributed experiments”

6. Benchmark - Batch
▷ Teragen (Hadoop): generates the data
▷ Terasort (Spark, Flink): the benchmarking algorithm
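
As a rough illustration of what a Terasort-style job does, the sketch below sorts keys by sampling split points, range-partitioning the keys, sorting each partition locally, and concatenating the partitions. It is a single-process toy, not the Spark or Flink code used in the experiments; the function names, the sample size, and the partition count are assumptions made for the example.

```python
import bisect
import random


def sample_split_points(keys, num_partitions, sample_size=1000, seed=0):
    """Pick num_partitions - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[i * len(sample) // num_partitions]
            for i in range(1, num_partitions)]


def range_partitioned_sort(keys, num_partitions=4):
    """Sort keys by range-partitioning them, then sorting each partition."""
    if not keys:
        return []
    splits = sample_split_points(keys, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # bisect_right sends each key to the partition whose range contains it.
        partitions[bisect.bisect_right(splits, key)].append(key)
    # Partitions cover disjoint, ordered key ranges, so concatenating the
    # locally sorted partitions gives a globally sorted result.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result
```

In a distributed run each partition would live on a different worker, so the per-partition sorts can proceed in parallel and no final merge is needed.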

7. To make Dongwon Kim’s comparison reproducible.
http://www.slideshare.net/ssuser6bb12d/a-comparative-performance-evaluation-of-apache-flink

8. Our Deployment
▷ 1 NameNode ⇒ master (low processing power)
▷ 2 worker nodes ⇒ slaves (high processing power)

9. Deployment (diagram)
▷ Karamel deploys the cluster from a Karamel config
▷ EC2 master: Hadoop NameNode, Spark Master, Flink JobManager
▷ EC2 slave (x 2): Hadoop DataNode, Spark Worker, Flink TaskManager

10. Configuration

                     Master / NameNode   Slave / Worker
                     (m3.xlarge)         (i2.4xlarge)
CPU (GHz)            2.6                 2.5
No. of vCPUs         4                   16
Memory (GB)          16                  122
Storage: SSD (GB)    80                  1600

11. Experiment
▷ Hadoop MR: Teragen writes the input to HDFS
▷ Spark/Flink: Terasort reads the input from HDFS
▷ Input sizes: 200 / 400 / 600 GB
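
Teragen is sized by row count rather than by bytes, and each row it writes is 100 bytes, so the input sizes above have to be converted before a run. The small calculation below does that conversion. It assumes decimal gigabytes (10^9 bytes); the slides do not say which unit was used, so treat the exact counts as illustrative.

```python
ROW_BYTES = 100  # Teragen writes fixed-size 100-byte records


def teragen_rows(size_gb, bytes_per_gb=10**9):
    """Number of 100-byte rows needed to generate size_gb of input.

    bytes_per_gb defaults to decimal gigabytes, which is an assumption;
    pass 2**30 for binary gigabytes.
    """
    return size_gb * bytes_per_gb // ROW_BYTES


for size_gb in (200, 400, 600):
    print(f"{size_gb} GB -> {teragen_rows(size_gb):,} rows")
```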

12. Results Batch Processing

13. Application Performance

14. Flink 1.5x faster than Spark

15. Mainly because...
▷ Spark: does not overlap stages
▷ Flink: pipelines its stages
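
To see why overlapping stages can shorten a job, here is a toy timing model. It is not a model of Spark or Flink internals: it assumes the data moves in equal-sized chunks, every stage has its own resources, and each stage has a fixed per-chunk cost. All names and numbers are made up for the illustration.

```python
def staged_time(num_chunks, stage_costs):
    """Each stage starts only after the previous one has finished every chunk."""
    return sum(num_chunks * cost for cost in stage_costs)


def pipelined_time(num_chunks, stage_costs):
    """A stage may start on a chunk as soon as the previous stage emits it."""
    finish = [0.0] * len(stage_costs)  # when each stage finished its latest chunk
    for _ in range(num_chunks):
        ready = 0.0  # when the current chunk becomes available to the next stage
        for s, cost in enumerate(stage_costs):
            start = max(ready, finish[s])  # wait for the chunk and for the stage
            finish[s] = start + cost
            ready = finish[s]
    return finish[-1]


# Hypothetical job: 4 chunks through 3 stages that each cost 1 time unit per chunk.
print(staged_time(4, [1, 1, 1]), pipelined_time(4, [1, 1, 1]))
```

In this model the pipelined run is never slower than the staged one. Real jobs see a smaller gain because some operators, a full sort for example, must consume all of their input before emitting anything.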

16. Collectl-Monitor
● Tool used to collect and plot the results
● https://github.com/shelan/collectl-monitoring

17. System Performance - CPU (%)

18. System Performance - Memory (GB)

19. System Performance - Disk (MB/s)

20. System Performance - Network (KB/s)

21. Load Balancing - Workers (CPU %)

22. Load Balancing - Workers (CPU %)

23. Outcome
▷ Performance comparison results
▷ Karamel experiments that reproduce the same results with minimal effort

24. How not to reproduce “our problems”

25. EC2 advertises 800 GB disks, but df (disk free) shows only 30 GB.
If you are using I2 or R3 instances, you should partition the disks and create a file system manually.
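
A quick way to catch this before starting a long run is to check how much space the file system behind a given path really has. The sketch below uses Python's standard shutil.disk_usage; the mount point /mnt/data in the usage note is hypothetical, as are the helper names.

```python
import shutil


def usable_gb(path):
    """Total and free space, in GB, of the file system that holds path."""
    usage = shutil.disk_usage(path)
    return usage.total / 10**9, usage.free / 10**9


def check_mount(path, expected_total_gb):
    """Return True if the file system at path is at least as large as expected."""
    total_gb, _ = usable_gb(path)
    return total_gb >= expected_total_gb
```

For example, check_mount("/mnt/data", 800) returning False would suggest that the instance-store disk has not been formatted and mounted at that path yet.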

26. Large Spark or Flink batch applications can fail because they run out of disk space.
Point the Flink temp directory and the Spark local directory at a partition with at least enough space to store the total input.
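
The helper below renders the two settings involved, as a sketch rather than the configuration used in the experiments. It assumes Spark's spark.local.dir property and, for Flink, the taskmanager.tmp.dirs key used by older releases (newer releases use io.tmp.dirs); check both against the documentation for your versions. The path /mnt/scratch is hypothetical.

```python
def temp_dir_settings(scratch_dir, flink_key="taskmanager.tmp.dirs"):
    """Render the config lines that point Spark and Flink at scratch_dir.

    flink_key is an assumption that depends on the Flink version:
    older releases use taskmanager.tmp.dirs, newer ones io.tmp.dirs.
    """
    return {
        "spark-defaults.conf": f"spark.local.dir {scratch_dir}",
        "flink-conf.yaml": f"{flink_key}: {scratch_dir}",
    }


# Hypothetical scratch partition on the large instance-store disk.
for filename, line in temp_dir_settings("/mnt/scratch").items():
    print(f"{filename}: {line}")
```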

27. Reproducing experiments on EC2 can cost you a lot.
Karamel also supports spot instances, which can reduce the cost by 10x.
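
The effect on a budget is simple arithmetic, sketched below. The hourly prices and the run length are made-up placeholders, not AWS prices or figures from the experiments; only the 10x ratio comes from the slide.

```python
def experiment_cost(nodes, hours, price_per_node_hour):
    """Cost of keeping `nodes` instances running for `hours` hours."""
    return nodes * hours * price_per_node_hour


# Hypothetical prices: on-demand at 3.00 per node-hour, spot at a tenth of that.
on_demand = experiment_cost(nodes=2, hours=5, price_per_node_hour=3.00)
spot = experiment_cost(nodes=2, hours=5, price_per_node_hour=0.30)
print(f"on-demand: {on_demand:.2f}, spot: {spot:.2f}")
```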

28. IncompatibleClassChangeError when running StreamBench built for MR2 on Hadoop 2.x.
There were no explicitly defined dependencies on earlier Hadoop versions, but one of the dependencies (Mahout) had internal references to a Hadoop 1.x jar.

29. Summary
▷ Introducing reproducible experiments on cloud
▷ Performance comparison of Spark and Flink
▷ Reproducible experiments are available online (https://github.com/karamel-lab)

30. Thanks ..!!