DataFrame is an awesome interface for data manipulation in Spark, but when the complexity grows beyond what Spark itself can handle, you need to resort to "violence". In this talk I will walk through one of our projects that became too complex to execute with the DataFrame API and had to be rewritten as custom code applied through the mapPartitions function. We will cover tips and tricks for reducing lineage complexity, share our process of analyzing pain points, and get into the details of mapPartitions, showing how to leverage Spark's distributed processing capabilities and reliability while executing custom code.
2. When DataFrames fail,
resort to mapPartitions
aka “What to do when Spark can’t handle it”
Berlin Buzzwords, June 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
7. Logically correct solution
● Data analysis process (data exploration)
● Get to the correct output ASAP
● Validate result
● Run on a subset if possible (a sampling sketch follows this list)
○ Quick iterations
○ Helps eliminate data issues early
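One cheap way to get such a subset is to sample. A minimal sketch, assuming a DataFrame df is already loaded:
// Work on a small, reproducible sample while iterating on the logic
val subset = df.sample(withReplacement = false, fraction = 0.01, seed = 42L)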
8. Correct solution
● Correct result
● Repeatable performance
● Scalable and optimized
● Proper data handling
10. spark.read.jdbc
What in the world?
Specify columnName so rows are distributed evenly across partitions
numPartitions defines the read parallelism
Query the source for the lower and upper bounds first (a sketch follows the example)
// Reads some_table in 100 partitions, splitting some_column into
// ranges between the lower and upper bounds
val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "some_table",
  columnName = "some_column",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,
  connectionProperties = connectionProperties)
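The bounds don't have to be hardcoded; they can be queried from the source first. A minimal sketch, reusing jdbcUrl and connectionProperties from above (the subquery alias and the getters are assumptions, adjust them to the column's actual type):
// Fetch MIN/MAX of the partition column in a single-partition read
val bounds = spark.read.jdbc(
  url = jdbcUrl,
  table = "(SELECT MIN(some_column) AS lo, MAX(some_column) AS hi FROM some_table) AS b",
  properties = connectionProperties)
  .first()
val lowerBound = bounds.getLong(0)
val upperBound = bounds.getLong(1)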
12. (Re)partitioning
Partition by a join field
Repartition is expensive but can help down the road
coalesce vs repartition (coalesce avoids a full shuffle; see the sketch after this list)
HashPartitioner
Number of partitions divisible by available cores
(usually from 2 to 10 x cores)
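A minimal sketch of the HashPartitioner approach, assuming two pair RDDs keyed by the join field (leftRdd and rightRdd are hypothetical):
import org.apache.spark.HashPartitioner

// Partition both sides identically: matching keys land in matching
// partitions, so the subsequent join avoids an extra shuffle
val partitioner = new HashPartitioner(200)
val left = leftRdd.partitionBy(partitioner)
val right = rightRdd.partitionBy(partitioner)
val joined = left.join(right)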
13. Data locality
PROCESS_LOCAL - data is in the same JVM
NODE_LOCAL - data is on the same node (HDFS, local executor)
RACK_LOCAL - data is on a different node, but in the same rack
ANY - data is elsewhere on the network, not in the same rack
NO_PREF - no locality preference
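Locality is a scheduling preference, not a guarantee: the scheduler waits for a slot at a better level before falling back. A minimal sketch of tuning that wait (the 1s value is an assumption for illustration, not a recommendation from the talk):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-example")
  // How long to wait for a more local slot before falling back
  // to the next locality level (default is 3s)
  .config("spark.locality.wait", "1s")
  .getOrCreate()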
17. Broadcasting variables
Keeps a read-only copy of the data on each worker
Reduces the size of a serialized task, since the data is already on the worker
Optimizes join operations, as sketched below
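A minimal sketch of both uses, assuming a small lookup Map and a small DataFrame smallDf (all names are hypothetical):
// Broadcast variable: ship the lookup to each executor once
// instead of serializing it into every task
val lookup = spark.sparkContext.broadcast(Map(1 -> "a", 2 -> "b"))
val labeled = rdd.map(id => (id, lookup.value.getOrElse(id, "unknown")))

// Broadcast join hint: copies smallDf to every executor so the
// large side never gets shuffled
import org.apache.spark.sql.functions.broadcast
val joined = largeDf.join(broadcast(smallDf), "some_column")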
25. Let’s use it
Spark is a distributed computing framework
Spark provides resource management
Spark provides controllable work distribution
Spark is highly configurable
Why not?
26. MapPartitions function
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
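The Iterator-to-Iterator shape is the point: per-partition setup runs once, not once per record. A minimal sketch, where createConnection and enrich are hypothetical helpers:
val enriched = rdd.mapPartitions { iterator =>
  // Pay the setup cost once per partition instead of once per record
  val connection = createConnection()
  // Materialize before closing the connection: iterators are lazy
  val result = iterator.map(record => enrich(connection, record)).toList
  connection.close()
  result.iterator
}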
27. MapPartitions function - indexed
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
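A minimal sketch of the indexed variant, tagging each element with the partition it came from (rdd is hypothetical):
val tagged = rdd.mapPartitionsWithIndex { (index, iterator) =>
  iterator.map(value => (index, value))
}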
28. MapPartitions example
// Args stands in for whatever the processing needs
def mapFunction(args: Args): Iterator[Row] => Iterator[Row] = iterator => {
  // Processing based on the partition data
  // Materialize once: an Iterator can only be consumed a single time
  val rows = iterator.toList
  // Example:
  // val ids = rows.map(row => row.getInt(0))
  // Processor.doWork(ids)
  rows.iterator
}
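A minimal sketch of wiring it in, assuming df and args from earlier; the output RDD is turned back into a DataFrame with the original schema:
val processed = df.rdd.mapPartitions(mapFunction(args))
val result = spark.createDataFrame(processed, df.schema)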
31. Results
PROs
Processing time dropped from ~96 hours to ~3.5 hours (~28x)
Adding new entities doesn't significantly impact complexity/performance
Additional control, which we leveraged down the road
We managed to use all the resources
CONs
We managed to use all the resources
Raised code complexity