Back in the old days of Apache Spark, using Python with Spark was an exercise in patience. Data moved back and forth between Python and the JVM, being serialised and deserialised constantly. Leveraging Spark SQL and avoiding UDFs made things better, as did the steady improvement of the optimisers (Catalyst and Tungsten). But since Spark 2.3, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers. In this talk you will learn how the Spark Scala core communicates with the Python processes, how data is exchanged between both sub-systems, and the development efforts present and underway to make it as fast as possible.
Internals of Speeding up PySpark with Arrow
4. WHOAMI
> Ruben Berenguel (@berenguel)
> PhD in Mathematics
> (big) data consultant
> Lead data engineer using Python, Go and Scala
> Right now at Affectv
#UnifiedDataAnalytics #SparkAISummit
8. What is Pandas?
> Python Data Analysis library
> Used everywhere data and Python appear in job offers
> Efficient (columnar, with a C and Cython backend)
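A minimal sketch (toy data, not from the talk) of what "columnar" buys you in practice: each DataFrame column is backed by a contiguous NumPy array, so whole-column operations run in compiled code rather than in the Python interpreter.

```python
import pandas as pd

# Each column is a contiguous array, so column-wise arithmetic
# is a single vectorised operation, not a per-row Python loop.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorised: one columnar multiplication in the C backend.
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```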
30. WHAT IS ARROW?
> Cross-language in-memory columnar format library
> Optimised for efficiency across languages
> Integrates seamlessly with Pandas
41. Arrow ❤ Pandas
> Arrow uses RecordBatches
> Pandas uses blocks handled by a BlockManager
> You can convert an Arrow Table into a Pandas DataFrame easily
47. WHAT IS SPARK?
> Distributed Computation framework
> Open source
> Easy to use
> Scales horizontally and vertically
95. compute IS RUN ON EACH EXECUTOR AND STARTS A PYTHON WORKER VIA PythonRunner
106. Workers act as standalone processors of streams of data
> Connect back to the JVM that started them
> Load the included Python libraries
> Deserialize the pickled function coming from the stream
> Apply the function to the data coming from the stream
> Send the output back
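The steps above can be sketched with a toy, in-process stand-in for the worker loop. This is an illustration only: it uses plain pickle over byte buffers, whereas the real worker speaks Spark's framed socket protocol via PythonRunner and ships functions with cloudpickle.

```python
import io
import pickle
from operator import neg  # a picklable, importable function to ship

def toy_worker(stream_in, stream_out):
    # Deserialize the pickled function coming from the stream.
    func = pickle.load(stream_in)
    # Apply the function to the data coming from the stream.
    results = [func(item) for item in pickle.load(stream_in)]
    # Send the output back.
    pickle.dump(results, stream_out)

# "JVM" side: serialise the function and a batch of data into the stream.
request = io.BytesIO()
pickle.dump(neg, request)
pickle.dump([1, 2, 3], request)
request.seek(0)

response = io.BytesIO()
toy_worker(request, response)

response.seek(0)
print(pickle.load(response))  # [-1, -2, -3]
```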
112. DEPENDING ON THE FUNCTION, THE OPTIMISER WILL CHOOSE PythonUDFRunner OR PythonArrowRunner (BOTH EXTEND PythonRunner)
123. IF WE CAN DEFINE OUR FUNCTIONS USING PANDAS Series TRANSFORMATIONS WE CAN SPEED UP PYSPARK CODE FROM 3X TO 100X!
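A small illustration of the difference in shape (pure pandas, not PySpark): the same transformation written row-at-a-time, the way a classic UDF executes, versus as a whole-Series operation, the form a vectorised (Pandas) UDF takes over each Arrow-backed batch.

```python
import pandas as pd

s = pd.Series(range(1, 6))

# Row-at-a-time, like a classic PySpark UDF: one Python call per value.
slow = s.apply(lambda x: x + 1)

# Whole-Series transformation, the shape a Pandas UDF takes:
# a single columnar operation, no per-row interpreter overhead.
fast = s + 1

print(fast.tolist())  # [2, 3, 4, 5, 6]
```

Both produce identical results; only the second avoids crossing the Python interpreter once per row, which is where the 3x-100x gap comes from.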
145. RESOURCES
> Spark documentation
> High Performance Spark by Holden Karau
> The Internals of Apache Spark 2.4.2 by Jacek Laskowski
> Spark's GitHub
> Become a contributor
151. ARROW
Arrow's home
Arrow's GitHub
Arrow speed benchmarks
Arrow to Pandas conversion benchmarks
Post: Streaming columnar data with Apache Arrow
Post: Why Pandas users should be excited by Apache Arrow
Code: Arrow-Pandas compatibility layer code
Code: Arrow Table code
PyArrow in-memory data model
Ballista: a POC distributed compute platform (Rust)
PyJava: POC on Java/Scala and Python data interchange with Arrow
152. PANDAS
Pandas' home
Pandas' GitHub
Guide: Idiomatic Pandas
Code: Pandas internals
Design: Pandas internals
Talk: Demystifying Pandas' internals, by Marc Garcia
Memory Layout of Multidimensional Arrays in numpy
153. SPARK/PYSPARK
Code: PySpark serializers
JIRA: First steps to using Arrow (only in the PySpark driver)
Post: Speeding up PySpark with Apache Arrow
Original JIRA issue: Vectorized UDFs in Spark
Initial doc draft
Post by Bryan Cutler (leader for the Vec UDFs PR)
Post: Introducing Pandas UDF for PySpark
Code: org.apache.spark.sql.vectorized
Post by Bryan Cutler: Spark toPandas() with Arrow, a Detailed Look