SlideShare a Scribd company logo
1 of 37
Download to read offline
www.twosigma.com
Improving Python and Spark
Performance and Interoperability
February 9, 2017All Rights Reserved
Wes McKinney @wesmckinn
Spark Summit East 2017
February 9, 2017
Me
February 9, 2017
•  Currently: Software Architect at Two Sigma Investments
•  Creator of Python pandas project
•  PMC member for Apache Arrow and Apache Parquet
•  Other Python projects: Ibis, Feather, statsmodels
•  Formerly: Cloudera, DataPad, AQR
•  Author of Python for Data Analysis
All Rights Reserved 2
Important Legal Information
The information presented here is offered for informational purposes only and should not be
used for any other purpose (including, without limitation, the making of investment decisions).
Examples provided herein are for illustrative purposes only and are not necessarily based on
actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any
security or other interest; tax advice; or investment advice. This presentation shall remain the
property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to
require the return of this presentation at any time.
Some of the images, logos or other material used herein may be protected by copyright and/or
trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that
created the material and are used purely for identification and comment as fair use under
international copyright and/or trademark laws. Use of such image, copyright or trademark does
not imply any association with such organization (or endorsement of such organization) by Two
Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
This talk
4February 9, 2017
•  Why some parts of PySpark are “slow”
•  Technology that can help make things faster
•  Work we have done to make improvements
•  Future roadmap
All Rights Reserved
Python and Spark
February 9, 2017
•  Spark is implemented in Scala, runs on the Java virtual machine (JVM)
•  Spark has Python and R APIs with partial or full coverage for many parts of
the Scala Spark API
•  In some Spark tasks, Python is only a scripting front-end.
•  This means no interpreted Python code is executed once the Spark
job starts
•  Other PySpark jobs suffer performance and interoperability issues that we’re
going to analyze in this talk
All Rights Reserved 5
Spark DataFrame performance
February 9, 2017All Rights Reserved
Source: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
6
Spark DataFrame performance can be misleading
February 9, 2017
•  Spark DataFrames are an example of Python as a DSL / scripting front end
•  Excepting UDFs (.map(…) or sqlContext.registerFunction), no Python code is
evaluated in the Spark job
•  Python API calls create SQL query plans inside the JVM — so Scala and
Python versions are computationally identical
All Rights Reserved 7
Spark DataFrames as deferred DSL
February 9, 2017
	
young	=	users[users.age	<	21]	
young.groupBy(“gender”).count()	
All Rights Reserved 8
Spark DataFrames as deferred DSL
February 9, 2017
SELECT	gender,	COUNT(*)	
FROM	users		
WHERE	age	<	21	
GROUP	BY	1	
All Rights Reserved 9
Spark DataFrames as deferred DSL
February 9, 2017
Aggregation[table]	
		table:	
				Table:	users	
		metrics:	
				count	=	Count[int64]	
						Table:	ref_0	
		by:	
				gender	=	Column[array(string)]	'gender'	from	users	
		predicates:	
				Less[array(boolean)]	
						age	=	Column[array(int32)]	'age'	from	users	
						Literal[int8]	
								21	
All Rights Reserved 10
Where Python code and Spark meet
February 9, 2017
•  Unfortunately, many PySpark jobs cannot be expressed entirely as
DataFrame operations or other built-in Scala constructs
•  Spark-Scala interacts with in-memory Python in key ways:
•  Reading and writing in-memory datasets to/from the Spark driver
•  Evaluating custom Python code (user-defined functions)
All Rights Reserved 11
How PySpark lambda functions work
February 9, 2017
•  The anatomy of
All Rights Reserved
rdd.map(lambda	x:	…	)	
df.withColumn(py_func(...))	
Scala
RDD
Python worker
Python worker
Python worker
Python worker
Python worker
see PythonRDD.scala
12
PySpark lambda performance problems
February 9, 2017
•  See 2016 talk “High Performance Python on Apache Spark”
•  http://www.slideshare.net/wesm/high-performance-python-on-apache-spark
•  Problems
•  Inefficient data movement (serialization / deserialization)
•  Scalar computation model: object boxing and interpreter overhead
•  General summary: PySpark is not currently designed to achieve high
performance in the way that pandas and NumPy are.
All Rights Reserved 13
Other issues with PySpark lambdas
February 9, 2017
•  Computation model unlike what pandas users are used to
•  In dataframe.map(f), the Python function f	only sees one Row at a time
•  A more natural and efficient vectorized API would be:
•  dataframe.map_pandas(lambda	df:	…)	
All Rights Reserved 14
February 9, 2017All Rights Reserved
Apache
Arrow
15
Apache Arrow: Process and Move Data Fast
February 9, 2017
•  New Top-level Apache project as of February 2016
•  Collaboration amongst broad set of OSS projects around shared needs
•  Language-independent columnar data structures
•  Metadata for describing schemas / chunks of data
•  Protocol for moving data between processes with minimal serialization
overhead
All Rights Reserved 16
High performance data interchange
February 9, 2017All Rights Reserved
Today With Arrow
Source: Apache Arrow
17
What does Apache Arrow give you?
February 9, 2017
•  Zero-copy columnar data: Complex table and array data structures that can
reference memory without copying it
•  Ultrafast messaging: Language-agnostic metadata, batch/file-based and
streaming binary formats
•  Complex schema support: Flat and nested data types
•  C++, Python, and Java Implementations: with integration tests
All Rights Reserved 18
Arrow binary wire formats
February 9, 2017All Rights Reserved 19
Extreme performance to pandas from Arrow streams
February 9, 2017All Rights Reserved 20
PyArrow file and streaming API
February 9, 2017
from	pyarrow	import	StreamReader	
	
reader	=	StreamReader(stream)	
	
#	pyarrow.Table	
table	=	reader.read_all()	
	
#	Convert	to	pandas	
df	=	table.to_pandas()	
All Rights Reserved 21
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Making DataFrame.toPandas faster
February 9, 2017
•  Background
•  Spark’s toPandas transfers in-memory from the Spark driver to Python and
converts it to a pandas.DataFrame. It is very slow
•  Joint work with Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen
(IBM). See SPARK-13534 on JIRA
•  Test case: transfer 128MB Parquet file with 8 DOUBLE columns
All Rights Reserved 22
February 9, 2017All Rights Reserved
conda	install	pyarrow	-c	conda-forge	
23
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Making DataFrame.toPandas faster
February 9, 2017All Rights Reserved
df	=	sqlContext.read.parquet('example2.parquet')	
df	=	df.cache()	
df.count()	
Then
%%prun	-s	cumulative	
dfs	=	[df.toPandas()	for	i	in	range(5)]	
24
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Making DataFrame.toPandas faster
February 9, 2017All Rights Reserved
	94483943	function	calls	(94478223	primitive	calls)	in	62.492	seconds	
	
			ncalls		tottime		percall		cumtime		percall	filename:lineno(function)	
								5				1.458				0.292			62.492			12.498	dataframe.py:1570(toPandas)	
								5				0.661				0.132			54.759			10.952	dataframe.py:382(collect)	
	10485765				0.669				0.000			46.823				0.000	rdd.py:121(_load_from_socket)	
						715				0.002				0.000			46.139				0.065	serializers.py:141(load_stream)	
						710				0.002				0.000			45.950				0.065	serializers.py:448(loads)	
	10485760				4.969				0.000			32.853				0.000	types.py:595(fromInternal)	
					1391				0.004				0.000				7.445				0.005	socket.py:562(readinto)	
							18				0.000				0.000				7.283				0.405	java_gateway.py:1006(send_command)	
								5				0.000				0.000				6.262				1.252	frame.py:943(from_records)	
25
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Making DataFrame.toPandas faster
February 9, 2017All Rights Reserved
Now,	using	pyarrow	
	
%%prun	-s	cumulative	
dfs	=	[df.toPandas(useArrow)	for	i	in	range(5)]	
26
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Making DataFrame.toPandas faster
February 9, 2017All Rights Reserved
	38585	function	calls	(38535	primitive	calls)	in	9.448	seconds	
	
			ncalls		tottime		percall		cumtime		percall	filename:lineno(function)	
								5				0.001				0.000				9.448				1.890	dataframe.py:1570(toPandas)	
								5				0.000				0.000				9.358				1.872	dataframe.py:394(collectAsArrow)	
					6271				9.330				0.001				9.330				0.001	{method	'recv_into'	of	'_socket.socket‘}	
							15				0.000				0.000				9.229				0.615	java_gateway.py:860(send_command)	
							10				0.000				0.000				0.123				0.012	serializers.py:141(load_stream)	
								5				0.085				0.017				0.089				0.018	{method	'to_pandas'	of	'pyarrow.Table‘}	
27
pip	install	memory_profiler	
February 9, 2017All Rights Reserved
%%memit	-i	0.0001	
pdf	=	None	
pdf	=	df.toPandas()	
gc.collect()	
	
peak	memory:	1223.16	MiB,		
increment:	1018.20	MiB	
	
	
28
Plot thickens: memory use
February 9, 2017All Rights Reserved
%%memit	-i	0.0001	
pdf	=	None	
pdf	=	df.toPandas(useArrow=True)	
gc.collect()	
	
peak	memory:	334.08	MiB,		
increment:	258.31	MiB	
	
29
Summary of results
February 9, 2017
•  Current version: average 12.5s (10.2 MB/s)
•  Deseralization accounts for 88% of time; the rest is waiting for Spark to
send the data
•  Peak memory use 8x (~1GB) the size of the dataset
•  Arrow version
•  Average wall clock time of 1.89s (6.61x faster, 67.7 MB/s)
•  Deserialization accounts for 1% of total time
•  Peak memory use 2x the size of the dataset (1 memory doubling)
•  Time for Spark to send data 25% higher (1866ms vs 1488 ms)
All Rights Reserved 30
Aside: reading Parquet directly in Python
February 9, 2017
import	pyarrow.parquet	as	pq	
	
%%timeit		
df	=	pq.read_table(‘example2.parquet’).to_pandas()	
10	loops,	best	of	3:	175	ms	per	loop	
	
	
All Rights Reserved 31
Digging deeper
February 9, 2017
•  Why does it take Spark ~1.8 seconds to send 128MB of data over the wire?
val	collectedRows	=	queryExecution.executedPlan.executeCollect()	
cnvtr.internalRowsToPayload(collectedRows,	this.schema)	
All Rights Reserved
Array[InternalRow]	
32
Digging deeper
February 9, 2017
•  In our 128MB test case, on average:
•  75% of time is being spent collecting Array[InternalRow]	from the task
executors
•  25% of the time is spent on a single-threaded conversion of all the data
from Array[InternalRow] to ArrowRecordBatch	
•  We can go much faster by performing the Spark SQL -> Arrow
conversion locally on the task executors, then streaming the batches to
Python
All Rights Reserved 33
Future architecture
February 9, 2017All Rights Reserved
Task executor
Task executor
Task executor
Task executor
Arrow RecordBatch
Arrow RecordBatch
Arrow RecordBatch
Arrow RecordBatch
Arrow Schema
Spark driver
Python
34
For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no
guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of
this presentation for important disclosure information.
Hot off the presses
February 9, 2017All Rights Reserved
			ncalls		tottime		percall		cumtime		percall	filename:lineno(function)	
								5				0.000				0.000				5.928				1.186	dataframe.py:1570(toPandas)	
								5				0.000				0.000				5.838				1.168	dataframe.py:394(collectAsArrow)	
					5919				0.005				0.000				5.824				0.001	socket.py:561(readinto)	
					5919				5.809				0.001				5.809				0.001	{method	'recv_into'	of	'_socket.socket‘}	
...	
								5				0.086				0.017				0.091				0.018	{method	'to_pandas'	of	'pyarrow.Table‘}	
35
Patch from February 8: 38% perf improvement
The work ahead
February 9, 2017
•  Luckily, speeding up toPandas and speeding up Lambda / UDF functions is
architecturally the same type of problem
•  Reasonably clear path to making toPandas even faster
•  How can you get involved?
•  Keep an eye on Spark ASF JIRA
•  Contribute to Apache Arrow (Java, C++, Python, other languages)
•  Join the Arrow and Spark mailing lists
All Rights Reserved 36
Thank you
February 9, 2017
•  Bryan Cutler, Li Jin, and Yin Xusen, for building the integration Spark-Arrow
integration
•  Apache Arrow community
•  Spark Summit organizers
•  Two Sigma and IBM, for supporting this work
All Rights Reserved 37

More Related Content

What's hot

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...Spark Summit
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
The Oracle RAC Family of Solutions - Presentation
The Oracle RAC Family of Solutions - PresentationThe Oracle RAC Family of Solutions - Presentation
The Oracle RAC Family of Solutions - PresentationMarkus Michalewicz
 

What's hot (20)

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The Oracle RAC Family of Solutions - Presentation
The Oracle RAC Family of Solutions - PresentationThe Oracle RAC Family of Solutions - Presentation
The Oracle RAC Family of Solutions - Presentation
 

Similar to Improving Python and Spark (PySpark) Performance and Interoperability

Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Future of pandas
Future of pandasFuture of pandas
Future of pandasJeff Reback
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff RebackTwo Sigma
 
Improving Pandas and PySpark performance and interoperability with Apache Arrow
Improving Pandas and PySpark performance and interoperability with Apache ArrowImproving Pandas and PySpark performance and interoperability with Apache Arrow
Improving Pandas and PySpark performance and interoperability with Apache ArrowPyData
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowLi Jin
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
 
SQL Data Warehousing in SAP HANA (Sefan Linders)
SQL Data Warehousing in SAP HANA (Sefan Linders)SQL Data Warehousing in SAP HANA (Sefan Linders)
SQL Data Warehousing in SAP HANA (Sefan Linders)Twan van den Broek
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDatabricks
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefitsJohan Picard
 
Introduction to pyspark new
Introduction to pyspark newIntroduction to pyspark new
Introduction to pyspark newAnam Mahmood
 
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Databricks
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
Simplifying AI integration on Apache Spark
Simplifying AI integration on Apache SparkSimplifying AI integration on Apache Spark
Simplifying AI integration on Apache SparkDatabricks
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformDeepak Chandramouli
 
Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer
 
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB
 
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB
 

Similar to Improving Python and Spark (PySpark) Performance and Interoperability (20)

Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Future of pandas
Future of pandasFuture of pandas
Future of pandas
 
Future of Pandas - Jeff Reback
Future of Pandas - Jeff RebackFuture of Pandas - Jeff Reback
Future of Pandas - Jeff Reback
 
Improving Pandas and PySpark performance and interoperability with Apache Arrow
Improving Pandas and PySpark performance and interoperability with Apache ArrowImproving Pandas and PySpark performance and interoperability with Apache Arrow
Improving Pandas and PySpark performance and interoperability with Apache Arrow
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
SQL Data Warehousing in SAP HANA (Sefan Linders)
SQL Data Warehousing in SAP HANA (Sefan Linders)SQL Data Warehousing in SAP HANA (Sefan Linders)
SQL Data Warehousing in SAP HANA (Sefan Linders)
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefits
 
Introduction to pyspark new
Introduction to pyspark newIntroduction to pyspark new
Introduction to pyspark new
 
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
Simplifying AI integration on Apache Spark
Simplifying AI integration on Apache SparkSimplifying AI integration on Apache Spark
Simplifying AI integration on Apache Spark
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
 
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
 
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Recently uploaded

Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Recently uploaded (20)

Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Improving Python and Spark (PySpark) Performance and Interoperability

  • 1. www.twosigma.com Improving Python and Spark Performance and Interoperability February 9, 2017All Rights Reserved Wes McKinney @wesmckinn Spark Summit East 2017 February 9, 2017
  • 2. Me February 9, 2017 •  Currently: Software Architect at Two Sigma Investments •  Creator of Python pandas project •  PMC member for Apache Arrow and Apache Parquet •  Other Python projects: Ibis, Feather, statsmodels •  Formerly: Cloudera, DataPad, AQR •  Author of Python for Data Analysis All Rights Reserved 2
  • 3. Important Legal Information The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  • 4. This talk 4February 9, 2017 •  Why some parts of PySpark are “slow” •  Technology that can help make things faster •  Work we have done to make improvements •  Future roadmap All Rights Reserved
  • 5. Python and Spark February 9, 2017 •  Spark is implemented in Scala, runs on the Java virtual machine (JVM) •  Spark has Python and R APIs with partial or full coverage for many parts of the Scala Spark API •  In some Spark tasks, Python is only a scripting front-end. •  This means no interpreted Python code is executed once the Spark job starts •  Other PySpark jobs suffer performance and interoperability issues that we’re going to analyze in this talk All Rights Reserved 5
  • 6. Spark DataFrame performance February 9, 2017All Rights Reserved Source: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html 6
  • 7. Spark DataFrame performance can be misleading February 9, 2017 •  Spark DataFrames are an example of Python as a DSL / scripting front end •  Excepting UDFs (.map(…) or sqlContext.registerFunction), no Python code is evaluated in the Spark job •  Python API calls create SQL query plans inside the JVM — so Scala and Python versions are computationally identical All Rights Reserved 7
  • 8. Spark DataFrames as deferred DSL February 9, 2017 young = users[users.age < 21] young.groupBy(“gender”).count() All Rights Reserved 8
  • 9. Spark DataFrames as deferred DSL February 9, 2017 SELECT gender, COUNT(*) FROM users WHERE age < 21 GROUP BY 1 All Rights Reserved 9
  • 10. Spark DataFrames as deferred DSL February 9, 2017 Aggregation[table] table: Table: users metrics: count = Count[int64] Table: ref_0 by: gender = Column[array(string)] 'gender' from users predicates: Less[array(boolean)] age = Column[array(int32)] 'age' from users Literal[int8] 21 All Rights Reserved 10
  • 11. Where Python code and Spark meet February 9, 2017 •  Unfortunately, many PySpark jobs cannot be expressed entirely as DataFrame operations or other built-in Scala constructs •  Spark-Scala interacts with in-memory Python in key ways: •  Reading and writing in-memory datasets to/from the Spark driver •  Evaluating custom Python code (user-defined functions) All Rights Reserved 11
  • 12. How PySpark lambda functions work February 9, 2017 •  The anatomy of All Rights Reserved rdd.map(lambda x: … ) df.withColumn(py_func(...)) Scala RDD Python worker Python worker Python worker Python worker Python worker see PythonRDD.scala 12
  • 13. PySpark lambda performance problems February 9, 2017 •  See 2016 talk “High Performance Python on Apache Spark” •  http://www.slideshare.net/wesm/high-performance-python-on-apache-spark •  Problems •  Inefficient data movement (serialization / deserialization) •  Scalar computation model: object boxing and interpreter overhead •  General summary: PySpark is not currently designed to achieve high performance in the way that pandas and NumPy are. All Rights Reserved 13
  • 14. Other issues with PySpark lambdas February 9, 2017 •  Computation model unlike what pandas users are used to •  In dataframe.map(f), the Python function f only sees one Row at a time •  A more natural and efficient vectorized API would be: •  dataframe.map_pandas(lambda df: …) All Rights Reserved 14
  • 15. February 9, 2017All Rights Reserved Apache Arrow 15
  • 16. Apache Arrow: Process and Move Data Fast February 9, 2017 •  New Top-level Apache project as of February 2016 •  Collaboration amongst broad set of OSS projects around shared needs •  Language-independent columnar data structures •  Metadata for describing schemas / chunks of data •  Protocol for moving data between processes with minimal serialization overhead All Rights Reserved 16
  • 17. High performance data interchange February 9, 2017All Rights Reserved Today With Arrow Source: Apache Arrow 17
  • 18. What does Apache Arrow give you? February 9, 2017 •  Zero-copy columnar data: Complex table and array data structures that can reference memory without copying it •  Ultrafast messaging: Language-agnostic metadata, batch/file-based and streaming binary formats •  Complex schema support: Flat and nested data types •  C++, Python, and Java Implementations: with integration tests All Rights Reserved 18
  • 19. Arrow binary wire formats February 9, 2017All Rights Reserved 19
  • 20. Extreme performance to pandas from Arrow streams February 9, 2017All Rights Reserved 20
  • 21. PyArrow file and streaming API February 9, 2017 from pyarrow import StreamReader reader = StreamReader(stream) # pyarrow.Table table = reader.read_all() # Convert to pandas df = table.to_pandas() All Rights Reserved 21
  • 22. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Making DataFrame.toPandas faster February 9, 2017 •  Background •  Spark’s toPandas transfers in-memory from the Spark driver to Python and converts it to a pandas.DataFrame. It is very slow •  Joint work with Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM). See SPARK-13534 on JIRA •  Test case: transfer 128MB Parquet file with 8 DOUBLE columns All Rights Reserved 22
  • 23. February 9, 2017All Rights Reserved conda install pyarrow -c conda-forge 23
  • 24. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Making DataFrame.toPandas faster February 9, 2017All Rights Reserved df = sqlContext.read.parquet('example2.parquet') df = df.cache() df.count() Then %%prun -s cumulative dfs = [df.toPandas() for i in range(5)] 24
  • 25. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Making DataFrame.toPandas faster February 9, 2017All Rights Reserved 94483943 function calls (94478223 primitive calls) in 62.492 seconds ncalls tottime percall cumtime percall filename:lineno(function) 5 1.458 0.292 62.492 12.498 dataframe.py:1570(toPandas) 5 0.661 0.132 54.759 10.952 dataframe.py:382(collect) 10485765 0.669 0.000 46.823 0.000 rdd.py:121(_load_from_socket) 715 0.002 0.000 46.139 0.065 serializers.py:141(load_stream) 710 0.002 0.000 45.950 0.065 serializers.py:448(loads) 10485760 4.969 0.000 32.853 0.000 types.py:595(fromInternal) 1391 0.004 0.000 7.445 0.005 socket.py:562(readinto) 18 0.000 0.000 7.283 0.405 java_gateway.py:1006(send_command) 5 0.000 0.000 6.262 1.252 frame.py:943(from_records) 25
  • 26. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Making DataFrame.toPandas faster February 9, 2017All Rights Reserved Now, using pyarrow %%prun -s cumulative dfs = [df.toPandas(useArrow) for i in range(5)] 26
  • 27. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Making DataFrame.toPandas faster February 9, 2017All Rights Reserved 38585 function calls (38535 primitive calls) in 9.448 seconds ncalls tottime percall cumtime percall filename:lineno(function) 5 0.001 0.000 9.448 1.890 dataframe.py:1570(toPandas) 5 0.000 0.000 9.358 1.872 dataframe.py:394(collectAsArrow) 6271 9.330 0.001 9.330 0.001 {method 'recv_into' of '_socket.socket‘} 15 0.000 0.000 9.229 0.615 java_gateway.py:860(send_command) 10 0.000 0.000 0.123 0.012 serializers.py:141(load_stream) 5 0.085 0.017 0.089 0.018 {method 'to_pandas' of 'pyarrow.Table‘} 27
  • 28. pip install memory_profiler February 9, 2017All Rights Reserved %%memit -i 0.0001 pdf = None pdf = df.toPandas() gc.collect() peak memory: 1223.16 MiB, increment: 1018.20 MiB 28
  • 29. Plot thickens: memory use February 9, 2017All Rights Reserved %%memit -i 0.0001 pdf = None pdf = df.toPandas(useArrow=True) gc.collect() peak memory: 334.08 MiB, increment: 258.31 MiB 29
  • 30. Summary of results February 9, 2017 •  Current version: average 12.5s (10.2 MB/s) •  Deseralization accounts for 88% of time; the rest is waiting for Spark to send the data •  Peak memory use 8x (~1GB) the size of the dataset •  Arrow version •  Average wall clock time of 1.89s (6.61x faster, 67.7 MB/s) •  Deserialization accounts for 1% of total time •  Peak memory use 2x the size of the dataset (1 memory doubling) •  Time for Spark to send data 25% higher (1866ms vs 1488 ms) All Rights Reserved 30
  • 31. Aside: reading Parquet directly in Python February 9, 2017 import pyarrow.parquet as pq %%timeit df = pq.read_table(‘example2.parquet’).to_pandas() 10 loops, best of 3: 175 ms per loop All Rights Reserved 31
  • 32. Digging deeper February 9, 2017 •  Why does it take Spark ~1.8 seconds to send 128MB of data over the wire? val collectedRows = queryExecution.executedPlan.executeCollect() cnvtr.internalRowsToPayload(collectedRows, this.schema) All Rights Reserved Array[InternalRow] 32
  • 33. Digging deeper February 9, 2017 •  In our 128MB test case, on average: •  75% of time is being spent collecting Array[InternalRow] from the task executors •  25% of the time is spent on a single-threaded conversion of all the data from Array[InternalRow] to ArrowRecordBatch •  We can go much faster by performing the Spark SQL -> Arrow conversion locally on the task executors, then streaming the batches to Python All Rights Reserved 33
  • 34. Future architecture February 9, 2017All Rights Reserved Task executor Task executor Task executor Task executor Arrow RecordBatch Arrow RecordBatch Arrow RecordBatch Arrow RecordBatch Arrow Schema Spark driver Python 34
  • 35. For illustration purposes only. Not an offer to buy or sell securities. Two Sigma may modify its investment approach and portfolio parameters in the future in any manner that it believes is consistent with its fiduciary duty to its clients. There is no guarantee that Two Sigma or its products will be successful in achieving any or all of their investment objectives. Moreover, all investments involve some degree of risk, not all of which will be successfully mitigated. Please see the last page of this presentation for important disclosure information. Hot off the presses February 9, 2017All Rights Reserved ncalls tottime percall cumtime percall filename:lineno(function) 5 0.000 0.000 5.928 1.186 dataframe.py:1570(toPandas) 5 0.000 0.000 5.838 1.168 dataframe.py:394(collectAsArrow) 5919 0.005 0.000 5.824 0.001 socket.py:561(readinto) 5919 5.809 0.001 5.809 0.001 {method 'recv_into' of '_socket.socket‘} ... 5 0.086 0.017 0.091 0.018 {method 'to_pandas' of 'pyarrow.Table‘} 35 Patch from February 8: 38% perf improvement
  • 36. The work ahead February 9, 2017 •  Luckily, speeding up toPandas and speeding up Lambda / UDF functions is architecturally the same type of problem •  Reasonably clear path to making toPandas even faster •  How can you get involved? •  Keep an eye on Spark ASF JIRA •  Contribute to Apache Arrow (Java, C++, Python, other languages) •  Join the Arrow and Spark mailing lists All Rights Reserved 36
  • 37. Thank you February 9, 2017 •  Bryan Cutler, Li Jin, and Yin Xusen, for building the integration Spark-Arrow integration •  Apache Arrow community •  Spark Summit organizers •  Two Sigma and IBM, for supporting this work All Rights Reserved 37