Koalas is an open source project that provides pandas APIs on top of Apache Spark. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data. Koalas fills the gap by providing pandas-equivalent APIs that work on Apache Spark.
Many other libraries also try to scale pandas APIs, such as Vaex and Modin. Dask is one of them and is very popular among pandas users; it runs on its own cluster, much as Koalas runs on top of a Spark cluster. In this talk, we introduce Koalas and its current status, and compare Koalas with Dask, including benchmarks.
5. What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundreds of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- Be more productive with pandas-like functions
6. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into Python data science ecosystem
- NumPy
- Matplotlib
- scikit-learn
Stack Overflow Trends
7. Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark for Python; APIs are also available for Scala/Java, R, and SQL
9. Koalas DataFrame and PySpark DataFrame
Koalas DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- Translate pandas APIs into a logical plan of Spark SQL
- The plan will be optimized and executed by Spark SQL engine
PySpark DataFrame
- More compliant with the relations/tables in relational databases
- Does not have unique row identifiers
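The idea of translating API calls into a plan that runs later can be sketched in a few lines. This is a toy model in plain Python, not Koalas or Spark SQL internals: `LazyFrame`, its `plan` list, and the sample rows are all hypothetical names and data chosen for illustration. Calling `filter` only records a step; the work happens when `mean` is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LazyFrame:
    """Hypothetical toy frame that records operations instead of running them."""
    rows: List[dict]
    plan: List[Tuple[str, str, Callable]] = field(default_factory=list)

    def filter(self, column: str, predicate: Callable) -> "LazyFrame":
        # Record the step in the plan; no rows are touched yet.
        return LazyFrame(self.rows, self.plan + [("filter", column, predicate)])

    def mean(self, column: str) -> float:
        # The "action": only now is the recorded plan executed.
        data = self.rows
        for op, col, predicate in self.plan:
            if op == "filter":
                data = [r for r in data if predicate(r[col])]
        values = [r[column] for r in data]
        return sum(values) / len(values)


# Made-up sample rows, reusing the column names from the talk's pseudocode.
rows = [
    {"tip_amt": 0.5, "fare_amt": 10.0},
    {"tip_amt": 2.0, "fare_amt": 20.0},
    {"tip_amt": 4.0, "fare_amt": 30.0},
]

lazy = LazyFrame(rows).filter("tip_amt", lambda t: 1 <= t <= 5)
recorded_steps = [step[0] for step in lazy.plan]
result = lazy.mean("fare_amt")
```

Because the plan exists as data before anything runs, an engine has the chance to inspect and rewrite it first, which is the role the slide assigns to the Spark SQL engine.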
20. Introduction of Dask
• A parallel computing framework
• Written in pure Python
• Using blocked algorithms and task scheduling
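A blocked algorithm splits a large computation into per-block pieces whose partial results are combined at the end. The sketch below shows the pattern for a mean in plain Python; `blocked_mean` and `block_size` are hypothetical names for this illustration, and the blocks are processed sequentially here rather than by a scheduler.

```python
def blocked_mean(values, block_size):
    """Compute a mean by combining per-block partial sums and counts."""
    # Split the input into fixed-size blocks.
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # Each partial result depends only on its own block, so these
    # could be computed independently (for example, in parallel).
    partials = [(sum(block), len(block)) for block in blocks]
    # Combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


block_result = blocked_mean(list(range(10)), block_size=3)
```

The per-block step and the combine step are the units a task scheduler would place on workers.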
24. Dask is different from Koalas
Execution engine
- Koalas: Apache Spark, a unified analytics engine for large-scale data processing
- Dask: Dask, a graph execution engine
Aim
- Koalas: A single codebase that works with both pandas and Spark
- Dask: Scale pandas workflow
Abstraction
- Koalas: Query plan
- Dask: Task graph and task scheduler
Collections
- Koalas: DataFrame
- Dask: Array, DataFrame, Bag
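To make the "task graph" abstraction concrete, here is a minimal sketch in plain Python. It borrows the dictionary layout Dask documents for its graphs (keys mapped to either literal values or tuples of a callable followed by its arguments), but the `get` function below is a simplified, hypothetical stand-in written for this example, not Dask's scheduler.

```python
from operator import add, mul


def get(graph, key):
    """Resolve one key of a task graph by recursively resolving its inputs."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys are computed first; others pass through.
        resolved = [
            get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # a literal value


# Hypothetical graph: result = (x + y) * 10
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", 10),
}

graph_result = get(graph, "result")
```

A real scheduler would also decide ordering, parallelism, and placement of tasks; this sketch only shows how the graph encodes dependencies.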
25. Benchmark setup - Methodology
• Dataset
157 GB Yellow Taxi Trip Records (2009 - 2013)
• Operations
Basic statistical calculations
Joins
Grouping
• Operations were applied to
The whole dataset
Filtered data (36% of the whole dataset)
Cached filtered data (36% of the whole dataset)
The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
26. Benchmark setup - Environment
• Local execution
A single i3.16xlarge VM:
(488 GB memory | 64 cores | 25 Gigabit Ethernet)
• Distributed execution
1 driver node, 3 worker nodes
Each node is an i3.4xlarge VM:
(122 GB memory | 16 cores | 10 Gigabit Ethernet)
27. Benchmark results - Overview
Koalas outperformed Dask:
                         Geometric Mean    Simple Average
Local execution          2.1x              4x
Distributed execution    4.6x              7.9x
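The two summary columns can differ a lot because a simple average is pulled up by a few large speedups, while a geometric mean is not. The sketch below uses made-up speedup values, not the benchmark's per-operation results, purely to show the arithmetic.

```python
from statistics import geometric_mean, mean

# Hypothetical per-operation speedup ratios; NOT the benchmark's data.
speedups = [1.0, 2.0, 4.0, 8.0]

simple_average = mean(speedups)           # (1 + 2 + 4 + 8) / 4
geo_mean = geometric_mean(speedups)       # (1 * 2 * 4 * 8) ** (1 / 4)
```

For ratios such as speedups, the geometric mean is the more conservative summary, which is consistent with it being the smaller number in both rows of the table.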
28. Benchmark results - On the whole dataset
Local execution: Koalas is ~1.2x faster
Distributed execution: Koalas is ~2x faster
29. Benchmark results - On the filtered data
Local execution: Koalas is ~6x faster
Distributed execution: Koalas is ~9x faster
30. Benchmark results - On the cached filtered data
Local execution: Koalas is ~1.4x faster
Distributed execution: Koalas is ~5x faster
31. Why is Koalas fast?
● Query plan optimization by Catalyst
● Whole-stage code generation
33. Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst optimization
• After the Catalyst optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
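The plan diagrams from this slide are not reproduced in the transcript. As a generic illustration only, the sketch below shows column pruning, one rewrite that relational optimizers such as Catalyst are known to apply; whether it matches the exact plans shown in the talk is an assumption. The function names and sample rows are hypothetical, and the code is plain Python rather than Spark.

```python
# Made-up rows; "extra" stands in for the many columns the query never uses.
rows = [
    {"tip_amt": 0.5, "fare_amt": 10.0, "extra": "a"},
    {"tip_amt": 2.0, "fare_amt": 20.0, "extra": "b"},
    {"tip_amt": 4.0, "fare_amt": 30.0, "extra": "c"},
]


def naive_mean(rows):
    """Unoptimized shape: copy every column through the filter, select last."""
    filtered = [dict(r) for r in rows if 1 <= r["tip_amt"] <= 5]
    fares = [r["fare_amt"] for r in filtered]
    return sum(fares) / len(fares)


def pruned_mean(rows):
    """Pruned shape: read only the two columns the query needs."""
    slim = ((r["tip_amt"], r["fare_amt"]) for r in rows)
    fares = [fare for tip, fare in slim if 1 <= tip <= 5]
    return sum(fares) / len(fares)


naive_result = naive_mean(rows)
pruned_result = pruned_mean(rows)
```

Both functions return the same mean; the pruned version simply moves less data, which matters far more at 157 GB than in this toy example.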
34. Why is Koalas fast - Whole-stage code generation
~650% improvement
~1200% improvement
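The general idea behind whole-stage code generation is to replace a chain of separate operators with a single fused loop, so each row avoids passing through several layers of iterator calls. Spark does this by generating code at runtime; the sketch below only imitates the before and after shapes by hand in Python. The operator names are hypothetical and the sketch makes no claim about the improvement figures on this slide.

```python
def scan(rows):
    for row in rows:
        yield row


def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def project_op(child, func):
    for row in child:
        yield func(row)


def pipeline_sum(rows):
    """Operator-at-a-time: every row travels through three nested iterators."""
    filtered = filter_op(scan(rows), lambda r: r % 2 == 0)
    projected = project_op(filtered, lambda r: r * 10)
    return sum(projected)


def fused_sum(rows):
    """Hand-fused equivalent: the same logic collapsed into one loop."""
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r * 10
    return total


pipeline_result = pipeline_sum(range(10))
fused_result = fused_sum(range(10))
```

The two functions compute the same value; the fused form just does it with fewer function calls per row.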
35. Benchmark conclusions
• SQL optimizers improve the performance of DataFrame APIs
• Caching accelerates both Koalas and Dask dramatically
• Koalas outperforms Dask in the majority of use cases
Reference blog post: Benchmark: Koalas (PySpark) and Dask
37. Version 1.0.0~1.8.0
▪ Improve Plotly backend support, and switch the default plotting backend to Plotly
▪ Extension dtypes support
▪ More Index types
▪ Create Index from Series or Index objects
▪ Support setting to a Series via attribute access
▪ Operations between Series and Index
▪ Standardize binary operations between int and str columns
▪ Index operations support
▪ Better type support
▪ Return type annotations for major Koalas objects
38. Version 1.0.0~1.8.0
▪ Support for non-string names
▪ Non-named Series support
▪ Wider support of in-place update
▪ Improve distributed-sequence default index
▪ pandas 1.1, 1.1.4 support
▪ Better pandas API coverage
▪ Introduced koalas and Spark accessors
▪ Improve testing infrastructure
▪ Apache Spark 3.0 support
▪ Python 3.8 support
▪ Support for API extensions
▪ Better type hints support
39. Porting Koalas to Spark
SPIP: Support pandas API layer on PySpark
https://issues.apache.org/jira/browse/SPARK-34849