Koalas
How Well Does Koalas Work?
Takuya Ueshin, Xinrong Meng
Software Engineers @ Databricks
About Us
Takuya Ueshin
▪ Software Engineer @ Databricks
▪ Apache Spark committer and PMC member
▪ Focusing on Spark SQL and PySpark
▪ Koalas maintainer
Xinrong Meng
▪ Software Engineer @ Databricks
▪ Koalas maintainer
Agenda
▪ Introduction of Koalas
  - pandas
  - PySpark
▪ Koalas Internal
▪ Benchmark
  - Introduction of Dask
  - Koalas benchmark against Dask
▪ Koalas Updates
Introduction of Koalas
What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundreds of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- Be more productive with pandas-like functions
pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem
- NumPy
- Matplotlib
- scikit-learn
Stack Overflow Trends
Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark is the Python API; APIs are also supported for Scala/Java, R, and SQL
Koalas DataFrame and PySpark DataFrame
Koalas DataFrame
- Follows the structure of pandas
- Provides pandas APIs
- Implements index/identifier
- Translates pandas APIs into a logical plan of Spark SQL
- The plan is optimized and executed by the Spark SQL engine
PySpark DataFrame
- More compliant with the relations/tables in relational databases
- Does not have unique row identifiers
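To make the index/identifier point concrete, here is a minimal pure-Python sketch of one way to give every row a unique, consecutive identifier when rows are spread over several partitions. The partition contents and variable names are hypothetical, and this is not necessarily how Koalas builds its default index; it only illustrates why an index takes extra work on top of a table without row identifiers.

```python
from itertools import accumulate

# Hypothetical rows spread over three partitions (no row identifiers yet).
partitions = [["a", "b", "c"], ["d"], ["e", "f"]]

# Step 1: count rows per partition.
counts = [len(part) for part in partitions]

# Step 2: the starting index of each partition is the running total
# of the row counts of all partitions before it.
offsets = [0] + list(accumulate(counts))[:-1]

# Step 3: each partition numbers its own rows starting from its offset.
indexed = [
    [(offset + i, row) for i, row in enumerate(part)]
    for offset, part in zip(offsets, partitions)
]
```

Each partition only needs its own offset, so the numbering step can run independently per partition once the counts are known.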
Koalas Internal
InternalFrame
Internal immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame and pandas DataFrame
InternalFrame
[Diagram: a Koalas DataFrame holds an InternalFrame, which references a PySpark DataFrame and stores:]
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
InternalFrame
[Diagram: an API call on a Koalas DataFrame produces a new Koalas DataFrame. Its InternalFrame is a copy with new state, backed by a new PySpark DataFrame.]
InternalFrame
[Diagram: an API call that only updates metadata produces a new Koalas DataFrame. Its InternalFrame is a copy with new state, and the PySpark DataFrame is shared with the original.]
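The "copy with new state" idea can be sketched in plain Python with frozen dataclasses. The classes `ToyInternalFrame` and `ToyDataFrame` and the `rename` method below are hypothetical stand-ins for illustration only, not the Koalas implementation.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Hypothetical stand-in for InternalFrame: immutable metadata only."""
    spark_frame: object               # stands in for the PySpark DataFrame
    index_names: Tuple[str, ...]
    column_labels: Tuple[str, ...]


@dataclass(frozen=True)
class ToyDataFrame:
    """Hypothetical stand-in for a Koalas DataFrame wrapping the metadata."""
    internal: ToyInternalFrame

    def rename(self, mapping):
        # Metadata-only call: copy the internal state with new labels and
        # keep pointing at the same underlying frame object.
        new_labels = tuple(mapping.get(c, c) for c in self.internal.column_labels)
        return ToyDataFrame(replace(self.internal, column_labels=new_labels))


backing = object()  # placeholder for a real PySpark DataFrame
df = ToyDataFrame(ToyInternalFrame(backing, ("idx",), ("fare_amt", "tip_amt")))
renamed = df.rename({"tip_amt": "tip"})
```

Because the metadata object is frozen, the original `df` is untouched; `renamed` carries the new labels while sharing the same backing frame.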
Benchmark
Introduction of Dask
• A parallel computing framework
• Written in pure Python
• Uses blocked algorithms and task scheduling
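A blocked algorithm splits the data into chunks, computes on each chunk, and combines the partial results; a task graph records which step depends on which. The sketch below uses a dictionary-of-tasks layout similar in shape to the one Dask documents for its graphs, but the `run` evaluator and the key names are hypothetical and far simpler than a real scheduler.

```python
from operator import add


def run(graph, key):
    """Evaluate `key` by first evaluating the tasks it depends on."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that name other keys in the graph are resolved recursively.
        resolved = [
            run(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # a literal value, e.g. a block of data


data = list(range(10))
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]  # three blocks

# Build the graph: one sum task per block, then combine the partial sums.
graph = {}
for i, block in enumerate(blocks):
    graph[f"block-{i}"] = block
    graph[f"sum-{i}"] = (sum, f"block-{i}")
graph["partial"] = (add, "sum-0", "sum-1")
graph["total"] = (add, "partial", "sum-2")

total = run(graph, "total")
```

The per-block sums do not depend on each other, which is what lets a scheduler run them in parallel.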
Dask is different from Koalas
Execution engine
- Koalas: Apache Spark, a unified analytics engine for large-scale data processing
- Dask: Dask, a graph execution engine
Aim
- Koalas: A single codebase that works with both pandas and Spark
- Dask: Scale pandas workflow
Abstraction
- Koalas: Query plan
- Dask: Task graph and task scheduler
Collections
- Koalas: DataFrame
- Dask: Array, DataFrame, Bag
Benchmark setup - Methodology
• Dataset
  - 157 GB Yellow Taxi Trip Records (2009 - 2013)
• Operations
  - Basic statistical calculations
  - Joins
  - Grouping
• Operations were applied to
  - The whole dataset
  - Filtered data (36% of the whole dataset)
  - Cached filtered data (36% of the whole dataset)
The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
Benchmark setup - Environment
• Local execution
  - A single i3.16xlarge VM (488 GB memory | 64 cores | 25 Gigabit Ethernet)
• Distributed execution
  - 1 driver node, 3 worker nodes
  - Each node is an i3.4xlarge VM (122 GB memory | 16 cores | 10 Gigabit Ethernet)
Benchmark results - Overview
Koalas outperformed Dask:
- Local execution: 2.1x (geometric mean), 4x (simple average)
- Distributed execution: 4.6x (geometric mean), 7.9x (simple average)
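The overview reports both a geometric mean and a simple average, and the two can differ a lot when a few operations have very large speedups. The sketch below shows the arithmetic on made-up ratios; the values in `speedups` are hypothetical and are not taken from the benchmark.

```python
from statistics import geometric_mean, mean

# Hypothetical per-operation speedups (Dask time / Koalas time).
# These numbers are invented for illustration, NOT benchmark results.
speedups = [1.0, 2.0, 4.0, 32.0]

# Simple average: sum of ratios divided by their count.
simple_average = mean(speedups)

# Geometric mean: n-th root of the product of the ratios,
# which is less dominated by a single large outlier.
geo_mean = geometric_mean(speedups)
```

With these invented ratios the simple average is 9.75 while the geometric mean is about 4, which mirrors why the slide's simple averages are larger than its geometric means.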
Benchmark results - On the whole dataset
Local execution: Koalas is ~1.2x faster
Distributed execution: Koalas is ~2x faster
Benchmark results - On the filtered data
Local execution: Koalas is ~6x faster
Distributed execution: Koalas is ~9x faster
Benchmark results - On the cached filtered data
Local execution: Koalas is ~1.4x faster
Distributed execution: Koalas is ~5x faster
Why is Koalas fast?
● Query plan optimization by Catalyst
● Whole-stage code generation
Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst optimization
• After the Catalyst optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
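The before/after plans from the slide are not preserved in this transcript. As a rough illustration of the kind of rewrite a query optimizer can apply to the pseudocode above, the toy sketch below performs column pruning, one standard optimization, on a hand-written plan. The plan layout, the `prune_columns` function, and the `passenger_count` column are all hypothetical; this is not Catalyst code.

```python
# Toy logical plan for: df[(tip_amt >= 1) & (tip_amt <= 5)].fare_amt.mean()
plan = [
    ("scan", {"columns": ["fare_amt", "tip_amt", "passenger_count"]}),
    ("filter", {"uses": ["tip_amt"]}),
    ("project", {"columns": ["fare_amt"]}),
    ("aggregate", {"func": "mean", "uses": ["fare_amt"]}),
]


def prune_columns(plan):
    """Return a new plan whose scan reads only columns used downstream."""
    needed = set()
    for _op, args in plan[1:]:
        needed.update(args.get("uses", []))
        needed.update(args.get("columns", []))

    optimized = []
    for op, args in plan:
        if op == "scan":
            # Build a new dict so the original plan is left unchanged.
            args = {"columns": [c for c in args["columns"] if c in needed]}
        optimized.append((op, args))
    return optimized


optimized = prune_columns(plan)
```

After pruning, the scan reads only `fare_amt` and `tip_amt`, so less data has to be loaded before the filter and the mean are computed.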
Why is Koalas fast - Whole-stage code generation
~650% improvement
~1200% improvement
Benchmark conclusions
• SQL optimizers improve the performance of DataFrame APIs
• Caching accelerates both Koalas and Dask dramatically
• Koalas outperforms Dask in the majority of use cases
Reference blog post: Benchmark: Koalas (PySpark) and Dask
Koalas updates
Version 1.0.0~1.8.0
▪ Improve Plotly backend support, and switch the default plotting backend to Plotly
▪ Extension dtypes support
▪ More Index types
▪ Create Index from Series or Index objects
▪ Support setting to a Series via attribute access
▪ Operations between Series and Index
▪ Standardize binary operations between int and str columns
▪ Index operations support
▪ Better type support
▪ Return type annotations for major Koalas objects
Version 1.0.0~1.8.0
▪ Support for non-string names
▪ Non-named Series support
▪ Wider support of in-place update
▪ Improve distributed-sequence default index
▪ pandas 1.1, 1.1.4 support
▪ Better pandas API coverage
▪ Introduced koalas and Spark accessors
▪ Improve testing infrastructure
▪ Apache Spark 3.0 support
▪ Python 3.8 support
▪ Support for API extensions
▪ Better type hints support
Porting Koalas to Spark
SPIP: Support pandas API layer on PySpark
https://issues.apache.org/jira/browse/SPARK-34849
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 

Koalas: How Well Does Koalas Work?

  • 1. Koalas How Well Does Koalas Work? Takuya Ueshin, Xinrong Meng Software Engineer @ Databricks
  • 2. About Us Takuya Ueshin ▪ Software Engineer @ Databricks ▪ Apache Spark committer and PMC member ▪ Focusing on Spark SQL and PySpark ▪ Koalas maintainer Xinrong Meng ▪ Software Engineer @ Databricks ▪ Koalas maintainer
  • 3. Agenda ▪ Introduction of Koalas pandas PySpark ▪ Koalas Internal ▪ Benchmark Introduction of Dask Koalas benchmark against Dask ▪ Koalas Updates
  • 5. What’s Koalas? Announced April 24, 2019. Provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes. For pandas users: scale out pandas code using Koalas; make learning PySpark much easier. For PySpark users: be more productive with pandas-like functions.
  • 6. pandas. Authored by Wes McKinney in 2008. The standard tool for data manipulation and analysis in Python. Deeply integrated into the Python data science ecosystem: NumPy, Matplotlib, scikit-learn. [Figure: Stack Overflow Trends]
  • 7. Apache Spark. De facto unified analytics engine for large-scale data processing: streaming, ETL, ML. Originally created at UC Berkeley by Databricks’ founders. PySpark is the API for Python; APIs are also available for Scala/Java, R, and SQL.
  • 9. Koalas DataFrame and PySpark DataFrame
      - Koalas DataFrame: follows the structure of pandas; provides pandas APIs; implements an index/identifier; translates pandas APIs into a logical plan of Spark SQL. The plan is then optimized and executed by the Spark SQL engine.
      - PySpark DataFrame: more compliant with the relations/tables in relational databases; does not have unique row identifiers.
  • 11. InternalFrame. Internal immutable metadata: the current PySpark DataFrame; PySpark Columns; index names/data column names; index dtypes/data dtypes. Provides conversions between PySpark DataFrame and pandas DataFrame.
  • 17. InternalFrame. [Diagram: a Koalas DataFrame holds an InternalFrame (index/data_spark_columns, index_names/column_labels, index/data_dtypes), which holds a PySpark DataFrame. An API call produces a copy with new state: a new Koalas DataFrame with a new InternalFrame and a new PySpark DataFrame.]
  • 18. InternalFrame. [Diagram: a Koalas DataFrame holds an InternalFrame (index/data_spark_columns, index_names/column_labels, index/data_dtypes), which holds a PySpark DataFrame. Here the API call only updates metadata: the copy with new state has a new Koalas DataFrame and a new InternalFrame, and no new PySpark DataFrame is shown.]
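The "copy with new state" idea on the InternalFrame slides can be sketched in plain Python. This is a minimal illustration, not the Koalas implementation: `ToyInternalFrame`, `ToyFrame`, and `relabel` are hypothetical names, and the underlying PySpark DataFrame is replaced by an opaque placeholder object.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Hypothetical stand-in for InternalFrame: immutable metadata only."""
    spark_frame: object              # placeholder for the PySpark DataFrame
    index_names: Tuple[str, ...]
    column_labels: Tuple[str, ...]


class ToyFrame:
    """Hypothetical stand-in for a Koalas DataFrame wrapping the metadata."""

    def __init__(self, internal: ToyInternalFrame):
        self._internal = internal

    def relabel(self, mapping: dict) -> "ToyFrame":
        # Metadata-only change: build a new immutable record with new labels
        # and reuse the same underlying data object.
        old = self._internal
        new_labels = tuple(mapping.get(c, c) for c in old.column_labels)
        return ToyFrame(replace(old, column_labels=new_labels))


data = object()  # stands in for a PySpark DataFrame
frame = ToyFrame(ToyInternalFrame(data, ("idx",), ("a", "b")))
renamed = frame.relabel({"a": "x"})
```

Because the metadata record is frozen, the original frame keeps its labels, and both frames point at the same data placeholder.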
  • 20. Introduction of Dask • A parallel computing framework • Written in pure Python • Uses blocked algorithms and task scheduling
  • 24. Dask is different from Koalas
      - Execution engine. Koalas: Apache Spark, a unified analytics engine for large-scale data processing. Dask: Dask, a graph execution engine.
      - Aim. Koalas: a single codebase that works with both pandas and Spark. Dask: scale pandas workflow.
      - Abstraction. Koalas: query plan. Dask: task graph and task scheduler.
      - Collections. Koalas: DataFrame. Dask: Array, DataFrame, Bag.
  • 25. Benchmark setup - Methodology • Dataset: 157 GB Yellow Taxi Trip Records (2009 - 2013) • Operations: basic statistical calculations, joins, grouping • Operations were applied to: the whole dataset; filtered data (36% of the whole dataset); cached filtered data (36% of the whole dataset). The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
  • 26. Benchmark setup - Environment • Local execution: a single i3.16xlarge VM (488 GB memory | 64 cores | 25 Gigabit Ethernet) • Distributed execution: 1 driver node and 3 worker nodes; each node is an i3.4xlarge VM (122 GB memory | 16 cores | 10 Gigabit Ethernet)
  • 27. Benchmark results - Overview. Koalas outperformed Dask:
      - Local execution: 2.1x (geometric mean), 4x (simple average)
      - Distributed execution: 4.6x (geometric mean), 7.9x (simple average)
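The overview reports both a geometric mean and a simple average of the speedups. A minimal sketch of why the two can differ, assuming a handful of made-up per-operation ratios (the values below are hypothetical and are not taken from the benchmark):

```python
import math
from statistics import geometric_mean, mean

# Hypothetical per-operation speedup ratios; one outlier dominates.
speedups = [1.2, 1.5, 2.0, 12.0]

simple = mean(speedups)          # arithmetic mean, pulled up by the outlier
geo = geometric_mean(speedups)   # n-th root of the product of the ratios
product = math.prod(speedups)

print(f"simple average: {simple:.2f}x, geometric mean: {geo:.2f}x")
```

The geometric mean is less sensitive to a few very large ratios, which is one reason a summary of speedups may quote it next to the simple average.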
  • 28. Benchmark results - On the whole dataset Local execution: Koalas is ~1.2x faster Distributed execution: Koalas is ~2x faster
  • 29. Benchmark results - On the filtered data Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x faster
  • 30. Benchmark results - On the cached filtered data Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x faster
  • 31. Why is Koalas fast? ● Query plan optimization by Catalyst ● Whole-stage code generation
  • 33. Why is Koalas fast - Catalyst optimizer. Query plan of mean calculation on the filtered data • Before the Catalyst optimization • After the Catalyst optimization [Figures: query plans, not preserved]
      # Pseudocode
      expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5)
      df[expr_filter].fare_amt.mean()
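The before/after plans on this slide did not survive extraction, so the following is only a toy illustration of the kind of rewrite a query optimizer can apply to the filtered-mean query in the pseudocode. It is plain Python, not Catalyst: the plan format, `combine_filters`, `required_columns`, and the sample rows are all hypothetical.

```python
# Toy rows standing in for taxi trip records (made-up values).
rows = [
    {"tip_amt": 0.5, "fare_amt": 8.0, "passengers": 1},
    {"tip_amt": 2.0, "fare_amt": 10.0, "passengers": 2},
    {"tip_amt": 4.0, "fare_amt": 20.0, "passengers": 1},
    {"tip_amt": 9.0, "fare_amt": 50.0, "passengers": 3},
]

# A naive plan: two separate filters, then a projection.
naive_plan = [
    ("filter", "tip_amt", lambda v: v >= 1),
    ("filter", "tip_amt", lambda v: v <= 5),
    ("project", ["fare_amt"]),
]


def combine_filters(plan):
    """Merge adjacent filters on the same column into a single step."""
    out = []
    for step in plan:
        if (out and step[0] == "filter" and out[-1][0] == "filter"
                and out[-1][1] == step[1]):
            _, col, first = out.pop()
            second = step[2]
            out.append(("filter", col,
                        lambda v, f=first, g=second: f(v) and g(v)))
        else:
            out.append(step)
    return out


def required_columns(plan):
    """Collect the columns the plan actually reads (column pruning)."""
    cols = set()
    for step in plan:
        if step[0] == "filter":
            cols.add(step[1])
        else:
            cols.update(step[1])
    return cols


def execute(plan, data, scan_columns=None):
    """Run the plan; optionally read only the pruned columns at scan time."""
    if scan_columns is not None:
        data = [{c: r[c] for c in scan_columns} for r in data]
    for step in plan:
        if step[0] == "filter":
            _, col, pred = step
            data = [r for r in data if pred(r[col])]
        else:
            data = [{c: r[c] for c in step[1]} for r in data]
    return data


def mean_of(data, col):
    values = [r[col] for r in data]
    return sum(values) / len(values)


optimized_plan = combine_filters(naive_plan)
pruned = required_columns(optimized_plan)
naive_result = mean_of(execute(naive_plan, rows), "fare_amt")
optimized_result = mean_of(execute(optimized_plan, rows, pruned), "fare_amt")
```

Both plans return the same mean; the rewritten one does it with fewer steps and never reads the unused `passengers` column.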
  • 34. Why is Koalas fast - Whole-stage code generation ~650% improvement ~1200% improvement
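The idea behind whole-stage code generation is to fuse a chain of operators into a single function instead of passing rows through one iterator per operator. Spark does this by generating code at runtime; the hand-written Python below is only a sketch of the two shapes, with hypothetical function names and made-up rows, and it makes no claim about the speedups quoted on the slide.

```python
rows = [
    {"tip_amt": 0.5, "fare_amt": 8.0},
    {"tip_amt": 2.0, "fare_amt": 10.0},
    {"tip_amt": 4.0, "fare_amt": 20.0},
    {"tip_amt": 9.0, "fare_amt": 50.0},
]


# Iterator-per-operator style: each operator pulls rows from its child.
def scan(data):
    for row in data:
        yield row


def filter_op(child, pred):
    for row in child:
        if pred(row):
            yield row


def project_op(child, col):
    for row in child:
        yield row[col]


def mean_iterators(data):
    values = project_op(
        filter_op(scan(data), lambda r: 1 <= r["tip_amt"] <= 5),
        "fare_amt",
    )
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count


# Fused style: scan, filter, projection, and aggregation in one loop.
def mean_fused(data):
    total, count = 0.0, 0
    for row in data:
        if 1 <= row["tip_amt"] <= 5:
            total += row["fare_amt"]
            count += 1
    return total / count


iterator_result = mean_iterators(rows)
fused_result = mean_fused(rows)
```

The fused version avoids the per-row hand-off between operators, which is the overhead that operator fusion is meant to remove.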
  • 35. Benchmark conclusions • SQL optimizers improve the performance of DataFrame APIs • Caching accelerates both Koalas and Dask dramatically • Koalas outperforms Dask in the majority of use cases. Reference blog post: Benchmark: Koalas (PySpark) and Dask
  • 37. Version 1.0.0~1.8.0 ▪ Improve Plotly backend support, and switch the default plotting backend to Plotly ▪ Extension dtypes support ▪ More Index types ▪ Create Index from Series or Index objects ▪ Support setting to a Series via attribute access ▪ Operations between Series and Index ▪ Standardize binary operations between int and str columns ▪ Index operations support ▪ Better type support ▪ Return type annotations for major Koalas objects
  • 38. Version 1.0.0~1.8.0 ▪ Support for non-string names ▪ Non-named Series support ▪ Wider support of in-place update ▪ Improve distributed-sequence default index ▪ pandas 1.1, 1.1.4 support ▪ Better pandas API coverage ▪ Introduced koalas and Spark accessors ▪ Improve testing infrastructure ▪ Apache Spark 3.0 support ▪ Python 3.8 support ▪ Support for API extensions ▪ Better type hints support
  • 39. Porting Koalas to Spark. SPIP: Support pandas API layer on PySpark https://issues.apache.org/jira/browse/SPARK-34849
  • 40. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.