2. About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin
3. Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark
4. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into Python data science ecosystem, e.g. numpy, matplotlib
Can deal with a lot of different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
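The capabilities listed above can be sketched in a few lines of pandas. This is a minimal, self-contained illustration using a small made-up frame (the column names and values are assumptions for this sketch, not from the talk):

```python
import numpy as np
import pandas as pd

# Small frame with a missing value and a categorical-style column.
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "kind": ["a", "b", "a", "b"]})

mean_x = df["x"].mean()             # basic statistic; NaN is skipped by default
filled = df["x"].fillna(0.0)        # handling missing data
counts = df["kind"].value_counts()  # summary of a categorical variable
```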
5. Apache Spark
De facto unified analytics engine for large-scale data processing
(Streaming, ETL, ML)
Created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala, R and SQL
7. A short example
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)
8. Koalas
Announced April 24, 2019
Pure Python library
Aims at providing the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data
9. Quickly gaining traction
Weekly releases!
- More than 100 patches merged since the announcement at Spark Summit (April 24)
- More than 10 significant contributors outside of Databricks
- More than 2.5k daily downloads
10. A short example
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
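The point of the side-by-side snippets is that the same pandas-style lines run unchanged under either import. As a self-contained sketch, here are those shared operations on a small in-memory frame (the values and original column names are made up, standing in for my_data.csv):

```python
import pandas as pd

# In-memory stand-in for my_data.csv (hypothetical values).
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

df.columns = ['x', 'y', 'z1']   # rename all columns at once
df['x2'] = df.x * df.x          # add a derived column
```

With Koalas installed, swapping `pd.DataFrame(...)` for a `databricks.koalas` frame leaves the two renaming/derivation lines untouched.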
11. Key Differences
Spark is lazier by nature:
- most operations only happen when displaying or writing a DataFrame
Spark DataFrames have no implicit row ordering
Performance when working at scale
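The laziness difference can be sketched in plain Python. This is a toy model (not Spark's actual machinery): transformations only record pending work, and nothing executes until an action such as `collect()` forces it.

```python
class LazyFrame:
    """Toy model of a lazy DataFrame: transformations are recorded,
    not executed, until an action like collect() forces them."""

    def __init__(self, rows, ops=None):
        self.rows = rows
        self.ops = ops or []  # pending transformations, in order

    def map_rows(self, fn):
        # Transformation: just remember fn; no computation happens here.
        return LazyFrame(self.rows, self.ops + [fn])

    def collect(self):
        # Action: only now is the recorded pipeline actually run.
        out = self.rows
        for fn in self.ops:
            out = [fn(r) for r in out]
        return out

lf = LazyFrame([1, 2, 3]).map_rows(lambda r: r * r)
# Nothing has been computed yet; collect() triggers the work:
result = lf.collect()  # [1, 4, 9]
```

Eager pandas runs each operation immediately; under this lazy model, Spark can see the whole pipeline before executing, which is part of why behavior and performance differ at scale.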
12. Current status
Weekly releases, very active community with daily changes
The most common functions have been implemented:
- 30% of the DataFrame API
- 30% of the Series API
- to_datetime, get_dummies, …
Special thanks to Hyukjin Kwon and Takuya Ueshin for major contributions to
the project
13. What to expect soon?
Performance enhancements, e.g. caching
Plotting
Better indexing support
More data manipulation (string and date manipulations)
14. Getting started
pip install koalas
conda install -c conda-forge koalas
Look for docs and updates on github.com/databricks/koalas
15. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
github.com/databricks/koalas/blob/master/CONTRIBUTING.md