20161215 Python, Pandas, and Spark: Assorted Topics (四方山話)


Ryuji Tamagawa

Slides from my talk on 2016/12/15 at Insight Technology's 三木会 (third-Thursday meetup). The topics are Python, Pandas, Spark, and so on.



1. Python, Pandas, Spark 2.0 Sky

3. • Python 2000 (**) • db tech showcase MongoDB • FB: Ryuji Tamagawa • Twitter: tamagawa_ryuji

5. 2017

6. • Python Spark

7. • Python / Pandas • Spark 2.0

8. Part 1 :

9. • CSV

10. Python Pandas Python Jupyter Notebook Jenkins Spark 2.0

11. • Spark API RDD ~1.3 DataFrame / DataSet 1.4~ • DataFrame API RDD API Python Spark

12. DataFrame • RDB / • R Pandas Spark Spark R / Pandas Spark +

13. Part 2 :

14. CSV zip RDB Parquet Excel CSV Feather Spark Pandas / Spark

15. • CPU • Pandas read_csv zip CSV Pandas
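Slides 15 and 16 mention loading a zipped CSV with Pandas read_csv. A minimal sketch of that call follows, assuming a zip archive that contains exactly one CSV file. The sample data, the file names, and the helper read_zipped_csv are hypothetical and not taken from the slides.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical sample data, invented for this illustration.
CSV_TEXT = "id,name\n1,alpha\n2,beta\n3,gamma\n"


def read_zipped_csv(zip_path):
    """Load a zip archive holding a single CSV file into a DataFrame."""
    # compression="zip" tells pandas to decompress the archive in-process
    # before parsing; pandas expects the archive to contain one file only.
    return pd.read_csv(zip_path, compression="zip")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small zipped CSV so the example does not depend on external files.
    zip_path = os.path.join(tmp_dir, "sample.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sample.csv", CSV_TEXT)
    df = read_zipped_csv(zip_path)

print(df)
```

Decompression and CSV parsing both run inside the Python process, so this step spends CPU time in addition to I/O, which seems to be what the CPU bullets on these slides refer to.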

16. 2 • CSV CPU Pandas zip CSV CPU … • Parquet!

17. Parquet I/O • Spark Parquet • Python Parquet

18. HDFS / S3 Parquet Parquet

19. SSD Parquet Parquet

20. Parquet No No Yes HDD

21. • I/O Pandas • Spark • DataFrame Pandas → Spark Spark → Pandas Pandas → Spark • Apache Arrow
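Slide 21 lists DataFrame conversions in both directions, Pandas → Spark and Spark → Pandas. The sketch below shows the two PySpark calls usually used for this, createDataFrame and toPandas. The wrapper function names are hypothetical, and the caller is assumed to supply an existing SparkSession.

```python
def pandas_to_spark(spark, pandas_df):
    """Pandas -> Spark: hand a local pandas DataFrame to the cluster.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    return spark.createDataFrame(pandas_df)


def spark_to_pandas(spark_df):
    """Spark -> Pandas: collect a Spark DataFrame into driver memory.

    toPandas() pulls every row to the driver, so the result should be
    small enough to fit in a single process.
    """
    return spark_df.toPandas()
```

A typical round trip would be spark_to_pandas(pandas_to_spark(spark, pdf)), where spark comes from SparkSession.builder.getOrCreate() and pdf is any pandas DataFrame.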

22. CPU ~2010 2010~ SSD CPU

23. Apache Spark 2.0 • 1.x • 2.0 1.x • DataFrame API Python • Databricks http://go.databricks.com/mastering-apache-spark-2.0

24. Spark 2.0 • CPU • CPU • SQL DataFrame • + SSD • CSV zip Pandas read_csv

25. Python + Spark • Python serialize • DataFrame API UDF UDF Scala/Java • http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api [Figure labels: Executor JVM; DataFrame, Cached; Python; lambda items: items[0] == 'abc'; transfer; DataFrame, result; transfer; Driver]
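Slide 25 contrasts Python code that Spark must run in a Python worker (with rows serialized and transferred out of the executor JVM and back) with the DataFrame API. The sketch below expresses the slide's predicate, lambda items: items[0] == 'abc', in three ways under that reading. The function names and the column_name parameter are hypothetical, and the caller is assumed to supply an existing RDD or DataFrame.

```python
def is_abc(items):
    """The row-level predicate shown on the slide."""
    return items[0] == 'abc'


def filter_with_python_lambda(rdd):
    # RDD API: each row is serialized, transferred from the executor JVM
    # to a Python worker, tested there, and the result transferred back.
    return rdd.filter(is_abc)


def filter_with_python_udf(df, column_name):
    # DataFrame API with a Python UDF: the UDF body still runs in a
    # Python worker, so the column values are serialized and transferred.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    matches_abc = F.udf(lambda value: value == 'abc', BooleanType())
    return df.filter(matches_abc(F.col(column_name)))


def filter_with_dataframe_api(df, column_name):
    # DataFrame API only: the comparison is a column expression that
    # Spark evaluates inside the JVM, without a Python round trip.
    from pyspark.sql import functions as F

    return df.filter(F.col(column_name) == 'abc')
```

All three return the same rows for a string key column. The PySpark imports are placed inside the functions only so that the plain predicate can be tried without a Spark installation.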
