1. Rust is for “Big Data”
Andy Grove @ Boulder/Denver Rust Meetup 4/11/18
2. About Me
• I’ve been a software engineer for ~30 years
• 20 years of that using Java
• Also some management/founder roles
• In my day job I mostly work with Scala, Spark, Parquet, Kudu, Thrift, and HDFS
• Yay! I'm a Big Data Engineer™
• I have been learning Rust in my spare time, on and off, over the past couple of years
• One of my goals for 2018 was to become proficient in Rust, so I decided to take on a substantial project
3. What’s wrong with Spark/JVM?
• Spark is actually pretty neat, but …
• Garbage collection overheads can be huge
• OutOfMemory errors are common
• Java serialization is inefficient, even with Kryo
• Expensive up-front query planning and code generation make it inefficient for interactive queries and small data sets
• Difficult to configure, monitor, and debug
• Generally row-oriented, even when working with columnar data sources
5. Let’s build something better!
• Rust > JVM:
• Raw performance of compiled code
• Efficient memory usage
• Predictable memory usage
• No serialization overhead to map raw bytes to Rust structs (see the sketch after this list)
• Access to hardware (SIMD, DMA, etc)
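
As a rough illustration of the "no serialization overhead" point: a buffer of native-endian f64 values can be viewed in place as a Rust slice, with no per-value deserialization step. This is only a sketch, not Arrow or DataFusion code; Arrow's own buffers add the alignment and validity-bitmap handling that a bare byte slice lacks.

    /// Sketch: view an 8-byte-aligned buffer of native-endian f64 values
    /// in place, without copying or decoding each value.
    fn view_as_f64(bytes: &[u8]) -> &[f64] {
        // align_to reinterprets the aligned middle of the slice without copying.
        let (prefix, values, suffix) = unsafe { bytes.align_to::<f64>() };
        assert!(
            prefix.is_empty() && suffix.is_empty(),
            "buffer must be 8-byte aligned and a multiple of 8 bytes long"
        );
        values
    }
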
6. Keep Calm and Keep Columnar
• Column-oriented > Row-oriented
• Just load the columns you need from disk (efficient projections)
• “a > b” and “a + b” are now vectorized operations that can take advantage of SIMD (Single Instruction, Multiple Data); see the sketch after this list
• Apache Arrow is a standardized columnar in-memory format for zero-copy data interchange between systems
• Apache Parquet is a columnar file format with efficient per-column encoding and compression
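
A minimal sketch of the vectorization point above: with columnar data, “a > b” and “a + b” become tight loops over contiguous slices that the compiler can auto-vectorize. The column contents and function names here are illustrative, not DataFusion code.

    // Columnar comparison: one tight loop over two contiguous f64 columns.
    fn greater_than(a: &[f64], b: &[f64]) -> Vec<bool> {
        a.iter().zip(b).map(|(x, y)| x > y).collect()
    }

    // Columnar addition: the same shape of loop, a good SIMD candidate.
    fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
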
8. DataFusion
• DataFusion is a proof-of-concept of a modern distributed compute platform, implemented in Rust
• Programming model is similar to Apache Spark (DataFrame and SQL APIs); see the sketch after this list
• Apache Arrow is used for the core memory model
• Apache Parquet is partially supported (read-only and no support for nested types yet)
• CSV is supported too (where there is Big Data, there is CSV)
• etcd is used for co-ordination between nodes
• Kubernetes/Docker deployment model (planned)
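
A rough sketch of the SQL API flow (register a CSV file as a table, run a query, show the results). The names below (SessionContext, register_csv, CsvReadOptions) come from later DataFusion releases; the 2018 proof-of-concept API differed, so treat the exact calls and the file path as assumptions.

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // Register a CSV file as a queryable table (path is illustrative).
        ctx.register_csv("locations", "locations.csv", CsvReadOptions::new())
            .await?;
        // Run SQL against the registered table and print the result batches.
        let df = ctx.sql("SELECT lat, lng FROM locations LIMIT 10").await?;
        df.show().await?;
        Ok(())
    }
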
11. First Benchmark
• Simple job to convert lat/lng pairs into ESRI WKT (well-known text) format
• SELECT ST_AsText(ST_Point(lat, lng)) FROM locations
• Reads from CSV file
• Calls two UDFs and creates one UDT (see the sketch after this list)
• Writes results to CSV file
• Single thread, single core
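
To make the benchmark concrete, here is an illustrative sketch of what the two UDFs and the UDT amount to per row: ST_Point constructs a point value from two numeric columns and ST_AsText renders it as WKT. The Rust names and the sample coordinates are hypothetical; this is not the actual DataFusion UDF registration code.

    // Illustrative stand-in for the user-defined type (UDT) in the query.
    struct Point {
        x: f64,
        y: f64,
    }

    // ST_Point: construct a point value from two numeric column values.
    fn st_point(x: f64, y: f64) -> Point {
        Point { x, y }
    }

    // ST_AsText: render the point in WKT, e.g. "POINT (40.015 -105.2705)".
    fn st_as_text(p: &Point) -> String {
        format!("POINT ({} {})", p.x, p.y)
    }

    fn main() {
        // One row of the benchmark: two UDF calls producing one string.
        println!("{}", st_as_text(&st_point(40.0150, -105.2705)));
    }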