1. Rust is for “Big Data”
Andy Grove @ Boulder/Denver Rust Meetup 4/11/18
2. About Me
• I’ve been a software engineer for ~30 years
• 20 years of that using Java
• Also some management/founder roles
• In my day job I mostly work with Scala, Spark, Parquet, Kudu, Thrift, and HDFS
• Yay! I'm a Big Data Engineer™
• I have been learning Rust in my spare time, on and off, over the past couple of years
• One of my goals for 2018 was to become proficient in Rust, so I decided to take on a substantial project
3. What’s wrong with Spark/JVM?
• Spark is actually pretty neat, but …
• Garbage collection overheads can be huge
• OutOfMemory errors are common
• Java serialization is inefficient, even with Kryo
• Expensive up-front query planning and code generation make it inefficient for interactive queries and small data sets
• Difficult to configure, monitor, and debug
• Generally row-oriented, even when working with columnar data sources
5. Let’s build something better!
• Rust > JVM:
• Raw performance of compiled code
• Efficient memory usage
• Predictable memory usage
• No serialization overhead to map raw bytes to Rust structs (see the sketch after this list)
• Access to hardware (SIMD, DMA, etc)
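
As a rough illustration of the "no serialization overhead" point: a buffer of native-endian f64 values can be viewed in place as a Rust slice, with no per-value deserialization step. This is only a sketch, not Arrow or DataFusion code; Arrow's own buffers add the alignment and validity-bitmap handling that a bare byte slice lacks.

    /// Sketch: view an 8-byte-aligned buffer of native-endian f64 values
    /// in place, without copying or decoding each value.
    fn view_as_f64(bytes: &[u8]) -> &[f64] {
        // align_to reinterprets the aligned middle of the slice without copying.
        let (prefix, values, suffix) = unsafe { bytes.align_to::<f64>() };
        assert!(
            prefix.is_empty() && suffix.is_empty(),
            "buffer must be 8-byte aligned and a multiple of 8 bytes long"
        );
        values
    }
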
6. Keep Calm and Keep Columnar
• Column-oriented > Row-oriented
• Just load the columns you need from disk (efficient projections)
• “a > b” and “a + b” are now vectorized operations that can take advantage of SIMD (Single Instruction, Multiple Data); see the sketch after this list
• Apache Arrow is a standardized columnar in-memory format for zero-copy data interchange between systems
• Apache Parquet is a columnar file format with efficient per-column encoding and compression
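
A minimal sketch of the vectorization point above: with columnar data, “a > b” and “a + b” become tight loops over contiguous slices that the compiler can auto-vectorize. The column contents and function names here are illustrative, not DataFusion code.

    // Columnar comparison: one tight loop over two contiguous f64 columns.
    fn greater_than(a: &[f64], b: &[f64]) -> Vec<bool> {
        a.iter().zip(b).map(|(x, y)| x > y).collect()
    }

    // Columnar addition: the same shape of loop, a good SIMD candidate.
    fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
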
8. DataFusion
• DataFusion is a proof-of-concept of a modern distributed compute platform, implemented in Rust
• Programming model is similar to Apache Spark (DataFrame and SQL APIs); see the sketch after this list
• Apache Arrow is used for the core memory model
• Apache Parquet is partially supported (read-only and no support for nested types yet)
• CSV is supported too (where there is Big Data, there is CSV)
• etcd is used for co-ordination between nodes
• Kubernetes/Docker deployment model (planned)
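
A rough sketch of the SQL API flow (register a CSV file as a table, run a query, show the results). The names below (SessionContext, register_csv, CsvReadOptions) come from later DataFusion releases; the 2018 proof-of-concept API differed, so treat the exact calls and the file path as assumptions.

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // Register a CSV file as a queryable table (path is illustrative).
        ctx.register_csv("locations", "locations.csv", CsvReadOptions::new())
            .await?;
        // Run SQL against the registered table and print the result batches.
        let df = ctx.sql("SELECT lat, lng FROM locations LIMIT 10").await?;
        df.show().await?;
        Ok(())
    }
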
11. First Benchmark
• Simple job to convert lat/lng pairs into ESRI WKT (well-known text) format
• SELECT ST_AsText(ST_Point(lat, lng)) FROM locations
• Reads from CSV file
• Calls two UDFs and creates one UDT (see the sketch after this list)
• Writes results to CSV file
• Single thread, single core
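
To make the benchmark concrete, here is an illustrative sketch of what the two UDFs and the UDT amount to per row: ST_Point constructs a point value from two numeric columns and ST_AsText renders it as WKT. The Rust names and the sample coordinates are hypothetical; this is not the actual DataFusion UDF registration code.

    // Illustrative stand-in for the user-defined type (UDT) in the query.
    struct Point {
        x: f64,
        y: f64,
    }

    // ST_Point: construct a point value from two numeric column values.
    fn st_point(x: f64, y: f64) -> Point {
        Point { x, y }
    }

    // ST_AsText: render the point in WKT, e.g. "POINT (40.015 -105.2705)".
    fn st_as_text(p: &Point) -> String {
        format!("POINT ({} {})", p.x, p.y)
    }

    fn main() {
        // One row of the benchmark: two UDF calls producing one string.
        println!("{}", st_as_text(&st_point(40.0150, -105.2705)));
    }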