SlideShare a Scribd company logo
1 of 39
© 2016 Dremio Corporation @DremioHQ
The future of column-oriented data
processing with Arrow and Parquet
Jacques Nadeau, CTO Dremio, VP Apache Arrow
Julien Le Dem, Principal Architect Dremio, VP Apache Parquet
© 2016 Dremio Corporation @DremioHQ
• CTO of Dremio
• Apache member
• VP Apache Arrow
• Apache PMCs: Arrow, Calcite,
Drill, Incubator
Julien Le Dem
@J_
Jacques Nadeau
@intjesus
• Principal Architect at Dremio
• Formerly Tech Lead at Twitter on
Data Platforms.
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Incubator,
Kudu, Pig, Parquet
© 2016 Dremio Corporation @DremioHQ
Agenda
• Community Driven Standard
• Interoperability and Ecosystem
• Benefits of Columnar representation
– On disk (Apache Parquet)
– In memory (Apache Arrow)
• Future of columnar
© 2016 Dremio Corporation @DremioHQ
Community Driven Standard
© 2016 Dremio Corporation @DremioHQ
An open source standard
• Parquet: Common need for on disk columnar.
• Arrow: Common need for in memory columnar.
• Arrow building on the success of Parquet.
• Benefits:
– Share the effort
– Create an ecosystem
• Standard from the start
© 2016 Dremio Corporation @DremioHQ
The Apache Arrow Project
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of
breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through
Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
© 2016 Dremio Corporation @DremioHQ
Interoperability and Ecosystem
© 2016 Dremio Corporation @DremioHQ
Shared Need => Open Source Opportunity
“We are also considering switching to
a columnar canonical in-memory
format for data that needs to be
materialized during query processing,
in order to take advantage of SIMD
instructions” -Impala Team
“A large fraction of the CPU time is spent
waiting for data to be fetched from main
memory…we are designing cache-friendly
algorithms and data structures so Spark
applications will spend less time waiting to
fetch data from memory and more time
doing useful work” – Spark Team
“Drill provides a flexible hierarchical
columnar data model that can
represent complex, highly dynamic
and evolving data models and allows
efficient processing of it without need
to flatten or materialize.” -Drill Team
© 2016 Dremio Corporation @DremioHQ
High Performance Sharing & Interchange
Before With Arrow
• Each system has its own internal memory
format
• 70-80% CPU wasted on serialization and
deserialization
• Functionality duplication and unnecessary
conversions
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg:
Parquet-to-Arrow reader)
© 2016 Dremio Corporation @DremioHQ
Benefits of Columnar formats
@EmrgencyKittens
© 2016 Dremio Corporation @DremioHQ
Columnar layout
Logical table
representation
Row layout
Column layout
© 2016 Dremio Corporation @DremioHQ
On Disk and in Memory
• Different trade offs
– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly Streaming access.
– In memory: Transient.
• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and Random access.
© 2016 Dremio Corporation @DremioHQ
Parquet on disk columnar format
© 2016 Dremio Corporation @DremioHQ
Parquet on disk columnar format
• Nested data structures
• Compact format:
– type aware encodings
– better compression
• Optimized I/O:
– Projection push down (column pruning)
– Predicate push down (filters based on stats)
© 2016 Dremio Corporation @DremioHQ
Access only the data you need
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
+ =
Columnar Statistics
Read only the
data you need!
© 2016 Dremio Corporation @DremioHQ
Parquet nested representation
Document
DocId Links Name
Backward Forward Language Url
Code Country
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
© 2016 Dremio Corporation @DremioHQ
Arrow in memory columnar format
© 2016 Dremio Corporation @DremioHQ
Arrow goals
• Well-documented and cross language
compatible
• Designed to take advantage of modern CPU
characteristics
• Embeddable in execution engines, storage
layers, etc.
• Interoperable
© 2016 Dremio Corporation @DremioHQ
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
© 2016 Dremio Corporation @DremioHQ
CPU pipeline
© 2016 Dremio Corporation @DremioHQ
Minimize CPU cache misses
a cache miss costs 10 to 100s cycles depending on the level
© 2016 Dremio Corporation @DremioHQ
Focus on CPU Efficiency
Traditional
Memory Buffer
Arrow
Memory Buffer
• Cache Locality
• Super-scalar & vectorized
operation
• Minimal Structure Overhead
• Constant value access
– With minimal structure
overhead
• Operate directly on columnar
compressed data
© 2016 Dremio Corporation @DremioHQ
Arrow Messages, RPC & IPC
© 2016 Dremio Corporation @DremioHQ
Common Message Pattern
• Schema Negotiation
– Logical Description of structure
– Identification of dictionary encoded
Nodes
• Dictionary Batch
– Dictionary ID, Values
• Record Batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
Schema
Negotiation
Dictionary
Batch
Record
Batch
Record
Batch
Record
Batch
1..N
Batches
0..N
Batches
© 2016 Dremio Corporation @DremioHQ
Columnar data
persons = [{
name: ’Joe',
age: 18,
phones: [
‘555-111-1111’,
‘555-222-2222’
]
}, {
name: ’Jack',
age: 37,
phones: [ ‘555-333-3333’ ]
}]
© 2016 Dremio Corporation @DremioHQ
Record Batch Construction
Schema
Negotiation
Dictionary
Batch
Record
Batch
Record
Batch
Record
Batch
name (offset)
name (data)
age (data)
phones (list offset)
phones (data)
data header (describes offsets into data)
name (bitmap)
age (bitmap)
phones (bitmap)
phones (offset)
{
name: ’Joe',
age: 18,
phones: [
‘555-111-1111’,
‘555-222-2222’
]
}
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
© 2016 Dremio Corporation @DremioHQ
Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored io
– Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics
© 2016 Dremio Corporation @DremioHQ
Java: Memory Management
• Chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
© 2016 Dremio Corporation @DremioHQ
Language Bindings
Parquet
• Target Languages
– Java
– CPP
– Python & Pandas
• Engines integration:
– Many!
Arrow
• Target Languages
– Java
– CPP, Python
– R (underway)
• Engines integration:
– Drill
– Pandas, R
– Spark (underway)
© 2016 Dremio Corporation @DremioHQ
Execution examples:
© 2016 Dremio Corporation @DremioHQ
RPC: Single system execution
The memory
representation is sent
over the wire.
No serialization
overhead.
Scanner
Scanner
Scanner
Parquet files
projection push down
read only a and b
Partial
Agg
Partial
Agg
Partial
Agg
Agg
Agg
Agg
Shuffle
Arrow batches
Result
© 2016 Dremio Corporation @DremioHQ
Multi-system IPC
SQL engine
Python
process
User
defined
function
SQL
Operator
1
SQL
Operator
2
reads reads
© 2016 Dremio Corporation @DremioHQ
Summary and Future
© 2016 Dremio Corporation @DremioHQ
Where is the bottleneck?
L1 cache reference 1x
Branch mispredict 3x
L2 cache reference 4x
Main memory reference 100x
Non-volatile Memory 200x
3D Xpoint Read 1600x
RDMA Read 1600x
SSD Read 16,000x
Spinning Disk read 3,000,000x
Sources
- https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
- http://www.anandtech.com/show/9470/intel-and-micron-announce-3d-xpoint-nonvolatile-
memory-technology-1000x-higher-performance-endurance-than-nand
CPU
Bound
IO
Bound
© 2016 Dremio Corporation @DremioHQ
RPC: arrow based storage interchange
The memory
representation is sent
over the wire.
No serialization
overhead.
Scanner
projection/predicate
push down
Operator
Arrow batches
Storage
Mem
Disk
SQL
execution
Scanner Operator
Scanner Operator
Storage
Mem
Disk
Storage
Mem
Disk
…
© 2016 Dremio Corporation @DremioHQ
RPC: arrow based cache
The memory
representation is sent
over the wire.
No serialization
overhead.
projection
push down
Operator
Arrow-based
Cache
SQL
execution
Operator
Operator
…
© 2016 Dremio Corporation @DremioHQ
Storage tiering with Non Volatile Memory
© 2016 Dremio Corporation @DremioHQ
What’s Next
• Parquet – Arrow Nested support for Python & C++
• Arrow IPC Implementation
• Arrow RPC & HTTP/2
• Kudu – Arrow integration
• Apache {Spark, Drill} to Arrow Integration
– Faster UDFs, Storage interfaces
• Support for integration with Intel’s Persistent
Memory library via Apache Mnemonic
© 2016 Dremio Corporation @DremioHQ
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}

More Related Content

What's hot

Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyDataWorks Summit
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storageDataWorks Summit
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
 

What's hot (20)

Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
 
Caching and Buffering in HDF5
Caching and Buffering in HDF5Caching and Buffering in HDF5
Caching and Buffering in HDF5
 

Viewers also liked

Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languagesJulien Le Dem
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Josh Elser
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSteve Loughran
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 

Viewers also liked (20)

ORC Files
ORC FilesORC Files
ORC Files
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Calcite meetup-2016-04-20
Calcite meetup-2016-04-20Calcite meetup-2016-04-20
Calcite meetup-2016-04-20
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 

Similar to Strata NY 2016: The future of column-oriented data processing with Arrow and Parquet

The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Building an Apache Hadoop data application
Building an Apache Hadoop data applicationBuilding an Apache Hadoop data application
Building an Apache Hadoop data applicationtomwhite
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Eurostars MODELS Project, System modeling and design exploration of applicati...
Eurostars MODELS Project, System modeling and design exploration of applicati...Eurostars MODELS Project, System modeling and design exploration of applicati...
Eurostars MODELS Project, System modeling and design exploration of applicati...Alessandra Bagnato
 
MODELS, a unified environment for the design of system applications on parall...
MODELS, a unified environment for the design of system applications on parall...MODELS, a unified environment for the design of system applications on parall...
MODELS, a unified environment for the design of system applications on parall...OW2
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney
 

Similar to Strata NY 2016: The future of column-oriented data processing with Arrow and Parquet (20)

The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
Building an Apache Hadoop data application
Building an Apache Hadoop data applicationBuilding an Apache Hadoop data application
Building an Apache Hadoop data application
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
EVOLVE'16 | Enhance | Anil Kalbag & Anshul Chhabra | Comparative Architecture...
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Eurostars MODELS Project, System modeling and design exploration of applicati...
Eurostars MODELS Project, System modeling and design exploration of applicati...Eurostars MODELS Project, System modeling and design exploration of applicati...
Eurostars MODELS Project, System modeling and design exploration of applicati...
 
MODELS, a unified environment for the design of system applications on parall...
MODELS, a unified environment for the design of system applications on parall...MODELS, a unified environment for the design of system applications on parall...
MODELS, a unified environment for the design of system applications on parall...
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
 

More from Julien Le Dem

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesJulien Le Dem
 

More from Julien Le Dem (12)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

Strata NY 2016: The future of column-oriented data processing with Arrow and Parquet

  • 1. © 2016 Dremio Corporation @DremioHQ The future of column-oriented data processing with Arrow and Parquet Jacques Nadeau, CTO Dremio, VP Apache Arrow Julien Le Dem, Principal Architect Dremio, VP Apache Parquet
  • 2. © 2016 Dremio Corporation @DremioHQ • CTO of Dremio • Apache member • VP Apache Arrow • Apache PMCs: Arrow, Calcite, Drill, Incubator Julien Le Dem @J_ Jacques Nadeau @intjesus • Principal Architect at Dremio • Formerly Tech Lead at Twitter on Data Platforms. • Creator of Parquet • Apache member • Apache PMCs: Arrow, Incubator, Kudu, Pig, Parquet
  • 3. © 2016 Dremio Corporation @DremioHQ Agenda • Community Driven Standard • Interoperability and Ecosystem • Benefits of Columnar representation – On disk (Apache Parquet) – In memory (Apache Arrow) • Future of columnar
  • 4. © 2016 Dremio Corporation @DremioHQ Community Driven Standard
  • 5. © 2016 Dremio Corporation @DremioHQ An open source standard • Parquet: Common need for on disk columnar. • Arrow: Common need for in memory columnar. • Arrow building on the success of Parquet. • Benefits: – Share the effort – Create an ecosystem • Standard from the start
  • 6. © 2016 Dremio Corporation @DremioHQ The Apache Arrow Project • New Top-level Apache Software Foundation project – Announced Feb 17, 2016 • Focused on Columnar In-Memory Analytics 1. 10-100x speedup on many workloads 2. Common data layer enables companies to choose best of breed systems 3. Designed to work with any programming language 4. Support for both relational and complex data as-is • Developers from 13+ major open source projects involved – A significant % of the world’s data will be processed through Arrow! Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
  • 7. © 2016 Dremio Corporation @DremioHQ Interoperability and Ecosystem
  • 8. © 2016 Dremio Corporation @DremioHQ Shared Need => Open Source Opportunity “We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” -Impala Team “A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” – Spark Team “Drill provides a flexible hierarchical columnar data model that can represent complex, highly dynamic and evolving data models and allows efficient processing of it without need to flatten or materialize.” -Drill Team
  • 9. © 2016 Dremio Corporation @DremioHQ High Performance Sharing & Interchange Before With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Functionality duplication and unnecessary conversions • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg: Parquet-to-Arrow reader)
  • 10. © 2016 Dremio Corporation @DremioHQ Benefits of Columnar formats @EmrgencyKittens
  • 11. © 2016 Dremio Corporation @DremioHQ Columnar layout Logical table representation Row layout Column layout
  • 12. © 2016 Dremio Corporation @DremioHQ On Disk and in Memory • Different trade offs – On disk: Storage. • Accessed by multiple queries. • Priority to I/O reduction (but still needs good CPU throughput). • Mostly Streaming access. – In memory: Transient. • Specific to one query execution. • Priority to CPU throughput (but still needs good I/O). • Streaming and Random access.
  • 13. © 2016 Dremio Corporation @DremioHQ Parquet on disk columnar format
  • 14. © 2016 Dremio Corporation @DremioHQ Parquet on disk columnar format • Nested data structures • Compact format: – type aware encodings – better compression • Optimized I/O: – Projection push down (column pruning) – Predicate push down (filters based on stats)
  • 15. © 2016 Dremio Corporation @DremioHQ Access only the data you need a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + = Columnar Statistics Read only the data you need!
  • 16. © 2016 Dremio Corporation @DremioHQ Parquet nested representation Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 17. © 2016 Dremio Corporation @DremioHQ Arrow in memory columnar format
  • 18. © 2016 Dremio Corporation @DremioHQ Arrow goals • Well-documented and cross language compatible • Designed to take advantage of modern CPU characteristics • Embeddable in execution engines, storage layers, etc. • Interoperable
  • 19. © 2016 Dremio Corporation @DremioHQ Arrow in memory columnar format • Nested Data Structures • Maximize CPU throughput – Pipelining – SIMD – cache locality • Scatter/gather I/O
  • 20. © 2016 Dremio Corporation @DremioHQ CPU pipeline
  • 21. © 2016 Dremio Corporation @DremioHQ Minimize CPU cache misses a cache miss costs 10 to 100s cycles depending on the level
  • 22. © 2016 Dremio Corporation @DremioHQ Focus on CPU Efficiency Traditional Memory Buffer Arrow Memory Buffer • Cache Locality • Super-scalar & vectorized operation • Minimal Structure Overhead • Constant value access – With minimal structure overhead • Operate directly on columnar compressed data
  • 23. © 2016 Dremio Corporation @DremioHQ Arrow Messages, RPC & IPC
  • 24. © 2016 Dremio Corporation @DremioHQ Common Message Pattern • Schema Negotiation – Logical Description of structure – Identification of dictionary encoded Nodes • Dictionary Batch – Dictionary ID, Values • Record Batch – Batches of records up to 64K – Leaf nodes up to 2B values Schema Negotiation Dictionary Batch Record Batch Record Batch Record Batch 1..N Batches 0..N Batches
  • 25. © 2016 Dremio Corporation @DremioHQ Columnar data persons = [{ name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] }, { name: ’Jack', age: 37, phones: [ ‘555-333-3333’ ] }]
  • 26. © 2016 Dremio Corporation @DremioHQ Record Batch Construction Schema Negotiation Dictionary Batch Record Batch Record Batch Record Batch name (offset) name (data) age (data) phones (list offset) phones (data) data header (describes offsets into data) name (bitmap) age (bitmap) phones (bitmap) phones (offset) { name: ’Joe', age: 18, phones: [ ‘555-111-1111’, ‘555-222-2222’ ] } Each box (vector) is contiguous memory The entire record batch is contiguous on wire
  • 27. © 2016 Dremio Corporation @DremioHQ Moving Data Between Systems RPC • Avoid Serialization & Deserialization • Layer TBD: Focused on supporting vectored io – Scatter/gather reads/writes against socket IPC • Alpha implementation using memory mapped files – Moving data between Python and Drill • Working on shared allocation approach – Shared reference counting and well-defined ownership semantics
  • 28. © 2016 Dremio Corporation @DremioHQ Java: Memory Management • Chunk-based managed allocator – Built on top of Netty’s JEMalloc implementation • Create a tree of allocators – Limit and transfer semantics across allocators – Leak detection and location accounting • Wrap native memory from other applications
  • 29. © 2016 Dremio Corporation @DremioHQ Language Bindings Parquet • Target Languages – Java – CPP – Python & Pandas • Engines integration: – Many! Arrow • Target Languages – Java – CPP, Python – R (underway) • Engines integration: – Drill – Pandas, R – Spark (underway)
  • 30. © 2016 Dremio Corporation @DremioHQ Execution examples:
  • 31. © 2016 Dremio Corporation @DremioHQ RPC: Single system execution The memory representation is sent over the wire. No serialization overhead. Scanner Scanner Scanner Parquet files projection push down read only a and b Partial Agg Partial Agg Partial Agg Agg Agg Agg Shuffle Arrow batches Result
  • 32. © 2016 Dremio Corporation @DremioHQ Multi-system IPC SQL engine Python process User defined function SQL Operator 1 SQL Operator 2 reads reads
  • 33. © 2016 Dremio Corporation @DremioHQ Summary and Future
  • 34. © 2016 Dremio Corporation @DremioHQ Where is the bottleneck? L1 cache reference 1x Branch mispredict 3x L2 cache reference 4x Main memory reference 100x Non-volatile Memory 200x 3D Xpoint Read 1600x RDMA Read 1600x SSD Read 16,000x Spinning Disk read 3,000,000x Sources - https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html - http://www.anandtech.com/show/9470/intel-and-micron-announce-3d-xpoint-nonvolatile- memory-technology-1000x-higher-performance-endurance-than-nand CPU Bound IO Bound
  • 35. © 2016 Dremio Corporation @DremioHQ RPC: arrow based storage interchange The memory representation is sent over the wire. No serialization overhead. Scanner projection/predicate push down Operator Arrow batches Storage Mem Disk SQL execution Scanner Operator Scanner Operator Storage Mem Disk Storage Mem Disk …
  • 36. © 2016 Dremio Corporation @DremioHQ RPC: arrow based cache The memory representation is sent over the wire. No serialization overhead. projection push down Operator Arrow-based Cache SQL execution Operator Operator …
  • 37. © 2016 Dremio Corporation @DremioHQ Storage tiering with Non Volatile Memory
  • 38. © 2016 Dremio Corporation @DremioHQ What’s Next • Parquet – Arrow Nested support for Python & C++ • Arrow IPC Implementation • Arrow RPC & HTTP/2 • Kudu – Arrow integration • Apache {Spark, Drill} to Arrow Integration – Faster UDFs, Storage interfaces • Support for integration with Intel’s Persistent Memory library via Apache Mnemonic
  • 39. © 2016 Dremio Corporation @DremioHQ Get Involved • Join the community – dev@{arrow,parquet}.apache.org – Slack: • https://apachearrowslackin.herokuapp.com/ – http://{arrow,parquet}.apache.org – Follow @Apache{Parquet,Arrow}