SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
DREMIO
Faster conclusions using in-memory
columnar SQL and machine learning
Strata San Jose - March 30, 2016
Apache	
  
Arrow	
  
DREMIO
Who
Wes McKinney
•  Engineer at Cloudera, formerly
DataPad CEO/founder
•  Wrote bestseller Python for
Data Analysis 2012
•  Open source projects
–  Python {pandas, Ibis,
statsmodels}
–  Apache {Arrow, Parquet, Kudu
(incubating)}
•  Mostly work in Python and
Cython/C/C++
Jacques Nadeau
•  CTO & Co-Founder at
Dremio, formerly Architect
at MapR
•  Open Source projects
–  Apache {Arrow, Parquet,
Calcite, Drill, HBase,
Phoenix}
•  Mostly work in Java
DREMIO
Arrow in a Slide
•  New Top-level Apache Software Foundation project
–  Announced Feb 17, 2016
•  Focused on Columnar In-Memory Analytics
1.  10-100x speedup on many workloads
2.  Common data layer enables companies to choose best of breed
systems
3.  Designed to work with any programming language
4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
–  A significant % of the world’s data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
DREMIO
Agenda
•  Purpose
•  Memory Representation
•  Language Bindings
•  IPC & RPC
•  Example Integrations
DREMIO
Purpose
DREMIO
Overview
•  A high speed in-memory representation
•  Well-documented and cross language
compatible
•  Designed to take advantage of modern
CPU characteristics
•  Embeddable in execution engines, storage
layers, etc.
DREMIO
Focus on CPU Efficiency
1331246660
1331246351
1331244570
1331261196
3/8/2012 2:44PM
3/8/2012 2:38PM
3/8/2012 2:09PM
3/8/2012 6:46PM
99.155.155.225
65.87.165.114
71.10.106.181
76.102.156.138
Row 1
Row 2
Row 3
Row 4
1331246660
1331246351
1331244570
1331261196
3/8/2012 2:44PM
3/8/2012 2:38PM
3/8/2012 2:09PM
3/8/2012 6:46PM
99.155.155.225
65.87.165.114
71.10.106.181
76.102.156.138
session_id
timestamp
source_ip
Traditional
Memory Buffer	
  
Arrow
Memory Buffer	
  
•  Cache Locality
•  Super-scalar & vectorized
operation
•  Minimal Structure
Overhead
•  Constant value access
–  With minimal structure
overhead
•  Operate directly on
columnar compressed data
DREMIO
High Performance Sharing & Interchange
Today With Arrow
•  Each system has its own internal
memory format
•  70-80% CPU wasted on serialization
and deserialization
•  Similar functionality implemented in
multiple projects
•  All systems utilize the same memory
format
•  No overhead for cross-system
communication
•  Projects can share functionality (eg,
Parquet-to-Arrow reader)
Pandas Drill
Impala
HBase
KuduCassandra
Parquet
Spark
Arrow Memory
Pandas Drill
Impala
HBase
KuduCassandra
Parquet
Spark
Copy & Convert
Copy & Convert
Copy & Convert
Copy & Convert
Copy & Convert
DREMIO
Shared Need > Open Source Opportunity
•  Columnar is Complex
•  Shredded Columnar is even
more complex
•  We all need to go to same
place
•  Take Advantage of Open
Source approach
•  Once we pick a shared
solution, we get interchange
for “free”
“We	
  are	
  also	
  considering	
  switching	
  to	
  
a	
  columnar	
  canonical	
  in-­‐memory	
  
format	
  for	
  data	
  that	
  needs	
  to	
  be	
  
materialized	
  during	
  query	
  processing,	
  
in	
  order	
  to	
  take	
  advantage	
  of	
  SIMD	
  
instrucBons”	
  -­‐Impala	
  Team	
  
“A	
  large	
  fracBon	
  of	
  the	
  CPU	
  Bme	
  is	
  spent	
  
waiBng	
  for	
  data	
  to	
  be	
  fetched	
  from	
  main	
  
memory…we	
  are	
  designing	
  cache-­‐friendly	
  
algorithms	
  and	
  data	
  structures	
  so	
  Spark	
  
applicaBons	
  will	
  spend	
  less	
  Bme	
  waiBng	
  to	
  
fetch	
  data	
  from	
  memory	
  and	
  more	
  Bme	
  
doing	
  useful	
  work	
  –	
  Spark	
  Team	
  
DREMIO
In Memory Representation
DREMIO
Columnar data persons	
  =	
  [{	
  
	
  name:	
  'wes',	
  
	
  iq:	
  180,	
  
	
  addresses:	
  [	
  
	
  {number:	
  2,	
  street	
  'a'},	
  	
  
	
  {number:	
  3,	
  street	
  'bb'}	
  
	
  ]	
  
},	
  {	
  
	
  name:	
  'joe',	
  
	
  iq:	
  100,	
  
	
  addresses:	
  [	
  
	
  	
   	
  {number:	
  4,	
  street	
  'ccc'},	
  	
  
	
  	
   	
  {number:	
  5, 	
  street	
  'dddd'},	
  	
  
	
  	
   	
  {number:	
  2,	
  street	
  'f'}	
  
	
  	
   	
  ]	
  
}]	
  
DREMIO
Simple Example: persons.iq
person.iq
180
100
DREMIO
Simple Example: persons.addresses.number
person.addresses
0
2
5
person.addresses.number
2
3
4
5
6
offset
DREMIO
Columnar data
person.addresses.street
person.addresses
0
2
5
offset
0
1
3
6
10
a
b
b
c
c
c
d
d
d
d
f
person.addresses.number
2
3
4
5
6
offset
DREMIO
Language Bindings
DREMIO
Language Bindings
•  Target Languages
–  Java (beta)
–  CPP (underway)
–  Python & Pandas (underway)
–  R
–  Julia
•  Initial Focus
–  Read a structure
–  Write a structure
–  Manage Memory
DREMIO
Java: Creating Dynamic Off-heap Structures
FieldWriter	
  w=	
  getWriter();	
  
w.varChar("name").write("Wes");	
  
w.integer("iq").write(180);	
  
ListWriter	
  list	
  =	
  writer.list("addresses");	
  
list.startList();	
  
	
  	
  MapWriter	
  map	
  =	
  list.map();	
  
	
  	
  map.start();	
  
	
  	
  	
  	
  map.integer("number").writeInt(2);	
  
	
  	
  	
  	
  map.varChar("street").write("a");	
  
	
  	
  map.end();	
  
	
  	
  map.start();	
  
	
  	
  	
  	
  map.integer("number").writeInt(3);	
  
	
  	
  	
  	
  map.varChar("street").write("bb");	
  
	
  	
  map.end();	
  
list.endList();	
  
{	
  
	
  name:	
  'wes',	
  
	
  iq:	
  180,	
  
	
  addresses:	
  [	
  
	
  {number:	
  2,	
  street	
  'a'},	
  	
  
	
  {number:	
  3,	
  street	
  'bb'}	
  
	
  ]	
  
}	
  	
  
Json	
  RepresentaBon	
   ProgrammaBc	
  ConstrucBon	
  
DREMIO
Java: Memory Management (& NVMe)
•  Chunk-based managed allocator
–  Built on top of Netty’s JEMalloc implementation
•  Create a tree of allocators
–  Limit and transfer semantics across allocators
–  Leak detection and location accounting
•  Wrap native memory from other applications
•  New support for integration with Intel’s Persistent
Memory library via Apache Mnemonic
DREMIO
RPC & IPC
DREMIO
Common Message Pattern
•  Schema Negotiation
–  Logical Description of structure
–  Identification of dictionary
encoded Nodes
•  Dictionary Batch
–  Dictionary ID, Values
•  Record Batch
–  Batches of records up to 64K
–  Leaf nodes up to 2B values
Schema	
  
NegoBaBon	
  
DicBonary	
  
Batch	
  
Record	
  
Batch	
  
Record	
  
Batch	
  
Record	
  
Batch	
  
1..N	
  
Batches	
  
0..N	
  
Batches	
  
DREMIO
Record Batch Construction
Schema	
  
NegoBaBon	
  
DicBonary	
  
Batch	
  
Record	
  
Batch	
  
Record	
  
Batch	
  
Record	
  
Batch	
  
name	
  (offset)	
  
name	
  (data)	
  
iq	
  (data)	
  
addresses	
  (list	
  offset)	
  
addresses.number	
  
addresses.street	
  (offset)	
   addresses.street	
  (data)	
  
data	
  header	
  (describes	
  offsets	
  into	
  data)	
  
name	
  (bitmap)	
  
iq	
  (bitmap)	
  
addresses	
  (bitmap)	
  
addresses.number	
  (bitmap)	
  
addresses.street	
  (bitmap)	
  
{	
  
	
  name:	
  'wes',	
  
	
  iq:	
  180,	
  
	
  addresses:	
  [	
  
	
  {number:	
  2,	
  	
  
	
  	
  	
  	
  	
  	
  	
  street	
  'a'},	
  	
  
	
  {number:	
  3,	
  	
  
	
  	
  	
  	
  	
  	
  	
  street	
  'bb'}	
  
	
  ]	
  
}	
  
Each	
  box	
  is	
  
conBguous	
  memory,	
  
enBrely	
  conBguous	
  on	
  
wire	
  
DREMIO
RPC & IPC: Moving Data Between Systems
RPC
•  Avoid Serialization & Deserialization
•  Layer TBD: Focused on supporting vectored io
–  Scatter/gather reads/writes against socket
IPC
•  Alpha implementation using memory mapped files
–  Moving data between Python and Drill
•  Working on shared allocation approach
–  Shared reference counting and well-defined ownership
semantics
DREMIO
Real World Examples
DREMIO
Real World Example: Python With Spark or Drill
in partition 0
…
in partition
n - 1
SQL Engine
Python
function
input
Python
function
input
User-supplied
Python code
output
output
out partition 0
…
out partition
n - 1
SQL Engine
DREMIO
Real World Example: Feather File Format for
Python and R
•  Problem: fast, language-
agnostic binary data
frame file format
•  Written by Wes
McKinney (Python)
Hadley Wickham (R)
•  Read speeds close to
disk IO performance
Arrow array 0
Arrow array 1
…
Arrow array n
Feather
metadata
Feather file
Apache Arrow
memory
Google
flatbuffers
DREMIO
Real World Example: Feather File Format for
Python and R
library(feather)	
  
	
  	
  
path	
  <-­‐	
  "my_data.feather"	
  
write_feather(df,	
  path)	
  
	
  	
  
df	
  <-­‐	
  read_feather(path)	
  
import	
  feather	
  
	
  	
  
path	
  =	
  'my_data.feather'	
  
	
  	
  
feather.write_dataframe(df,	
  path)	
  
df	
  =	
  feather.read_dataframe(path)	
  
R	
   Python	
  
DREMIO
What’s Next
•  Parquet for Python & C++
– Using Arrow Representation
•  Available IPC Implementation
•  Spark, Drill Integration
– Faster UDFs, Storage interfaces
DREMIO
Get Involved
•  Join the community
– dev@arrow.apache.org
– Slack:
https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow, @wesmckinn, @intjesus

Contenu connexe

Tendances

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...Uwe Korn
 

Tendances (20)

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
 

En vedette

Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 

En vedette (13)

Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 

Similaire à Apache Arrow (Strata-Hadoop World San Jose 2016)

HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsGuido Schmutz
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"Andy Grove
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 

Similaire à Apache Arrow (Strata-Hadoop World San Jose 2016) (20)

HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Rust is for "Big Data"
Rust is for "Big Data"Rust is for "Big Data"
Rust is for "Big Data"
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 

Plus de Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 

Plus de Wes McKinney (18)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

Dernier

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Apache Arrow (Strata-Hadoop World San Jose 2016)

  • 1. DREMIO Faster conclusions using in-memory columnar SQL and machine learning Strata San Jose - March 30, 2016 Apache   Arrow  
  • 2. DREMIO Who Wes McKinney •  Engineer at Cloudera, formerly DataPad CEO/founder •  Wrote bestseller Python for Data Analysis 2012 •  Open source projects –  Python {pandas, Ibis, statsmodels} –  Apache {Arrow, Parquet, Kudu (incubating)} •  Mostly work in Python and Cython/C/C++ Jacques Nadeau •  CTO & Co-Founder at Dremio, formerly Architect at MapR •  Open Source projects –  Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix} •  Mostly work in Java
  • 3. DREMIO Arrow in a Slide •  New Top-level Apache Software Foundation project –  Announced Feb 17, 2016 •  Focused on Columnar In-Memory Analytics 1.  10-100x speedup on many workloads 2.  Common data layer enables companies to choose best of breed systems 3.  Designed to work with any programming language 4.  Support for both relational and complex data as-is •  Developers from 13+ major open source projects involved –  A significant % of the world’s data will be processed through Arrow! Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
  • 4. DREMIO Agenda •  Purpose •  Memory Representation •  Language Bindings •  IPC & RPC •  Example Integrations
  • 6. DREMIO Overview •  A high speed in-memory representation •  Well-documented and cross language compatible •  Designed to take advantage of modern CPU characteristics •  Embeddable in execution engines, storage layers, etc.
  • 7. DREMIO Focus on CPU Efficiency 1331246660 1331246351 1331244570 1331261196 3/8/2012 2:44PM 3/8/2012 2:38PM 3/8/2012 2:09PM 3/8/2012 6:46PM 99.155.155.225 65.87.165.114 71.10.106.181 76.102.156.138 Row 1 Row 2 Row 3 Row 4 1331246660 1331246351 1331244570 1331261196 3/8/2012 2:44PM 3/8/2012 2:38PM 3/8/2012 2:09PM 3/8/2012 6:46PM 99.155.155.225 65.87.165.114 71.10.106.181 76.102.156.138 session_id timestamp source_ip Traditional Memory Buffer   Arrow Memory Buffer   •  Cache Locality •  Super-scalar & vectorized operation •  Minimal Structure Overhead •  Constant value access –  With minimal structure overhead •  Operate directly on columnar compressed data
  • 8. DREMIO High Performance Sharing & Interchange Today With Arrow •  Each system has its own internal memory format •  70-80% CPU wasted on serialization and deserialization •  Similar functionality implemented in multiple projects •  All systems utilize the same memory format •  No overhead for cross-system communication •  Projects can share functionality (eg, Parquet-to-Arrow reader) Pandas Drill Impala HBase KuduCassandra Parquet Spark Arrow Memory Pandas Drill Impala HBase KuduCassandra Parquet Spark Copy & Convert Copy & Convert Copy & Convert Copy & Convert Copy & Convert
  • 9. DREMIO Shared Need > Open Source Opportunity •  Columnar is Complex •  Shredded Columnar is even more complex •  We all need to go to same place •  Take Advantage of Open Source approach •  Once we pick a shared solution, we get interchange for “free” “We  are  also  considering  switching  to   a  columnar  canonical  in-­‐memory   format  for  data  that  needs  to  be   materialized  during  query  processing,   in  order  to  take  advantage  of  SIMD   instrucBons”  -­‐Impala  Team   “A  large  fracBon  of  the  CPU  Bme  is  spent   waiBng  for  data  to  be  fetched  from  main   memory…we  are  designing  cache-­‐friendly   algorithms  and  data  structures  so  Spark   applicaBons  will  spend  less  Bme  waiBng  to   fetch  data  from  memory  and  more  Bme   doing  useful  work  –  Spark  Team  
  • 11. DREMIO Columnar data persons  =  [{    name:  'wes',    iq:  180,    addresses:  [    {number:  2,  street  'a'},      {number:  3,  street  'bb'}    ]   },  {    name:  'joe',    iq:  100,    addresses:  [        {number:  4,  street  'ccc'},          {number:  5,  street  'dddd'},          {number:  2,  street  'f'}        ]   }]  
  • 16. DREMIO Language Bindings •  Target Languages –  Java (beta) –  CPP (underway) –  Python & Pandas (underway) –  R –  Julia •  Initial Focus –  Read a structure –  Write a structure –  Manage Memory
  • 17. DREMIO Java: Creating Dynamic Off-heap Structures FieldWriter  w=  getWriter();   w.varChar("name").write("Wes");   w.integer("iq").write(180);   ListWriter  list  =  writer.list("addresses");   list.startList();      MapWriter  map  =  list.map();      map.start();          map.integer("number").writeInt(2);          map.varChar("street").write("a");      map.end();      map.start();          map.integer("number").writeInt(3);          map.varChar("street").write("bb");      map.end();   list.endList();   {    name:  'wes',    iq:  180,    addresses:  [    {number:  2,  street  'a'},      {number:  3,  street  'bb'}    ]   }     Json  RepresentaBon   ProgrammaBc  ConstrucBon  
  • 18. DREMIO Java: Memory Management (& NVMe) •  Chunk-based managed allocator –  Built on top of Netty’s JEMalloc implementation •  Create a tree of allocators –  Limit and transfer semantics across allocators –  Leak detection and location accounting •  Wrap native memory from other applications •  New support for integration with Intel’s Persistent Memory library via Apache Mnemonic
  • 20. DREMIO Common Message Pattern •  Schema Negotiation –  Logical Description of structure –  Identification of dictionary encoded Nodes •  Dictionary Batch –  Dictionary ID, Values •  Record Batch –  Batches of records up to 64K –  Leaf nodes up to 2B values Schema   NegoBaBon   DicBonary   Batch   Record   Batch   Record   Batch   Record   Batch   1..N   Batches   0..N   Batches  
  • 21. DREMIO Record Batch Construction Schema   NegoBaBon   DicBonary   Batch   Record   Batch   Record   Batch   Record   Batch   name  (offset)   name  (data)   iq  (data)   addresses  (list  offset)   addresses.number   addresses.street  (offset)   addresses.street  (data)   data  header  (describes  offsets  into  data)   name  (bitmap)   iq  (bitmap)   addresses  (bitmap)   addresses.number  (bitmap)   addresses.street  (bitmap)   {    name:  'wes',    iq:  180,    addresses:  [    {number:  2,                  street  'a'},      {number:  3,                  street  'bb'}    ]   }   Each  box  is   conBguous  memory,   enBrely  conBguous  on   wire  
  • 22. DREMIO RPC & IPC: Moving Data Between Systems RPC •  Avoid Serialization & Deserialization •  Layer TBD: Focused on supporting vectored io –  Scatter/gather reads/writes against socket IPC •  Alpha implementation using memory mapped files –  Moving data between Python and Drill •  Working on shared allocation approach –  Shared reference counting and well-defined ownership semantics
  • 24. DREMIO Real World Example: Python With Spark or Drill in partition 0 … in partition n - 1 SQL Engine Python function input Python function input User-supplied Python code output output out partition 0 … out partition n - 1 SQL Engine
  • 25. DREMIO Real World Example: Feather File Format for Python and R •  Problem: fast, language- agnostic binary data frame file format •  Written by Wes McKinney (Python) Hadley Wickham (R) •  Read speeds close to disk IO performance Arrow array 0 Arrow array 1 … Arrow array n Feather metadata Feather file Apache Arrow memory Google flatbuffers
  • 26. DREMIO Real World Example: Feather File Format for Python and R library(feather)       path  <-­‐  "my_data.feather"   write_feather(df,  path)       df  <-­‐  read_feather(path)   import  feather       path  =  'my_data.feather'       feather.write_dataframe(df,  path)   df  =  feather.read_dataframe(path)   R   Python  
  • 27. DREMIO What’s Next •  Parquet for Python & C++ – Using Arrow Representation •  Available IPC Implementation •  Spark, Drill Integration – Faster UDFs, Storage interfaces
  • 28. DREMIO Get Involved •  Join the community – dev@arrow.apache.org – Slack: https://apachearrowslackin.herokuapp.com/ – http://arrow.apache.org – @ApacheArrow, @wesmckinn, @intjesus