SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
EFFI C I EN T C OL U M N AR STORAG E
W I TH
APACH E PARQU ET
Apache: Big Data North America 2017
Ranganathan Balashanmugam, Aconex
“The Tables Have Turned.”
Apache: Big Data North America 2017
“The Tables Have Turned.” - 90°
Apache: Big Data North America 2017
Apache: Big Data North America 2017
ANALY TI CAL OV ER
TRAN SAC TI ON AL
Context
Apache: Big Data North America 2017
SIM PL E S TRUCTURE
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a1 b1 c1 a2 b2 c2 a3 b3 c3
row
a1 a2 a3 b1 b2 b3 c1 c2 c3
columnar
Apache: Big Data North America 2017
“Optimizing the disk seeks.”
“The Tables Have Turned.” - 90°
Apache: Big Data North America 2017
Efficient writes
Efficient reads
Apache: Big Data North America 2017
SIM PL E S TRUCTURE
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a1 b1 c1 a2 b2 c2 a3 b3 c3
row
a1 a2 a3 b1 b2 b3 c1 c2 c3
columnar
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Apache: Big Data North America 2017
N ES TED A N D R EPEATED S TR U C TU R ES
How to preserve in column store?
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
Url: ‘http://a'
Name:
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
Apache: Big Data North America 2017
Colum
nar
Apache: Big Data North America 2017
“Get these performance benefits for nested structures into Hadoop
ecosystem.”
problem
statem
ent
Apache: Big Data North America 2017
M OTI VATI ON
* Allow complex nested data
structures
* Very efficient compression and
encoding schemes
* Support many frameworks
Apache: Big Data North America 2017
“Columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or
programming language.”
Apache: Big Data North America 2017
D ES I G N G OAL S
* Interoperability
* Space efficiency
* Query efficiency
Apache: Big Data North America 2017
parquet
avro thrift protobuf pig hive ….
avro thrift protobuf pig hive ….
object
model
converters
storage
format
object
models
parquet binary file
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
column
readers
scalding
Apache: Big Data North America 2017
Page 1
Page 2
… Page n
parquet file
header - Magic number (4 bytes) : “PAR1”
row group 0
column a column b column c
row group 1
…row group n
footer
Page 1
Page 2
… Page n
Page 1
Page 2
… Page n
src: https://github.com/Parquet/parquet-format
* good enough for
compression efficiency
* 8KB - 1 MB
* good enough to read
* group of rows
* max size buffered while writing
* 50 MB - 1GB
* data of one column in row group
* can be read independently
Apache: Big Data North America 2017
footer
src: https://github.com/Parquet/parquet-format
file metadata (ThriftCompactProtocol)
- version
- schema
row group 0 metadata
footer length (4 bytes)
column 0
- total byte size
- total rows
* type/path/encodings/codec
* num of values
* compressed/uncompressed size
* data page offset
* index page offset
column 1
* …
Magic number (4 bytes): “PAR1”
row group …
column 0
…
column 1
* …
* …
Apache: Big Data North America 2017
page
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
repetition levels
definition levels
values
page header (ThriftCompactProtocol)
encoded
and
com
pressed
Apache: Big Data North America 2017
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Schema
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
Url: ‘http://a'
Name:
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
Documents
Apache: Big Data North America 2017
“Can we represent it in columnar former efficiently and read them back to
their original nested data structure?”
problem
statem
ent
Apache: Big Data North America 2017
“Can we represent it in columnar former efficiently and read them back to
their original nested data structure?”
Dremel encoding
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
fill all the nulls
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
C O L U M N S R D VAL U E
Links.Forward 0 2 20
Links.Forward[1] 1 2 40
Links.Forward[2] 1 2 60
Links.Forward 0 2 80
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
C O L U M N S R D VAL U E
Name.Language.Code en-us
Name.Language[1].Code en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
C O L U M N S R D VAL U E
Name.Language.Code 0 2 en-us
Name.Language[1].Code en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
C O L U M N S R D VAL U E
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
C O L U M N S R D VAL U E
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] 1 1 null
Name[2].Language.Code en-gb
Name.Language.Code null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
Document 2
DocId: 20
Links:
Backward: 10
Backward: 30
Forward: 80
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://c'
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
In the
path to the
field, what
is the last
repeated
field ?
In the
path to the
field, how
many
defined
fields ?
C O L U M N S R D VAL U E
Name.Language.Code 0 2 en-us
Name.Language[1].Code 2 2 en
Name[1].[Language].[Code] 1 1 null
Name[2].Language.Code 1 2 en-gb
Name.Language.Code 0 1 null
Apache: Big Data North America 2017
Document 1
DocId: 10
Links:
Forward: 20
Forward: 40
Forward: 60
<Backward>: NULL
Name:
Language:
Code: ‘en-us’
Country: ‘us’
Language:
Code: ‘en’
<Country>: NULL
Url: ‘http://a'
Name:
<Language>:
<Code>: NULL
<Country>: NULL
Url: ‘http://b'
Name:
Language:
Code: ‘en-gb’
Country: ‘gb’
<Url>: NULL
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
C O L U M N S R D VAL U E
DocId 0 0 10
Links.Forward 0 2 20
Links.Forward[1] 1 2 40
Links.Forward[2] 1 2 60
Links.[Backward] 0 1 null
Name.Language.Code 0 2 en-us
Name.Language.Country 0 3 us
Name.Language[1].Code 2 2 en
Name.Language[1].[Country] 2 2 null
Name.Url 0 2 http://a
Name[1].[Language].[Code] 1 1 null
Name[1].[Language].[Country] 1 1 null
Name[1].Url 1 2 http://b
Name[2].Language.Code 1 2 en-gb
Name[2].Language.Country 1 3 gb
Name[2].[Url] 1 1 null
Apache: Big Data North America 2017
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
A B C
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
OPTI M I Z ATI ONS
projection push down predicate push down read required data
fetch required columns filter records while reading
Apache: Big Data North America 2017
EN C OD I N G
* Plain
* Dictionary Encoding
* Run Length Encoding/ Bit-Packing Hybrid
* Delta Encoding
* Delta-length byte array
* Delta Strings
Apache: Big Data North America 2017
C O M PR ES S I O N TR A D E O F F
* Storage
* IO
* Bandwidth
* CPU time
Thank you
Apache: Big Data North America 2017
Ran.ga.na.than B
@ran_than

Contenu connexe

Tendances

How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...DataWorks Summit/Hadoop Summit
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningJohn Mulhall
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyDataWorks Summit
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 

Tendances (20)

How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
HUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_DunningHUG_Ireland_Streaming_Ted_Dunning
HUG_Ireland_Streaming_Ted_Dunning
 
NEON HDF5
NEON HDF5NEON HDF5
NEON HDF5
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
Caching and Buffering in HDF5
Caching and Buffering in HDF5Caching and Buffering in HDF5
Caching and Buffering in HDF5
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Utilizing HDF4 File Content Maps for the Cloud Computing
Utilizing HDF4 File Content Maps for the Cloud ComputingUtilizing HDF4 File Content Maps for the Cloud Computing
Utilizing HDF4 File Content Maps for the Cloud Computing
 
HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 

Similaire à Apache parquet - Apache big data North America 2017

Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Big Data Spain
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsBenjamin Darfler
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases DataWorks Summit
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, LarusNeo4j
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAsMaxym Kharchenko
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
SAS integration with NoSQL data
SAS integration with NoSQL dataSAS integration with NoSQL data
SAS integration with NoSQL dataKevin Lee
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysDemi Ben-Ari
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs HadoopFujio Turner
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopAhmedabadJavaMeetup
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Data Con LA
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 

Similaire à Apache parquet - Apache big data North America 2017 (20)

Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachn...
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
 
Hadoop databases for oracle DBAs
Hadoop databases for oracle DBAsHadoop databases for oracle DBAs
Hadoop databases for oracle DBAs
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Os Gottfrid
Os GottfridOs Gottfrid
Os Gottfrid
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
SAS integration with NoSQL data
SAS integration with NoSQL dataSAS integration with NoSQL data
SAS integration with NoSQL data
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Apache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - PanoraysApache Spark 101 - Demi Ben-Ari - Panorays
Apache Spark 101 - Demi Ben-Ari - Panorays
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and Workshop
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 

Plus de techmaddy

Qcon London2020 Scaling distributed teams
Qcon London2020  Scaling distributed teamsQcon London2020  Scaling distributed teams
Qcon London2020 Scaling distributed teamstechmaddy
 
Serverless architectures
Serverless architecturesServerless architectures
Serverless architecturestechmaddy
 
Technology -- the first strategy to startups
Technology  -- the first strategy to startupsTechnology  -- the first strategy to startups
Technology -- the first strategy to startupstechmaddy
 
Technology -- the first strategy to startups
Technology  -- the first strategy to startupsTechnology  -- the first strategy to startups
Technology -- the first strategy to startupstechmaddy
 
The best of Apache Kafka Architecture
The best of Apache Kafka ArchitectureThe best of Apache Kafka Architecture
The best of Apache Kafka Architecturetechmaddy
 
Offline First Applications
Offline First ApplicationsOffline First Applications
Offline First Applicationstechmaddy
 
GIDS 2016 Understanding and Building No SQLs
GIDS 2016 Understanding and Building No SQLsGIDS 2016 Understanding and Building No SQLs
GIDS 2016 Understanding and Building No SQLstechmaddy
 
Apache big data 2016 - Speaking the language of Big Data
Apache big data 2016 - Speaking the language of Big DataApache big data 2016 - Speaking the language of Big Data
Apache big data 2016 - Speaking the language of Big Datatechmaddy
 

Plus de techmaddy (8)

Qcon London2020 Scaling distributed teams
Qcon London2020  Scaling distributed teamsQcon London2020  Scaling distributed teams
Qcon London2020 Scaling distributed teams
 
Serverless architectures
Serverless architecturesServerless architectures
Serverless architectures
 
Technology -- the first strategy to startups
Technology  -- the first strategy to startupsTechnology  -- the first strategy to startups
Technology -- the first strategy to startups
 
Technology -- the first strategy to startups
Technology  -- the first strategy to startupsTechnology  -- the first strategy to startups
Technology -- the first strategy to startups
 
The best of Apache Kafka Architecture
The best of Apache Kafka ArchitectureThe best of Apache Kafka Architecture
The best of Apache Kafka Architecture
 
Offline First Applications
Offline First ApplicationsOffline First Applications
Offline First Applications
 
GIDS 2016 Understanding and Building No SQLs
GIDS 2016 Understanding and Building No SQLsGIDS 2016 Understanding and Building No SQLs
GIDS 2016 Understanding and Building No SQLs
 
Apache big data 2016 - Speaking the language of Big Data
Apache big data 2016 - Speaking the language of Big DataApache big data 2016 - Speaking the language of Big Data
Apache big data 2016 - Speaking the language of Big Data
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Apache parquet - Apache big data North America 2017

  • 1. EFFI C I EN T C OL U M N AR STORAG E W I TH APACH E PARQU ET Apache: Big Data North America 2017 Ranganathan Balashanmugam, Aconex
  • 2. “The Tables Have Turned.” Apache: Big Data North America 2017
  • 3. “The Tables Have Turned.” - 90° Apache: Big Data North America 2017
  • 4. Apache: Big Data North America 2017 ANALY TI CAL OV ER TRAN SAC TI ON AL Context
  • 5. Apache: Big Data North America 2017 SIM PL E S TRUCTURE A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a1 b1 c1 a2 b2 c2 a3 b3 c3 row a1 a2 a3 b1 b2 b3 c1 c2 c3 columnar
  • 6. Apache: Big Data North America 2017 “Optimizing the disk seeks.”
  • 7. “The Tables Have Turned.” - 90° Apache: Big Data North America 2017 Efficient writes Efficient reads
  • 8. Apache: Big Data North America 2017 SIM PL E S TRUCTURE A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a1 b1 c1 a2 b2 c2 a3 b3 c3 row a1 a2 a3 b1 b2 b3 c1 c2 c3 columnar
  • 9. message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Apache: Big Data North America 2017 N ES TED A N D R EPEATED S TR U C TU R ES How to preserve in column store? DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ Url: ‘http://a' Name: Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’
  • 10. Apache: Big Data North America 2017 Colum nar
  • 11. Apache: Big Data North America 2017 “Get these performance benefits for nested structures into Hadoop ecosystem.” problem statem ent
  • 12. Apache: Big Data North America 2017 M OTI VATI ON * Allow complex nested data structures * Very efficient compression and encoding schemes * Support many frameworks
  • 13. Apache: Big Data North America 2017 “Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”
  • 14. Apache: Big Data North America 2017 D ES I G N G OAL S * Interoperability * Space efficiency * Query efficiency
  • 15. Apache: Big Data North America 2017 parquet avro thrift protobuf pig hive …. avro thrift protobuf pig hive …. object model converters storage format object models parquet binary file src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/ column readers scalding
  • 16. Apache: Big Data North America 2017 Page 1 Page 2 … Page n parquet file header - Magic number (4 bytes) : “PAR1” row group 0 column a column b column c row group 1 …row group n footer Page 1 Page 2 … Page n Page 1 Page 2 … Page n src: https://github.com/Parquet/parquet-format * good enough for compression efficiency * 8KB - 1 MB * good enough to read * group of rows * max size buffered while writing * 50 MB - 1GB * data of one column in row group * can be read independently
  • 17. Apache: Big Data North America 2017 footer src: https://github.com/Parquet/parquet-format file metadata (ThriftCompactProtocol) - version - schema row group 0 metadata footer length (4 bytes) column 0 - total byte size - total rows * type/path/encodings/codec * num of values * compressed/uncompressed size * data page offset * index page offset column 1 * … Magic number (4 bytes): “PAR1” row group … column 0 … column 1 * … * …
  • 18. Apache: Big Data North America 2017 page src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/ repetition levels definition levels values page header (ThriftCompactProtocol) encoded and com pressed
  • 19. Apache: Big Data North America 2017 message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Schema
  • 20. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ Url: ‘http://a' Name: Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } Documents
  • 21. Apache: Big Data North America 2017 “Can we represent it in columnar former efficiently and read them back to their original nested data structure?” problem statem ent
  • 22. Apache: Big Data North America 2017 “Can we represent it in columnar former efficiently and read them back to their original nested data structure?” Dremel encoding
  • 23. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } fill all the nulls
  • 24. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } C O L U M N S R D VAL U E Links.Forward 0 2 20 Links.Forward[1] 1 2 40 Links.Forward[2] 1 2 60 Links.Forward 0 2 80 In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ?
  • 25. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code en-us Name.Language[1].Code en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  • 26. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  • 27. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] null Name[2].Language.Code en-gb Name.Language.Code null
  • 28. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] 1 1 null Name[2].Language.Code en-gb Name.Language.Code null
  • 29. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL Document 2 DocId: 20 Links: Backward: 10 Backward: 30 Forward: 80 Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://c' message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } In the path to the field, what is the last repeated field ? In the path to the field, how many defined fields ? C O L U M N S R D VAL U E Name.Language.Code 0 2 en-us Name.Language[1].Code 2 2 en Name[1].[Language].[Code] 1 1 null Name[2].Language.Code 1 2 en-gb Name.Language.Code 0 1 null
  • 30. Apache: Big Data North America 2017 Document 1 DocId: 10 Links: Forward: 20 Forward: 40 Forward: 60 <Backward>: NULL Name: Language: Code: ‘en-us’ Country: ‘us’ Language: Code: ‘en’ <Country>: NULL Url: ‘http://a' Name: <Language>: <Code>: NULL <Country>: NULL Url: ‘http://b' Name: Language: Code: ‘en-gb’ Country: ‘gb’ <Url>: NULL message Document { required int64 DocId; optional group Links { repeated int64 Backward; repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; } optional string Url; } } C O L U M N S R D VAL U E DocId 0 0 10 Links.Forward 0 2 20 Links.Forward[1] 1 2 40 Links.Forward[2] 1 2 60 Links.[Backward] 0 1 null Name.Language.Code 0 2 en-us Name.Language.Country 0 3 us Name.Language[1].Code 2 2 en Name.Language[1].[Country] 2 2 null Name.Url 0 2 http://a Name[1].[Language].[Code] 1 1 null Name[1].[Language].[Country] 1 1 null Name[1].Url 1 2 http://b Name[2].Language.Code 1 2 en-gb Name[2].Language.Country 1 3 gb Name[2].[Url] 1 1 null
  • 31. Apache: Big Data North America 2017 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 A B C a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 OPTI M I Z ATI ONS projection push down predicate push down read required data fetch required columns filter records while reading
  • 32. Apache: Big Data North America 2017 EN C OD I N G * Plain * Dictionary Encoding * Run Length Encoding/ Bit-Packing Hybrid * Delta Encoding * Delta-length byte array * Delta Strings
  • 33. Apache: Big Data North America 2017 C O M PR ES S I O N TR A D E O F F * Storage * IO * Bandwidth * CPU time
  • 34. Thank you Apache: Big Data North America 2017 Ran.ga.na.than B @ran_than