Parquet
Columnar Storage for the People
Lars George
EMEA Chief Architect @ Cloudera
lars@cloudera.com
About Me
• EMEA Chief Architect at Cloudera
• Apache Committer
  ‣ HBase and Whirr
• O’Reilly Author
  ‣ HBase – The Definitive Guide
    - Now in Japanese! (日本語版も出ました! "The Japanese edition is now out!")
• Contact
  ‣ lars@cloudera.com
  ‣ @larsgeorge
Introduction
For analytical workloads it is often advantageous to
store the data in a layout that is more amenable to
the way it is accessed.
Parquet is an open-source file format that strives to
do exactly that, i.e. provide an efficient layout for
analytical queries.
We will be looking at some context from various companies, the results observed in production and in benchmarks, and finally do a bit of a format deep dive.
Example: Twitter
• Twitter’s Data
  ‣ 230M+ monthly active users generating and consuming 500M+ tweets a day
  ‣ 100TB+ a day of compressed data
  ‣ Huge scale for instrumentation, user graph, derived data, etc.
• Analytics Infrastructure
  ‣ Several 1K+ node Hadoop clusters
  ‣ Log Collection Pipeline
  ‣ Processing Tools
Example: Twitter
Twitter’s Use-case
• Logs available on HDFS
• Thrift to store logs
• Example schema: 87 columns, up to 7 levels of nesting
Example: Twitter
Goal:
“To have a state of the art columnar storage available across the Hadoop platform”
• Hadoop is very reliable for big long running queries, but also I/O heavy
• Incrementally take advantage of column based storage in existing framework
• Not tied to any framework in particular
Columnar Storage
• Limits I/O to data actually needed
  ‣ Loads only the columns that need to be accessed
• Saves space
  ‣ Columnar layout compresses better
  ‣ Type specific encodings
• Enables vectorized execution engines
Columnar vs Row-based
Here is an example of translating a logical table schema into its physical layout. In a row-based layout each row follows the next, while a column-oriented layout stores one column after the next.
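As a small illustration (example values chosen here, they are not from the original slide): for a table with columns A, B and C and two rows, the two layouts store the values as follows.

Row-based:       A1 B1 C1 A2 B2 C2
Column-oriented: A1 A2 B1 B2 C1 C2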
Parquet Intro
Parquet defines a common file format, which is language independent and formally specified.
Implementations exist in Java for MapReduce and in C++, which is used by Impala.
Example: Impala Results
Example: Impala TPC-DS
Example: Criteo
• Billions of new events per day
• Roughly 60 columns per log
• Heavy analytic workload
• BI analysts using Hive and RCFile
• Frequent schema modifications
• Perfect use case for Parquet + Hive!
Parquet + Hive: Basic Requirements
• MapReduce compatibility due to Hive
• Correctly handle evolving schemas across Parquet files
• Read only the columns used by the query to minimize the data read
• Interoperability with other execution engines, for example Pig, Impala, etc.
Performance
Example: Twitter
• Petabytes of storage saved
• Example jobs taking advantage of projection push-down:
  ‣ Job 1 (Pig): reading 32% less data -> 20% task time saving
  ‣ Job 2 (Scalding): reading 14 out of 35 columns, reading 80% less data -> 66% task time saving
  ‣ Terabytes of scanning saved every day
Parquet Model
• The algorithm is borrowed from Google Dremel’s ColumnIO file format
• Schema is defined in a familiar format
• Supports nested data structures
• Each cell is encoded as a triplet: repetition level, definition level, and the value
• Level values are bound by the depth of the schema
  ‣ Stored in a compact form
Parquet Model
• Schema similar to Protocol Buffers, but with simplifications (e.g. no Maps, Lists or Sets)
  ‣ These complex types can be expressed as a combination of the other features
• Root of schema is a group of fields called a message
• Field types are either group or primitive type with repetition of required, optional or repeated
  ‣ exactly one, zero or one, or zero or more
Example Schema
message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
Represent Lists/Sets
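A sketch in the same schema notation (not the original diagram; field names chosen here for illustration): a list or set of strings can be expressed as a plain repeated field, just like ownerPhoneNumbers above, and when the elements themselves may be null the field is wrapped in a repeated group:

message Document {
  repeated group tags {
    optional string value;
  }
}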
Representing Maps
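A map can likewise be sketched as a repeated group of key/value pairs (again a sketch with names chosen here for illustration):

message Document {
  repeated group attributes {
    required string key;
    optional string value;
  }
}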
Schema as a Tree
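Rendered as text, the AddressBook schema forms the following tree:

AddressBook
├─ owner (required string)
├─ ownerPhoneNumbers (repeated string)
└─ contacts (repeated group)
   ├─ name (required string)
   └─ phoneNumber (optional string)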
Field per Primitive
Primitive fields are mapped to the columns of the columnar format. For the AddressBook schema these are owner, ownerPhoneNumbers, contacts.name, and contacts.phoneNumber.
Levels
The structure of the record is captured for each
value by two integers called repetition level and
definition level.
Using these two levels we can fully reconstruct the
nested structures while still being able to store each
primitive separately.
Definition Levels
Example:
message ExampleDefinitionLevel {
  optional group a {
    optional group b {
      optional string c;
    }
  }
}
Contains one column “a.b.c” where all fields are optional and can be null.
Definition Levels
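As an illustration (the value "foo" is chosen here), the definition level of a value in column “a.b.c” records how far down the path the record is actually defined:

a: null                      -> D = 0
a: { b: null }               -> D = 1
a: { b: { c: null } }        -> D = 2
a: { b: { c: "foo" } }       -> D = 3 (value present)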
Definition Levels
Example with a required field:
message ExampleDefinitionLevel {
  optional group a {
    required group b {
      optional string c;
    }
  }
}
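Because required fields are always defined, they add no definition level of their own; with b required the maximum definition level of “a.b.c” drops from 3 to 2 (again with an illustrative value):

a: null                      -> D = 0
a: { b: { c: null } }        -> D = 1
a: { b: { c: "foo" } }       -> D = 2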
Repetition Levels
Repeated fields require that we store where a list starts in a column of values, since these are stored sequentially in the same place. The repetition level denotes per value where a new list starts; it is basically a marker which also indicates the level at which to start the new list.
Only levels that are repeated need a repetition level, i.e. optional or required fields are never repeated and can be skipped while attributing repetition levels.
Repetition Levels
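A worked sketch (schema and values chosen here to match the level1/level2 naming used in the bullets below):

message NestedLists {
  repeated group level1 {
    repeated string level2;
  }
}

Record 1: level1: { level2: "a", level2: "b", level2: "c" },
          level1: { level2: "d", level2: "e" }
Record 2: level1: { level2: "f" }

The column level1.level2 then stores: a (R=0), b (R=2), c (R=2), d (R=1), e (R=2), f (R=0).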
Repetition Levels
• 0 marks every new record and implies creating a new level1 and level2 list
• 1 marks every new level1 list and implies creating a new level2 list as well
• 2 marks every new element in a level2 list
Combining the Levels
Applying the two to the AddressBook example:
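The resulting maximum levels per column (reconstructed from the schema, not the original table) are:

Column                 Max repetition level   Max definition level
owner                  0                      0
ownerPhoneNumbers      1                      1
contacts.name          1                      1
contacts.phoneNumber   1                      2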
In particular for the column “contacts.phoneNumber”, a
defined phone number will have the maximum definition
level of 2, and a contact without phone number will have a
definition level of 1. In the case where contacts are absent,
it will be 0.
Example: AddressBook
AddressBook {
  owner: "Julien Le Dem",
  ownerPhoneNumbers: "555 123 4567",
  ownerPhoneNumbers: "555 666 1337",
  contacts: {
    name: "Dmitriy Ryaboy",
    phoneNumber: "555 987 6543",
  },
  contacts: {
    name: "Chris Aniszczyk"
  }
}
AddressBook {
  owner: "A. Nonymous"
}

Looking at contacts.phoneNumber:
Example: AddressBook
AddressBook {
  contacts: {
    phoneNumber: "555 987 6543"
  }
  contacts: {
  }
}
AddressBook {
}
Example: AddressBook
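The contacts.phoneNumber column for these two records holds three triplets (reconstructed from the write-up below):

R=0, D=2, value "555 987 6543"
R=1, D=1, value NULL
R=0, D=0, value NULL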
Example: AddressBook
When writing:
• contacts.phoneNumber: “555 987 6543”
  ‣ new record: R = 0
  ‣ value defined: D = max (2)
• contacts.phoneNumber: NULL
  ‣ repeated contacts: R = 1
  ‣ only defined up to contacts: D = 1
• contacts: NULL
  ‣ new record: R = 0
  ‣ only defined up to AddressBook: D = 0
Example: AddressBook
During reading:
• R=0, D=2, Value = “555 987 6543”:
  ‣ R = 0 means a new record. We recreate the nested records from the root until the definition level (here 2).
  ‣ D = 2 is the maximum. The value is defined and is inserted.
• R=1, D=1:
  ‣ R = 1 means a new entry in the contacts list at level 1.
  ‣ D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts.
• R=0, D=0:
  ‣ R = 0 means a new record. We create the nested records from the root until the definition level.
  ‣ D = 0 means contacts is actually null, so we only have an empty AddressBook.
Example: AddressBook
Reading the column back yields the original records:
AddressBook {
  contacts: {
    phoneNumber: "555 987 6543"
  }
  contacts: {
  }
}
AddressBook {
}
Storing Levels
Each primitive type has three sub-columns, though the overhead is low thanks to the columnar representation and the fact that values are bound by the depth of the schema, resulting in only a few bits used.
When all fields are required in a flat schema we can omit the levels altogether, since they would always be zero.
Otherwise compression, such as RLE, takes care of condensing data efficiently.
File Format
• Row Groups: a group of rows in columnar format
  ‣ One (or more) per split while reading
  ‣ Max size buffered in memory while writing
  ‣ About 50MB < row group < 1GB
• Column Chunk: the data for one column in a row group
  ‣ Column chunks can be read independently for efficient scans
• Page: unit of access in a column chunk
  ‣ Should be big enough for efficient compression
  ‣ Min size to read while accessing a single record
  ‣ About 8KB < page < 1MB
File Format
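A simplified text sketch of the on-disk layout (the original slide shows this as a diagram; details per the parquet-format specification):

Magic number "PAR1"
Row Group 1
  Column chunk A (pages ...)
  Column chunk B (pages ...)
Row Group 2
  ...
Footer: file metadata (schema, column chunk offsets per row group)
Footer length
Magic number "PAR1"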
File Format
•

Layout
‣

‣

•

Row groups in
columnar format
footer contains
column chunks
offset and
schema

Language
independent
‣

Well defined
format

‣

Hadop and
Impala support
Integration
Hive and Pig natively support projection push-down: based on the query executed, only the columns for the fields accessed are fetched.
MapReduce and other tools use a globbing syntax, for example:
field1;field2/**;field4{subfield1,subfield2}
This will return field1, all the columns under field2, and subfield1 and subfield2 under field4, but not field3.
Encodings
• Bit Packing
  ‣ Small integers encoded in the minimum bits required
  ‣ Useful for repetition levels, definition levels and dictionary keys
• Run Length Encoding (RLE)
  ‣ Used in combination with bit packing
  ‣ Cheap compression
  ‣ Works well for definition levels of sparse columns
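As a small worked illustration (values chosen here): for an optional top-level column the definition levels can only be 0 or 1, so each level needs a single bit, and a run such as 1 1 1 1 0 1 1 1 can be stored as the runs (4 x 1), (1 x 0), (3 x 1).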
Encodings
…continued:
• Dictionary Encoding
  ‣ Useful for columns with few (<50k) distinct values
  ‣ When applicable, compresses better and faster than heavyweight algorithms (e.g. gzip, lzo, snappy)
• Extensible
  ‣ Defining new encodings is supported by the format
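As an illustration of dictionary encoding (values chosen here): a country column containing US, US, DE, US, FR can be stored as the dictionary {0: US, 1: DE, 2: FR} plus the key sequence 0 0 1 0 2, which is in turn bit-packed and run-length encoded as described above.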
Future
• Parquet 2.0
  ‣ More encodings
    - Delta encodings, improved encodings
  ‣ Statistics
    - For query planners and predicate pushdown
  ‣ New page format
    - Skip ahead better
Questions?
• Contact: @larsgeorge, lars@cloudera.com
• Sources
  ‣ Parquet Sources: https://github.com/parquet/parquet-format
  ‣ Blog Post with Info: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  ‣ Impala Source: https://github.com/cloudera/impala
  ‣ Impala: http://www.cloudera.com/content/cloudera/en/campaign/introducing-impala.html
