Parquet

Open columnar storage for Hadoop
Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
http://parquet.io

Twitter Context
Twitter’s data:
• 100TB+ a day of compressed data
• 200M+ monthly active users generating and consuming 500M+ tweets a day
• Scale is huge: Instrumentation, User graph, Derived data, ...
Analytics infrastructure:
• Log collection pipeline
• Several 1K+ node Hadoop clusters
• Processing tools
Role of Twitter’s analytics infrastructure team:
• Platform for the whole company.
• Manages the data and enables analysis.
• Optimizes the cluster’s workload as a whole.
(Slide image: "The Parquet Planers", Gustave Caillebotte)

- Hundreds of TB
- Hadoop: a distributed file system and a parallel execution engine focused on data locality
- clusters: ad hoc, prod, ...
- Log collection pipeline (Scribe, Kafka, ...)
- Processing tools (Pig...)
- Every group at Twitter uses the infrastructure: spam, ads, mobile, search, recommendations, ...
- Optimize space/IO/CPU usage.
- The clusters are constantly growing and never big enough.
- We ingest hundreds of terabytes with IO-bound processes: improved compression saves space, and column storage enables better scans.

Twitter’s use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

struct LogEvent {
1: optional logbase.LogBase log_base
2: optional i64 event_value
3: optional string context
4: optional string referring_event
...
18: optional EventNamespace event_namespace
19: optional list<Item> items
20: optional map<AssociationType,Association> associations
21: optional MobileDetails mobile_details
22: optional WidgetDetails widget_details
23: optional map<ExternalService,string> external_ids
}

struct LogBase {
1: string transaction_id,
2: string ip_address,
...
15: optional string country,
16: optional string pid,
}

- Logs stored using Thrift with heavily nested data structure.
- Example: aggregate event counts per IP address for a given event_type; we just need to access the ip_address and event_name columns.

Goal
To have a state-of-the-art columnar storage format available across the Hadoop platform.
Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.

Columnar storage
• Limits the IO to only the data that is needed.
• Saves space: a columnar layout compresses better.
• Enables better scans: load only the columns that need to be accessed.
• Enables vectorized execution engines.

Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala

Twitter: Initial results
Data converted: similar to access logs. 30 columns.
Original format: Thrift binary in block-compressed files.
(Charts: space used and scan time, Thrift vs. Parquet, for 1 column and all 30 columns)
Space saving: 28% using the same compression algorithm.
Scan + assembly time compared to original:
• One column: 10%
• All columns: 114%

Additional gains with dictionary
13 out of the 30 columns are suitable for dictionary encoding:
they represent 27% of raw data but only 3% of compressed data.
(Charts: space used and scan time relative to Thrift, comparing Parquet compressed (LZO), Parquet dictionary uncompressed, and Parquet dictionary compressed (LZO), for 1 and 13 columns)
Space saving: another 52% using the same compression algorithm (on top of the original columnar storage gains).
Scan + assembly time compared to plain Parquet:
• All 13 columns: 48% (of the already faster columnar scan)

Format
(Diagram: a file contains row groups; each row group stores one column chunk per column (a, b, c); each column chunk is a sequence of pages)
• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB
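
To make the hierarchy above concrete, here is a minimal Java sketch (hypothetical names, not the actual Parquet metadata classes) of a file as row groups, column chunks and pages, and of why a projected scan only touches the chunks of the projected columns:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the layout described above; illustrative only.
public class LayoutSketch {

    record Page(byte[] bytes) {}                              // unit of access (and compression)
    record ColumnChunk(String column, List<Page> pages) {}    // all pages of one column in a row group
    record RowGroup(List<ColumnChunk> chunks) {}              // a group of rows stored column by column
    record FileLayout(List<RowGroup> rowGroups) {}            // the footer would list chunk offsets and the schema

    // A projected scan selects only the chunks of the requested columns.
    static List<ColumnChunk> chunksToScan(FileLayout file, List<String> projection) {
        List<ColumnChunk> selected = new ArrayList<>();
        for (RowGroup rowGroup : file.rowGroups()) {
            for (ColumnChunk chunk : rowGroup.chunks()) {
                if (projection.contains(chunk.column())) {
                    selected.add(chunk);                      // other columns are never read from disk
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        RowGroup rowGroup = new RowGroup(List.of(
            new ColumnChunk("ip_address", List.of(new Page(new byte[0]))),
            new ColumnChunk("event_value", List.of(new Page(new byte[0])))));
        FileLayout file = new FileLayout(List.of(rowGroup));
        System.out.println(chunksToScan(file, List.of("ip_address")).size());  // prints 1
    }
}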

Format
Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
Language independent: well-defined format, with Hadoop and Cloudera Impala support.

Nested record shredding/assembly
• Algorithm borrowed from Google Dremel's column IO.
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bound by the depth of the schema: stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Columns (the leaves of the schema) and their maximum levels:
Column           Max rep. level   Max def. level
DocId            0                0
Links.Backward   1                2
Links.Forward    1                2

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Resulting column values (R = repetition level, D = definition level):
Column           value   R   D
DocId            20      0   0
Links.Backward   10      0   2
Links.Backward   30      1   2
Links.Forward    80      0   2

- We convert from a nested data structure to flat columnar storage, so we need to store extra information to maintain the record structure.
- We can picture the schema as a tree. The columns are the leaves, identified by their path in the schema.
- In addition to the values we store two integers called the repetition level and the definition level.
- When the value is NULL, the definition level captures up to which level in the path the value was defined.
- The repetition level is the last level in the path where there was a repetition. 0 means we have a new record.
- Both levels are bound by the depth of the tree. They can be encoded efficiently using bit packing and run length encoding.
- A flat schema with all fields required does not have any overhead, as R and D are always 0 in that case.
Reference: http://research.google.com/pubs/pub36632.html

Repetition level
record 1: [[a, b, c], [d, e, f, g]]
record 2: [[h], [i, j]]

Level: 0,2,2,1,2,2,2,0,1,2
Data: a,b,c,d,e,f,g,h,i,j
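
To show how these levels come about, here is a minimal sketch (not the parquet-mr shredder; the names are illustrative) that computes repetition levels for a column nested in two repeated levels, such as a list of lists, and reproduces the levels above:

import java.util.ArrayList;
import java.util.List;

// Computes repetition levels for values nested in two repeated levels.
// [[a,b,c],[d,e,f,g]] and [[h],[i,j]] give levels 0,2,2,1,2,2,2,0,1,2.
public class RepetitionLevels {

    static void shred(List<List<String>> record, List<Integer> levels, List<String> data) {
        boolean firstValueOfRecord = true;
        for (int i = 0; i < record.size(); i++) {
            List<String> inner = record.get(i);
            for (int j = 0; j < inner.size(); j++) {
                int level;
                if (firstValueOfRecord) {
                    level = 0;        // first value of a new record
                } else if (j == 0) {
                    level = 1;        // starts a new inner list within the same record
                } else {
                    level = 2;        // continues the same inner list
                }
                firstValueOfRecord = false;
                levels.add(level);
                data.add(inner.get(j));
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> levels = new ArrayList<>();
        List<String> data = new ArrayList<>();
        shred(List.of(List.of("a", "b", "c"), List.of("d", "e", "f", "g")), levels, data);
        shred(List.of(List.of("h"), List.of("i", "j")), levels, data);
        System.out.println("Level: " + levels);   // [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
        System.out.println("Data:  " + data);     // [a, b, c, d, e, f, g, h, i, j]
    }
}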

Differences between Parquet and ORC
(Diagram: schema tree Document -> DocId, Links -> {Backward, Forward})
Parquet:
• Repetition/definition levels capture the structure.
  => one column per leaf in the schema.
• Array<int> is one column.
• Nullity/repetition of an inner node is stored in each of its children.
  => one column independently of nesting, with some redundancy.
ORC:
• An extra column for each Map or List to record their size.
  => one column per node in the schema.
• Array<int> is two columns: array size and content.
  => an extra column per nesting level.

Iteration on fully assembled records
• To integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles the projection for any subset of the columns: only those are loaded from disk.
(Diagram: columns a and b reassembled into rows (a1, b1), (a2, b2), (a3, b3); the Document record reassembled from a projection of its columns: DocId 20, Links.Backward 10, 30, Links.Forward 80)

Explain in the context of example.
- columns: for implementing column based engines
- record based: for integrating with existing row based engines (Hive, Pig, M/R).

Iteration on columns
• To implement a column-based execution engine.
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.

Example column (max definition level 1); D < 1 => Null, R = 1 => same row:
Row   R   D   V
0     0   1   A
1     0   1   B
      1   1   C
2     0   0   (null)
3     0   1   D
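
As an illustration of this iteration model, here is a minimal sketch (illustrative only, not the parquet-mr reader API) that walks the triplets of the column above, starting a new row on R = 0 and reporting null when D is below the column's maximum definition level:

import java.util.List;

// Walks a column of (repetition level, definition level, value) triplets.
public class TripletIteration {

    record Triplet(int r, int d, String v) {}

    public static void main(String[] args) {
        int maxDefinitionLevel = 1;               // a single repeated field in this example
        List<Triplet> column = List.of(
            new Triplet(0, 1, "A"),               // row 0
            new Triplet(0, 1, "B"),               // row 1
            new Triplet(1, 1, "C"),               // still row 1 (R = 1 => same row)
            new Triplet(0, 0, null),              // row 2 (D < 1 => null)
            new Triplet(0, 1, "D"));              // row 3

        int row = -1;
        for (Triplet t : column) {
            if (t.r() == 0) row++;                               // repetition level 0: new record
            boolean isNull = t.d() < maxDefinitionLevel;         // definition level below max: null
            System.out.println("row " + row + ": " + (isNull ? "null" : t.v()));
        }
    }
}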

APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do.
  • Event-based, SAX-style record materialization layer. No double conversion. (Sketched below.)
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive
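
To illustrate what an event-based, SAX-style materialization layer looks like, here is a deliberately simplified, hypothetical sketch (the interface and class names are invented for illustration; this is not the parquet-mr API). Column readers would push field events and the consumer builds the framework's own representation directly, with no intermediate object tree:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical event-based record consumer, in the spirit of a SAX handler.
interface RecordConsumer {
    void startRecord();
    void field(String name, Object value);    // one call per non-null leaf value
    void endRecord();
}

// Example consumer materializing each record as a Map, standing in for a
// framework-specific type (Pig tuple, Hive row, Avro record, ...).
class MapRecordConsumer implements RecordConsumer {
    private Map<String, Object> current;

    public void startRecord() { current = new LinkedHashMap<>(); }
    public void field(String name, Object value) { current.put(name, value); }
    public void endRecord() { System.out.println("materialized: " + current); }
}

public class MaterializationSketch {
    public static void main(String[] args) {
        RecordConsumer consumer = new MapRecordConsumer();
        // In a real reader these events would be driven by the column data.
        consumer.startRecord();
        consumer.field("ip_address", "10.0.0.1");
        consumer.field("event_value", 42L);
        consumer.endRecord();
    }
}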

- They all define a schema in a different way.

Encodings
• Bit packing:
  • Example: 1 3 2 0 0 2 2 0 packs into 01|11|10|00 00|10|10|00.
  • Small integers encoded in the minimum bits required.
  • Useful for repetition levels, definition levels and dictionary keys.
• Run length encoding:
  • Example: a run of eight 1s is stored as the value 1 and the count 8.
  • Cheap compression.
  • Works well for the definition level of sparse columns.
• Dictionary encoding:
  • Used in combination with bit packing.
  • Useful for columns with few (< 50,000) distinct values.
• Extensible:
  • Defining new encodings is supported by the format.
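
To make the two lightweight encodings concrete, here is a minimal sketch (illustrative only, not the Parquet implementation or its exact byte layout) of bit packing small integers and run length encoding repeated values:

import java.util.ArrayList;
import java.util.List;

public class EncodingSketch {

    // Pack values into bitWidth-bit slots, most significant slot first in each byte.
    // With bitWidth = 2, {1,3,2,0, 0,2,2,0} becomes the bytes 01|11|10|00 and 00|10|10|00.
    static byte[] bitPack(int[] values, int bitWidth) {
        int perByte = 8 / bitWidth;
        byte[] out = new byte[(values.length + perByte - 1) / perByte];
        for (int i = 0; i < values.length; i++) {
            int shift = 8 - bitWidth * (i % perByte + 1);
            out[i / perByte] |= (values[i] & ((1 << bitWidth) - 1)) << shift;
        }
        return out;
    }

    // Run length encode: each run of equal values becomes a (value, count) pair.
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[] {v, 1});
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        for (byte b : bitPack(new int[] {1, 3, 2, 0, 0, 2, 2, 0}, 2)) {
            System.out.println(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
        for (int[] run : runLengthEncode(new int[] {1, 1, 1, 1, 1, 1, 1, 1})) {
            System.out.println("value " + run[0] + " x " + run[1]);   // value 1 x 8
        }
    }
}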

Contributors
Main contributors:
• Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
• Matt Massie (Berkeley AMP lab): predicate and projection push down
• David Chen (LinkedIn): Avro integration improvements

Parquet 2.0
- More encodings: more compact storage without heavyweight compression
  • Delta encodings: for integers, strings and dictionary
  • Improved encoding for strings and booleans
- Statistics: to be used by query planners and ...
- New page format: faster predicate push down
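
As an illustration of the idea behind delta encoding for integers (a sketch of the concept only, not the Parquet 2.0 binary layout): store the first value and then small differences, which bit-pack into far fewer bits than the raw values.

import java.util.Arrays;

public class DeltaEncodingSketch {

    static int[] deltaEncode(int[] values) {
        int[] out = new int[values.length];
        out[0] = values[0];                        // first value kept as is
        for (int i = 1; i < values.length; i++) {
            out[i] = values[i] - values[i - 1];    // small delta instead of a large value
        }
        return out;
    }

    static int[] deltaDecode(int[] encoded) {
        int[] out = new int[encoded.length];
        out[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            out[i] = out[i - 1] + encoded[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] timestamps = {1369000000, 1369000012, 1369000015, 1369000021};
        int[] deltas = deltaEncode(timestamps);    // [1369000000, 12, 3, 6]
        System.out.println(Arrays.toString(deltas));
        System.out.println(Arrays.toString(deltaDecode(deltas)));   // round-trips to the original
    }
}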

How to contribute
Questions? Ideas?
Contribute at: github.com/Parquet
Look for tickets marked as “pick me up!”

