Parquet performance tuning:
The missing guide
Ryan Blue
Strata + Hadoop World NY 2016
Contents.
● Big data at Netflix
● Parquet format background
● Optimization basics
● Stats and dictionary filtering
● Format 2 and compression
● Future work
Big data at Netflix.
40+ PB data warehouse · Read: 3 PB · Write: 300 TB · 600B events
Strata San Jose results. [results chart omitted]
Metrics dataset.
Based on Atlas, Netflix’s telemetry platform.
● Performance monitoring backend and UI
● http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
Example metrics data.
● Partitioned by day and cluster
● Columns include metric time, name, value, and host
● Measurements for each minute are stored in a Parquet table
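To make the description concrete, here is a hypothetical Hive DDL for such a table (column names and types are assumptions, not from this deck; requires a HiveContext):

sqlContext.sql("""
  CREATE TABLE metrics (
    time BIGINT,   -- metric timestamp (assumed epoch millis)
    name STRING,   -- metric name
    value DOUBLE,  -- measured value
    host STRING    -- reporting host
  )
  PARTITIONED BY (day INT, cluster STRING)
  STORED AS PARQUET
""")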
Parquet format background.
Parquet data layout.
ROW GROUPS.
● Data needed for a group of rows to be reassembled
● Smallest task or input split size
● Made of COLUMN CHUNKS
COLUMN CHUNKS.
● Contiguous data for a single column
● Made of DATA PAGES and an optional DICTIONARY PAGE
DATA PAGES.
● Encoded and compressed runs of values
Row groups.
[Diagram: columns A, B, C, D of a table; each row group holds a contiguous slice of rows (a1 b1 c1 d1 … aN bN cN dN) and fits within an HDFS block.]
Column chunks and pages.
[Diagram: within a row group, each column chunk is a sequence of data pages, with an optional dictionary page ("dict") at the front.]
Read less data.
Columnar organization.
● Encoding: make the data smaller
● Column projection: read only the columns you need
Row group filtering.
● Use footer stats to eliminate row groups
● Use dictionary pages to eliminate row groups
Page filtering.
● Use page stats to eliminate pages
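A minimal Spark sketch of projection plus push-down, using the metrics table described earlier (column names assumed):

// read only two column chunks; with push-down enabled,
// stats on 'value' let the reader skip whole row groups
val lowValues = sqlContext
  .table("metrics")
  .select("name", "value")
  .filter("value < 0.8")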
Basics.
Setup.
Parquet writes:
● Version 1.8.1 or later – includes fix for incorrect statistics, PARQUET-251
● 1.9.0 due in October
Reads:
● Presto: Used 0.139
● Spark: Used version 1.6.1 reading from Hive
● Pig: Used parquet-pig 1.9.0 for predicate push-down
Pig configuration.
-- enable pushdown/filtering
set parquet.pig.predicate.pushdown.enable true;
-- enables stats and dictionary filtering
set parquet.filter.statistics.enabled true;
set parquet.filter.dictionary.enabled true;
Spark configuration.
// turn on Parquet push-down, stats filtering, and dictionary filtering
sqlContext.setConf("parquet.filter.statistics.enabled", "true")
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
// use the non-Hive read path
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// turn off schema merging, which turns off push-down
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
Writing the data.
Spark:
sqlContext
.table("raw_metrics")
.write.insertInto("metrics")
Pig:
metricsData = LOAD 'raw_metrics'
USING SomeLoader;
STORE metricsData INTO 'metrics'
USING ParquetStorer;
Running this write, however, fails with:
OutOfMemoryError
or
ParquetRuntimeException
Writing too many files.
Data doesn’t match partitioning.
● Each task writes a separate file for every partition it receives rows for, and every open file buffers a full row group in memory
Symptoms:
● OutOfMemoryError
● ParquetRuntimeException: New Memory allocation 1047284 bytes is smaller than the
minimum allocation size of 1048576 bytes.
● The job succeeds, but writes lots of small files and makes split planning slow
[Diagram: Task 1 writes into part=1/ and part=2/, Task 2 into part=3/ and part=4/, and so on – each task keeps files open in many partition directories.]
Account for partitioning.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
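Sorting by the partition columns clusters each task's rows, so a task holds only one or two files open at a time. As a hedged alternative sketch (Spark 1.6+ supports expression-based repartitioning), a shuffle by the partition columns achieves the same grouping without a total sort:

import org.apache.spark.sql.functions.col

sqlContext
  .table("raw_metrics")
  .repartition(col("day"), col("cluster"))  // rows for one partition land in one task
  .write.insertInto("metrics")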
Filter to select partitions.
Spark.
val partition = sqlContext
.table("metrics")
.filter("day = 20160929")
.filter("cluster = 'emr_adhoc'")
Pig.
metricsData = LOAD 'metrics'
USING ParquetLoader;
partition = FILTER metricsData BY
day == 20160929 AND
cluster == 'emr_adhoc';
Stats filters.
Sample query.
Spark.
val low_cpu_count = partition
.filter("name = 'system.cpu.utilization'")
.filter("value < 0.8")
.count()
Pig.
low_cpu = FILTER partition BY
name == 'system.cpu.utilization' AND
value < 0.8;
low_cpu_count = FOREACH
(GROUP low_cpu ALL) GENERATE
COUNT(name);
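Before running, it is worth a sanity check that the predicates reach the scan; a hedged sketch (explain output format varies by Spark version):

val low_cpu = partition
  .filter("name = 'system.cpu.utilization'")
  .filter("value < 0.8")
low_cpu.explain()
// expect the Parquet scan to list the predicates, e.g.
// PushedFilters: [EqualTo(name,system.cpu.utilization), LessThan(value,0.8)]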
My job was 5 minutes faster!
Did it work?
● Success metrics: S3 bytes read, CPU time spent
S3N: Number of bytes read: 1,366,228,942,336
CPU time spent (ms): 280,218,780
● Filter didn’t work. Bytes read shows the entire partition was read.
● What happened?
Inspect the file.
● Stats show what happened:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "z..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "A..." / "z..."
● Every row group matched the query
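Row group stats like these can be printed from a data file with parquet-tools (the file name here is hypothetical):

parquet-tools meta part-00000.parquet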
Add query columns to the sort.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster", "name")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster, name;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
Inspect the file, again.
● Stats are fixed:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "F..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "F..." / "N..."
...
Row group 2: count: 86712 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 86712 61.52 B 0 "N..." / "b..."
Dictionary filters.
Dictionary filtering.
Dictionary is a compact list of all the values.
● Search term missing? Skip the row group
● Like a bloom filter without false positives
When dictionary filtering helps:
● When a column is sorted within each file but not globally – only one row group per file matches
● When filtering an unsorted column
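Conceptually, the check is exact set membership – an illustration, not Parquet's actual reader API:

// a dictionary page acts as an exact membership filter for its row group
def canSkipRowGroup(dictionaryValues: Set[String], searchTerm: String): Boolean =
  !dictionaryValues.contains(searchTerm)  // absent from the dictionary => no row can match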
Dictionary filtering overhead.
Read overhead.
● Extra seeks
● Extra page reads
Not a problem in practice.
● Reading both dictionary and row group resulted in < 1% penalty
● Stats filtering prevents unnecessary dictionary reads
Works out of the box, right?
Nope.
● Only works when columns are completely dictionary-encoded
● Plain-encoded pages can contain any value, dictionary is no help
● All pages in a chunk must use the dictionary
Dictionary fallback rules:
● If dictionary + references > plain encoding, fall back
● If dictionary size is too large, fall back (default threshold: 1 MB)
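A rough illustration with hypothetical numbers: 85,000 values averaging ~61 B plain-encode to ~5.2 MB, while 1,000 distinct values make a ~61 KB dictionary plus ~10-bit references (~100 KB) – so the dictionary wins easily, until the dictionary itself crosses the 1 MB threshold and the writer falls back.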
Fallback to plain encoding.
parquet-tools dump -d
utc_timestamp_ms TV=142990 RL=0 DL=1 DS: 833491 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED V:RLE SZ:72912
page 1: DLE:RLE RLE:BIT_PACKED V:RLE SZ:135022
page 2: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 3: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 4: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:714941
What’s happening:
● Values repeat, but change over time
● Dictionary gets too large, falls back to plain encoding
● Dictionary encoding is a size win!
Avoid encoding fallback.
Increase max dictionary size.
● 2-3 MB usually worked
● parquet.dictionary.page.size
Decrease row group size.
● 24, 32, or 64 MB
● parquet.block.size
● New dictionary for each row group
● Also lowers memory consumption!
Run several tests to find the right configuration (per table).
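A hedged Spark sketch of these settings (the property names are from this deck; propagation to the writer via the Hadoop configuration is an assumption that may vary by engine and version):

// set Parquet writer properties on the job's Hadoop configuration
sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", 3 * 1024 * 1024) // 3 MB dictionary limit
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)          // 32 MB row groups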
Row group size.
Other reasons to decrease row group size:
● Reduce memory consumption – though fixing partitioning, not shrinking row groups, is the cure for write-side OOM
● Increase number of tasks / parallelism
Results!
Results (from Pig).
CPU and wall time dropped.
● Initial: CPU Time: 280,218,780 ms Wall Time: 15m 27s
● Filtered: CPU Time: 120,275,590 ms Wall Time: 9m 51s
● Final: CPU Time: 9,593,700 ms Wall Time: 6m 47s
Bytes read is much better.
● Initial: S3 bytes read: 1,366,228,942,336 (1.24 TB)
● Filtered: S3 bytes read: 49,195,996,736 (45.82 GB)
Filtered vs. final time.
Row group filtering is parallel.
● Split planning is independent of stats (otherwise planning itself becomes a bottleneck)
● Lots of very small tasks: read the footer, read the dictionary, stop processing
Combine splits in Pig/MR for better wall time.
● 1 GB splits tend to work well (e.g. via pig.maxCombinedSplitSize)
Other work.
Format version 2.
What’s included:
● New encodings: delta-integer, prefix-binary
● New page format to enable page-level filtering
New encodings didn’t help with Netflix data.
● Delta-integer didn’t help significantly, even with timestamps (high overhead?)
● Prefixes in URL and JSON data weren’t long enough for prefix-binary to help
Page filtering isn’t implemented (yet).
Brotli compression.
● New compression library, from Google
● Based on LZ77, with compatible license
Faster compression, smaller files, or both.
● brotli-5: 19.7% smaller, 2.7% slower – 1 day of data from Kafka
● brotli-4: 14.8% smaller, 12.5% faster – 1 hour, 4 largest Parquet tables
● brotli-1: 8.1% smaller, 28.3% faster – JSON-heavy dataset
Future work.
Short term:
● Release Parquet 1.9.0
● Test Zstd compression
● Convert embedded JSON to Avro – good preliminary results
Long-term:
● New encodings: Zig-zag RLE, patching, and floating point decomposition
● Page-level filtering
Thank you!
Questions?
https://jobs.netflix.com/
rblue@netflix.com