Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea is to partition, and optionally sort, the data on a subset of columns as it is written out (a one-time cost), so that successive reads are more performant for downstream jobs whose SQL operators can exploit this property. Bucketing can enable faster joins (e.g. a single-stage sort merge join), allow a FILTER operation to short-circuit when the file is pre-sorted on the column in the filter predicate, and support quick data sampling.
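The shuffle-avoidance idea can be sketched in plain Python (illustrative only; this is not Spark's implementation, and Spark's bucketing actually hashes with Murmur3Hash rather than a plain modulo):

```python
# Sketch of why bucketing avoids a shuffle at join time: if both
# tables were written out pre-partitioned by hash(key) % n with the
# same n, matching keys are guaranteed to land in the same bucket,
# so the join can proceed bucket-by-bucket with no repartitioning.

NUM_BUCKETS = 4

def bucket_of(key: int) -> int:
    # Stand-in hash; Spark's bucketing actually uses Murmur3Hash.
    return key % NUM_BUCKETS

def write_bucketed(rows):
    """Partition rows into buckets by join key (the one-time write cost)."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in rows:
        buckets[bucket_of(key)].append((key, value))
    return buckets

def bucket_join(buckets_a, buckets_b):
    """Join two identically bucketed tables without any shuffle."""
    out = []
    for ba, bb in zip(buckets_a, buckets_b):  # bucket i joins only bucket i
        index = {}
        for key, value in ba:
            index.setdefault(key, []).append(value)
        for key, value in bb:
            for va in index.get(key, []):
                out.append((key, va, value))
    return out

a = write_bucketed([(1, "a1"), (2, "a2"), (5, "a5")])
b = write_bucketed([(1, "b1"), (5, "b5"), (7, "b7")])
print(sorted(bucket_join(a, b)))  # [(1, 'a1', 'b1'), (5, 'a5', 'b5')]
```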
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown that bucketing can make Spark jobs 3–5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2–3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, such as loading cumulative tables with daily deltas, and the characteristics that help identify candidate jobs that can benefit from bucketing.
2. Agenda
• Why bucketing?
• Why is shuffle bad?
• How to avoid shuffle?
• When to use bucketing?
• Spark’s bucketing support
• Bucketing semantics of Spark vs Hive
• Hive bucketing support in Spark
• SQL Planner improvements
7. Join strategies: Table A JOIN Table B
• Broadcast hash join
- Ship the smaller table to all nodes
- Stream the other table
• Shuffle hash join
- Shuffle both tables
- Hash the smaller one, stream the bigger one
• Sort merge join
- Shuffle both tables
- Sort both tables, buffer one, stream the bigger one
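The merge step of a sort merge join can be sketched in a few lines of plain Python (a sketch, not Spark's code): once both sides are sorted on the join key, which bucketed-and-sorted tables get for free at write time, a single linear pass produces the join without hashing or shuffling.

```python
# Merge step of a sort merge join over two key-sorted inputs.

def sort_merge_join(left, right):
    """left and right are lists of (key, value), each sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Collect all rows with this key on both sides (handles duplicates).
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == key:
                j2 += 1
            for _, lv in left[i:i2]:
                for _, rv in right[j:j2]:
                    out.append((key, lv, rv))
            i, j = i2, j2
    return out

left = [(1, "l1"), (3, "l3"), (3, "l3b"), (7, "l7")]
right = [(3, "r3"), (7, "r7"), (9, "r9")]
print(sort_merge_join(left, right))
# [(3, 'l3', 'r3'), (3, 'l3b', 'r3'), (7, 'l7', 'r7')]
```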
47–48. Bucketing semantics of Spark vs Hive
• Model
- Hive: optimizes reads; writes are costly
- Spark: writes are cheaper; reads are costlier
• Unit of bucketing
- Hive: a single file represents a single bucket
- Spark: a collection of files together comprises a single bucket
• Sort ordering
- Hive: each bucket is sorted globally, which makes it easy to optimize queries reading this data
- Spark: each file is individually sorted, but the bucket as a whole is not globally sorted
• Hashing function
- Hive: Hive’s inbuilt hash
- Spark: Murmur3Hash
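The hashing-function difference is why Spark cannot simply treat a Hive bucketed table as one of its own: the two systems assign the same key to different bucket numbers. A minimal sketch, using `zlib.crc32` purely as a stand-in for Spark's Murmur3Hash (an assumption for illustration; only the Hive side below reflects the real scheme, where an int's hash is the int itself, as with Java's `Integer.hashCode`):

```python
import zlib

NUM_BUCKETS = 8

def hive_bucket(key: int) -> int:
    # Hive: bucket = (hash & Integer.MAX_VALUE) % numBuckets,
    # and the hash of an int key is the int value itself.
    return (key & 0x7FFFFFFF) % NUM_BUCKETS

def sparkish_bucket(key: int) -> int:
    # Stand-in for Spark's Murmur3-based bucketing (illustrative only).
    h = zlib.crc32(key.to_bytes(4, "little"))
    return (h & 0x7FFFFFFF) % NUM_BUCKETS

mismatches = sum(1 for k in range(20) if hive_bucket(k) != sparkish_bucket(k))
print(mismatches > 0)  # the two schemes disagree on bucket placement
```

Because the bucket layouts disagree, reading a Hive bucketed table with Spark's semantics (or vice versa) would silently break the co-partitioning guarantee that bucket joins rely on.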
49. [SPARK-19256] Hive bucketing support
• Introduce Hive’s hashing function [SPARK-17495]
• Enable creating Hive bucketed tables [SPARK-17729]
• Support Hive’s bucketed file format
• Propagate Hive bucketing information to the planner [SPARK-17654]
• Expressing outputPartitioning and requiredChildDistribution
• Creating empty bucket files when there is no data for a bucket
• Allow sort merge join over tables whose bucket counts are multiples of each other
• Support N-way sort merge join
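The multiple-bucket-count bullet rests on a simple modular-arithmetic fact, sketched below in plain Python (a sketch with a stand-in modulo hash; Spark really hashes with Murmur3Hash, but the modulo relationship is what matters): if `h % 8 == j`, then `h % 4 == j % 4`, so bucket i of a 4-bucket table contains exactly the keys found in buckets i and i + 4 of an 8-bucket table, and the planner can pair buckets instead of reshuffling either side.

```python
# Why sort merge join works across tables whose bucket counts are
# multiples of each other: each big-table bucket folds into exactly
# one small-table bucket.

def bucket(key: int, num_buckets: int) -> int:
    # Stand-in hash; only the modulo relationship matters here.
    return key % num_buckets

small_n, big_n = 4, 8
for key in range(100):
    j = bucket(key, big_n)    # bucket in the 8-bucket table
    i = bucket(key, small_n)  # bucket in the 4-bucket table
    assert i == j % small_n   # big bucket j always folds into small bucket j % 4

# So small-table bucket i joins only big-table buckets {i, i + 4}:
pairing = {i: [j for j in range(big_n) if j % small_n == i] for i in range(small_n)}
print(pairing)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```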