Optimizing Hive Queries
Owen O'Malley
Owen O'Malley gave a talk at Hadoop Summit EU 2013 about optimizing Hive queries.
1.
Optimizing Hive Queries
Owen O’Malley, Founder and Architect
owen@hortonworks.com, @owen_omalley
© Hortonworks Inc. 2013
2.
Who Am I?
• Founder and Architect at Hortonworks
  – Working on Hive, working with customers
  – Formerly Hadoop MapReduce & Security
  – Been working on Hadoop since the beginning
• Apache Hadoop, ASF
  – Hadoop PMC (Original VP)
  – Tez, Ambari, Giraph PMC
  – Mentor for: Accumulo, Kafka, Knox
  – Apache Member
3.
Outline
• Data Layout
• Data Format
• Joins
• Debugging
4.
Data Layout
Location, Location, Location
5.
Fundamental Questions
• What is your primary use case?
  – What kind of queries and filters?
• How do you need to access the data?
  – What information do you need together?
• How much data do you have?
  – What is your year-over-year growth?
• How do you get the data?
6.
HDFS Characteristics
• Provides Distributed File System
  – Very high aggregate bandwidth
  – Extreme scalability (up to 100 PB)
  – Self-healing storage
  – Relatively simple to administer
• Limitations
  – Can’t modify existing files
  – Single writer for each file
  – Heavy bias for large files (> 100 MB)
7.
Choices for Layout
• Partitions
  – Top level mechanism for pruning
  – Primary unit for updating tables (& schema)
  – Directory per value of specified column
• Bucketing
  – Hashed into a file, good for sampling
  – Controls write parallelism
• Sort order
  – The order the data is written within file
8.
Example Hive Layout
• Directory Structure
  warehouse/$database/$table
• Partitioning
  /part1=$partValue/part2=$partValue
• Bucketing
  /$bucket_$attempt (e.g. 000000_0)
• Sort
  – Each file is sorted within the file
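The layout above can be sketched as a small path builder. This is only an illustration: the database, table, and column names are made up, and using the integer key itself as the hash is an assumption (Hive's real hash depends on the column type and version).

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Pick a bucket for an integer key (assumed hash: the key itself)."""
    return (key & 0x7FFFFFFF) % num_buckets


def hive_file_path(database, table, partitions, bucket, attempt=0):
    """Build the relative path of one bucket file, following the slide's layout."""
    # One directory level per partition column: name=value
    parts = "/".join(f"{name}={value}" for name, value in partitions)
    # Bucket files are named <6-digit bucket>_<attempt>, e.g. 000000_0
    return f"warehouse/{database}/{table}/{parts}/{bucket:06d}_{attempt}"


# Hypothetical table "orders" in database "sales", bucketed 8 ways
bucket = bucket_for(key=1234, num_buckets=8)
path = hive_file_path("sales", "orders", [("year", "2013"), ("month", "03")], bucket)
print(path)
```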
9.
Layout Guidelines
• Limit the number of partitions
  – 1,000 partitions is much faster than 10,000
  – Nested partitions are almost always wrong
• Gauge the number of buckets
  – Calculate file size and keep files big (200-500 MB)
  – Don’t forget number of files (Buckets * Parts)
• Lay out related tables the same way
  – Partition
  – Bucket and sort order
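A rough way to apply the bucket guideline is to work backwards from a target file size. The sketch below assumes data is spread evenly over partitions and buckets, which real tables rarely are; the function name and the 256 MB default are illustrative choices, not Hive settings.

```python
import math


def plan_buckets(table_mb, num_partitions, target_file_mb=256):
    """Return (buckets per partition, total files, approximate MB per file)."""
    per_partition_mb = table_mb / num_partitions
    # Enough buckets that each file is at most roughly the target size
    buckets = max(1, math.ceil(per_partition_mb / target_file_mb))
    total_files = buckets * num_partitions  # Buckets * Parts
    return buckets, total_files, per_partition_mb / buckets


# Hypothetical 1 TB table split into 100 partitions
buckets, total_files, file_mb = plan_buckets(table_mb=1024 * 1024, num_partitions=100)
print(buckets, total_files, round(file_mb, 1))
```

If the resulting file size falls below the 200-500 MB range, that suggests using fewer buckets or fewer partitions.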
10.
Normalization
• Most databases suggest normalization
  – Keep information about each thing together
  – Customer, Sales, Returns, Inventory tables
• Has lots of good properties, but…
  – Is typically slow to query
• Often best to denormalize during load
  – Write once, read many times
  – Additionally provides snapshots in time
11.
Data Format
How is your data stored?
12.
Choice of Format
• Serde
  – How is each record encoded?
• Input/Output (aka File) Format
  – How are the files stored?
• Primary Choices
  – Text
  – Sequence File
  – RCFile
  – ORC (Coming Soon!)
13.
Text Format
• Critical to pick a Serde
  – Default: ^A’s between fields
  – JSON: top level JSON record
  – CSV: commas between fields (on github)
• Slow to read and write
• Can’t split compressed files
  – Leads to huge maps
• Need to read/decompress all fields
14.
Sequence File
• Traditional MapReduce binary file format
  – Stores keys and values as classes
  – Not a good fit for Hive, which has SQL types
  – Hive always stores entire row as value
• Splittable but only by searching file
  – Default block size is 1 MB
• Need to read and decompress all fields
15.
RC (Row Columnar) File
• Columns stored separately
  – Read and decompress only needed ones
  – Better compression
• Columns stored as binary blobs
  – Depends on metastore to supply types
• Larger blocks
  – 4 MB by default
  – Still search file for split boundary
16.
ORC (Optimized Row Columnar)
• Columns stored separately
• Knows types
  – Uses type-specific encoders
  – Stores statistics (min, max, sum, count)
• Has light-weight index
  – Skip over blocks of rows that don’t matter
• Larger blocks
  – 256 MB by default
  – Has an index for block boundaries
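The idea behind the light-weight index can be illustrated with min/max statistics per block of rows. This is a toy model, not the ORC reader: the `RowGroup` class and `scan_equal` function are hypothetical names, and a single integer column stands in for a real file.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """A block of rows plus the min/max statistics kept in the index."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equal(groups: List[RowGroup], wanted: int) -> Tuple[List[int], int]:
    """Return matching values and how many blocks actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        # The statistics prove this block cannot contain the value: skip it
        if wanted < group.min_value or wanted > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v == wanted)
    return matches, groups_read


groups = [RowGroup([1, 5, 9]), RowGroup([10, 14, 19]), RowGroup([20, 25, 29])]
matches, groups_read = scan_equal(groups, 14)
print(matches, groups_read)
```

Skipping works best when the data is sorted or clustered on the filtered column, so that each block covers a narrow range.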
17.
ORC - File Layout
18.
Example File Sizes from TPC-DS
19.
Compression
• Need to pick level of compression
  – None
  – LZO or Snappy: fast but sloppy
    – Best for temporary tables
  – ZLIB: slow and complete
    – Best for long term storage
20.
Joins
Putting the pieces together
21.
Default Assumption
• Hive assumes users are either:
  – Noobies
  – Hive developers
• Default behavior is to always finish
  – Little Engine that Could!
• Experts could override default behaviors
  – Get better performance, but riskier
• We’re working on improving heuristics
22.
Shuffle Join
• Default choice
  – Always works (I’ve sorted a petabyte!)
  – Worst case scenario
• Each process
  – Reads from part of one of the tables
  – Buckets and sorts on join key
  – Sends one bucket to each reduce
• Works every time!
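The steps above can be sketched in a few lines. This is an in-memory illustration of the technique, not Hive's implementation: the function name is hypothetical, records are (key, value) pairs, and Python's built-in `hash` stands in for the partitioner.

```python
from itertools import groupby


def shuffle_join(left, right, num_reduces):
    """Inner-join two lists of (key, value) records via hash partition + sort."""
    # "Map" side: tag each record with its table and route it by join key
    reduces = [[] for _ in range(num_reduces)]
    for tag, table in ((0, left), (1, right)):
        for key, value in table:
            reduces[hash(key) % num_reduces].append((key, tag, value))

    joined = []
    for records in reduces:
        # "Reduce" side: sort on the join key, then pair up rows per key
        records.sort(key=lambda r: (r[0], r[1]))
        for key, group in groupby(records, key=lambda r: r[0]):
            rows = list(group)
            left_vals = [v for _, tag, v in rows if tag == 0]
            right_vals = [v for _, tag, v in rows if tag == 1]
            joined.extend((key, lv, rv) for lv in left_vals for rv in right_vals)
    return joined


customers = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]
result = sorted(shuffle_join(customers, orders, num_reduces=4))
print(result)
```

Every record from both tables crosses the network once, which is why this join always works but is the worst case for cost.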
23.
Map Join
• One table is small (e.g. dimension table)
  – Fits in memory
• Each process
  – Reads small table into memory hash table
  – Streams through part of the big file
  – Joins each record against the hash table
• Very fast, but limited
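A minimal sketch of the same idea, assuming both tables are lists of (key, value) pairs and the small one fits in a dictionary; the names are illustrative only.

```python
def map_join(small, big):
    """Inner-join by loading the small table into a hash table."""
    # Build phase: the whole small table goes into memory
    hash_table = {}
    for key, value in small:
        hash_table.setdefault(key, []).append(value)

    # Probe phase: stream the big table once, no shuffle or sort needed
    for key, big_value in big:
        for small_value in hash_table.get(key, ()):
            yield key, big_value, small_value


stores = [(10, "Paris"), (20, "Berlin")]                 # small dimension table
sales = [(10, 5.0), (20, 7.5), (10, 1.25), (30, 9.0)]    # big fact table
joined = list(map_join(stores, sales))
print(joined)
```

The limit is the build phase: once the small table no longer fits in memory, this approach stops being an option.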
24.
Sort Merge Bucket (SMB) Join
• If both tables are:
  – Sorted the same
  – Bucketed the same
  – And joining on the sort/bucket column
• Each process:
  – Reads a bucket from each table
  – Processes the row with the lowest value
• Very efficient if applicable
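For one pair of matching buckets, the per-process work is an ordinary merge of two sorted inputs. The sketch below assumes each bucket is a list of (key, value) pairs already sorted by key; `merge_join` is a hypothetical helper, not a Hive API.

```python
def merge_join(left, right):
    """Inner-join two key-sorted lists of (key, value) pairs in one pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        # Always advance the side holding the lowest key
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return out


bucket_a = [(1, "a"), (3, "b"), (3, "c"), (5, "d")]
bucket_b = [(2, "x"), (3, "y"), (5, "z")]
merged = merge_join(bucket_a, bucket_b)
print(merged)
```

No hash table and no sort are needed at query time, because the layout already did that work when the tables were written.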
25.
Debugging
What could possibly go wrong?
26.
Performance Question
• Which of the following is faster?
  – select count(distinct(Col)) from Tbl
  – select count(*) from (select distinct(Col) from Tbl)
27.
Count Distinct
28.
Answer
• Surprisingly the second is usually faster
  – In the first case:
    – Maps send each value to the reduce
    – Single reduce counts them all
  – In the second case:
    – Maps split up the values to many reduces
    – Each reduce generates its list
    – Final job counts the size of each list
  – Singleton reduces are almost always BAD
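The two plans can be compared with a small simulation. It is only a model of the data flow described above: map outputs are plain lists, reduces are Python sets, and the function names are made up for the example.

```python
def count_distinct_single_reduce(map_outputs):
    """First plan: every value funnels into one reduce."""
    seen = set()
    for values in map_outputs:
        seen.update(values)  # all of the work lands on a single process
    return len(seen)


def count_distinct_two_stage(map_outputs, num_reduces):
    """Second plan: hash values across many reduces, then add up the sizes."""
    reduce_sets = [set() for _ in range(num_reduces)]
    for values in map_outputs:
        for value in values:
            # Equal values always hash to the same reduce, so no double counting
            reduce_sets[hash(value) % num_reduces].add(value)
    # Final job: sum the size of each reduce's distinct list
    return sum(len(s) for s in reduce_sets)


map_outputs = [[1, 2, 2, 3], [3, 4, 5], [5, 6, 1]]
single = count_distinct_single_reduce(map_outputs)
staged = count_distinct_two_stage(map_outputs, num_reduces=4)
print(single, staged)
```

Both plans return the same answer; the difference is that the second spreads the deduplication over many reduces instead of one.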
29.
Communication is Good!
• Hive doesn’t tell you what is wrong.
  – Expects you to know!
  – “Lucy, you have some ‘splaining to do!”
• Explain tool provides query plan
  – Filters on input
  – Numbers of jobs
  – Numbers of maps and reduces
  – What the jobs are sorting by
  – What directories they are reading or writing
30.
Blinded by Science
• The explanation tool is confusing.
  – It takes practice to understand.
  – It doesn’t include some critical details like partition pruning.
• Running the query makes things clearer!
  – Pay attention to the details
  – Look at JobConf and job history files
31.
Skew
• Skew is typical in real datasets.
• A user complained that his job was slow
  – He had 100 reduces
  – 98 of them finished fast
  – 2 ran really slow
• The key was a boolean…
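Why a boolean key produces exactly this picture can be seen by counting rows per reduce. The sketch assumes rows are routed by hashing the key modulo the number of reduces; the row counts are invented for illustration.

```python
from collections import Counter


def reduce_load(keys, num_reduces):
    """Count how many rows each reduce receives when routing by key hash."""
    return Counter(hash(key) % num_reduces for key in keys)


# Hypothetical dataset: one million rows keyed by a boolean flag
keys = [True] * 900_000 + [False] * 100_000
load = reduce_load(keys, num_reduces=100)
busy_reduces = len(load)
print(busy_reduces, dict(load))
```

With only two distinct key values, at most two reduces ever receive rows; the other 98 finish immediately with nothing to do.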
32.
Root Cause Analysis
• Ambari
  – Apache project building Hadoop installation and management tool
  – Provides metrics (Ganglia & Nagios)
  – Root Cause Analysis
    – Processes MapReduce job logs
    – Displays timing of each part of query plan
33.
Root Cause Analysis Screenshots
34.
Root Cause Analysis Screenshots
35.
Thank You!
Questions & Answers
@owen_omalley
36.
ORCFile - Comparison

                               RC File   Trevni   ORC File
Hive Type Model                N         N        Y
Separate complex columns       N         Y        Y
Splits found quickly           N         Y        Y
Default column group size      4MB       64MB*    250MB
Files per a bucket             1         >1       1
Store min, max, sum, count     N         N        Y
Versioned metadata             N         Y        Y
Run length data encoding       N         N        Y
Store strings in dictionary    N         N        Y
Store row count                N         Y        Y
Skip compressed blocks         N         N        Y
Store internal indexes         N         N        Y