H104: Harnessing the Hadoop Ecosystem
Optimizations in Apache Hive
Jason Huang, Senior Solutions Architect – Qubole, Inc.
May 12, 2015
NYC Data Summit Hadoop Day
A little bit about Qubole
Ashish Thusoo, Founder & CEO
Joydeep Sen Sarma, Founder & CTO
Founded in 2011 by the pioneers of “big data” @ Facebook and the creators of the Apache Hive Project.
Based in Mountain View, CA, with offices in Bangalore, India. Investments by Charles River, LightSpeed, and others.
2015 CNBC Disruptor 50 Companies – announced today!
World-class product and engineering team.
Hive – SQL on Hadoop
● A system for managing and querying unstructured data as if it were structured
● Uses Map-Reduce for execution
● HDFS for Storage (or Amazon S3)
● Key Building Principles
● SQL as a familiar data warehousing tool
● Extensibility (Pluggable map/reduce scripts in the language of your
choice, Rich and User Defined Data Types, User Defined Functions)
● Interoperability (Extensible framework to support different file and data formats)
● Problem: Unlimited data
● Terabytes every day
● Wide adoption of Hadoop
● But Hadoop can be…
● A different paradigm
● Map-Reduce is hard to program
Qubole DataFlow Diagram
[Diagram: data flow between the Qubole UI and the customer’s AWS account, with S3 server-side encryption]
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
Normalization:
- models data tables with rules that reduce redundancy
- creates multiple relational tables
- requires joins at runtime to produce results
Joins are expensive and difficult operations to perform and are one of the
common reasons for performance issues. Because of this, it’s a good idea
to avoid highly normalized table structures because they require join
queries to derive the desired metrics.
Hive partitioning is an effective method to improve query performance on larger tables. The partition key is best chosen as a low-cardinality attribute.
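For example (hypothetical table, with the low-cardinality key event_date as the partition column):
-- Partition on a low-cardinality key so queries scan only the matching partitions.
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING
)
PARTITIONED BY (event_date STRING);
-- A filter on the partition key prunes partitions instead of scanning the whole table.
SELECT COUNT(*) FROM page_views WHERE event_date = '2015-05-12';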
Bucketing:
- improves join performance if the bucket key and join keys are common
- distributes the data in different buckets based on the hash results on
the bucket key
- Reduces I/O scans during the join process if the process is happening
on the same keys (columns)
Note: set the bucketing flag (hive.enforce.bucketing) each time before writing data to a bucketed table.
To leverage the bucketing in the join operation we should set
hive.optimize.bucketmapjoin=true. This setting hints to Hive to do
bucket level join during the map stage join.
Really efficient if a table on the other side of the join is small enough to fit in memory.
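Putting these settings together, a sketch with hypothetical tables bucketed on the join key:
-- Bucket both tables on the join key (user_id).
CREATE TABLE users (user_id BIGINT, name STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
CREATE TABLE clicks (user_id BIGINT, url STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
-- Set before any INSERT into the bucketed tables (per the note above).
set hive.enforce.bucketing=true;
-- Hint Hive to join bucket-by-bucket during the map stage.
set hive.optimize.bucketmapjoin=true;
SELECT c.url, u.name
FROM clicks c JOIN users u ON c.user_id = u.user_id;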
File Input Formats:
- play a critical role in Hive performance
E.g., JSON and other text-based input formats:
- not a good choice for a large production system where data volume is high
- readable formats take a lot of space and have parsing overhead (e.g., JSON parsing)
To address these problems, Hive comes with columnar input formats like
RCFile, ORC etc. Columnar formats reduce read operations in queries by
allowing each column to be accessed individually.
Other binary formats like Avro, sequence files, Thrift can be effective in
various use cases.
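A minimal sketch of declaring an ORC table (the table and column names are hypothetical):
-- Store the table in the columnar ORC format so queries read only the columns they need.
CREATE TABLE events_orc (
  event_id BIGINT,
  event_type STRING,
  payload STRING
)
STORED AS ORC;
-- Populate it from an existing text-format table (assumed name).
INSERT OVERWRITE TABLE events_orc
SELECT event_id, event_type, payload FROM events_text;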
Compress map/reduce output:
- reduce the intermediate data volume
- reduces the amount of data transfers between mappers and reducers
over the network
Note: gzip compressed files are not splittable – so apply with caution
File size should not be larger than a few hundred megabytes
- otherwise it can potentially lead to an imbalanced job
- compression codec options: e.g. snappy, lzo, bzip, etc.
For map output compression: set mapred.compress.map.output=true
For job output compression: set mapred.output.compress=true
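A companion sketch for choosing the codec (the older mapred.* property names are assumed to match the settings above; Snappy is just one option):
-- Pick a codec for the compressed map output and job output (Snappy shown).
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;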
Hadoop can execute MapReduce jobs in parallel, and several queries
executed on Hive automatically use this parallelism.
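One related setting, assumed here rather than taken from the slide, lets independent stages of a single query run as concurrent MapReduce jobs:
-- Assumed example: run independent stages of one query concurrently.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- cap on concurrent jobs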
Vectorized query execution:
- allows Hive to process a batch of rows in ORC format together instead of processing one row at a time
Each batch consists of a column vector which is usually an array of
primitive types. Operations are performed on the entire column vector,
which improves the instruction pipelines and cache usage.
To enable: set hive.vectorized.execution.enabled=true
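A minimal usage sketch (events_orc is the hypothetical ORC table from the earlier example):
-- Vectorized execution processes ORC rows in batches (1024 rows by default).
set hive.vectorized.execution.enabled=true;
SELECT event_type, COUNT(*) FROM events_orc GROUP BY event_type;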
Sampling:
- allows users to take a subset of a dataset and analyze it, without having to analyze the entire data set
Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables.
TABLESAMPLE can sample at various granularity levels
- return only subsets of buckets (bucket sampling)
- HDFS blocks (block sampling)
- first N records from each input split
Alternatively, you can implement your own UDF that filters out records
according to your sampling algorithm.
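Illustrative TABLESAMPLE forms for the granularities listed above (the table name is hypothetical):
-- Bucket sampling: keep roughly 1 of 32 hash buckets (hashed on rand()).
SELECT * FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON rand()) s;
-- Block sampling: read roughly 0.1 percent of the input blocks.
SELECT * FROM clicks TABLESAMPLE (0.1 PERCENT) s;
-- First N rows from each input split.
SELECT * FROM clicks TABLESAMPLE (10 ROWS) s;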
Unit testing:
- In Hive, you can unit test UDFs, SerDes, streaming scripts, and Hive queries
- Verify the correctness of your whole HiveQL query without touching a Hadoop cluster
- Executing a HiveQL query in local mode takes literally seconds, compared to minutes, hours, or days in Hadoop mode.
Various tools are available, e.g., HiveRunner, Hive_test, and Beetest.
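One way to keep test queries fast is Hive's local mode; a sketch, with the hive.exec.mode.local.auto settings assumed rather than taken from the slide and the thresholds purely illustrative:
-- Let small qualifying queries run locally instead of being submitted to the cluster.
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB input cap
set hive.exec.mode.local.auto.input.files.max=4;         -- max input files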
“Qubole has enabled more users within Pinterest to get to the data and has made the data platform a lot more scalable and …”
Mohammad Shahangian - Lead, Data Science and Infrastructure, Pinterest
Moved to Qubole from Amazon EMR because
of stability and rapidly expanded big data usage by
giving access to data to users beyond developers.
Rapid expansion of big data beyond developers (240 users out of a 600-person company)
[Chart: User and Query Growth]
Use Cases:
Rapid expansion in use cases ranging from ETL, search, ad hoc querying, product analytics, etc.
Rock-solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce
Enterprise-scale processing and data access
“We needed something that was reliable and easy to learn, set up, use, and put into production without the risk and high expectations that come with committing millions of dollars in upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data on the cloud (from internal Oracle clusters) because getting to analysis was much quicker than operating the infrastructure themselves.
Used to answer client queries and power client …
[Chart: # Commands Per Month (number of queries)]
Use Cases:
Segment audiences based on their behavior including such
topics as user pathway and multi-dimensional recency
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
Links for more information