Hive: Data Warehousing for Hadoop
1. Hive: Data Warehousing for Hadoop
Ben Lever
@bmlever
Big Data Analytics Meetup
27 March 2012
2. Another Data Warehousing System?
• Problem:
– Lots of data
• Partial solution:
– Hadoop
• Another problem:
– MapReduce can be hard
– Schema information ends up embedded in the program, yet a lot of data is still structured
3. Solution: Hive
• A system for querying and managing
structured data within Hadoop
– MapReduce for execution
– HDFS for storage
• Designed for end-users that know more SQL
than Java
• Licensed under the Apache License v2
• hive.apache.org
4. Working example: MovieLens
• Movie ratings
• 3 “tables”:
Users        Movies         Ratings
id           id             user id
age          title          movie id
gender       release date   rating (1 – 5)
occupation   action         timestamp
zip code     adventure
             romance
             ...
www.grouplens.org
6. So far
• Hive shell
• Creating and loading tables
• Data model:
– INT, BIGINT, TINYINT, STRING, etc
– Also: FLOAT, DOUBLE, ARRAY, MAP, STRUCT
• Simple queries with filtering
• Table data is immutable
• Schema on read vs schema on write
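The steps recapped above can be sketched in HiveQL roughly as follows. This is a minimal illustration rather than the demo from the talk: the table and column names follow the MovieLens slide, the `|` delimiter assumes the pipe-separated user file shipped with the MovieLens 100k data set, and the local path is hypothetical.

```sql
-- Users table with the columns listed on the MovieLens slide.
-- Assumption: the source file is pipe-delimited plain text.
CREATE TABLE users (
  id         INT,
  age        INT,
  gender     STRING,
  occupation STRING,
  zip_code   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Hypothetical local path. LOAD DATA only places the file in the table's
-- directory; the schema is applied later, when the data is read.
LOAD DATA LOCAL INPATH '/tmp/ml-100k/u.user'
OVERWRITE INTO TABLE users;

-- Simple query with filtering.
SELECT id, age, occupation
FROM users
WHERE gender = 'F' AND age < 30
LIMIT 10;
```

Because loading does not parse or validate the file, a mismatch between the declared schema and the file contents only shows up at query time, typically as NULL values. That is the practical meaning of schema on read.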
7. Hive components
• CLI: submits the Hive query (SQL-like)
– SELECT * FROM customers WHERE gender = 'M';
• Metastore: schema info
– TABLE customer ( customer_id BIGINT, gender STRING, ...
• Driver: launches the MapReduce job
• HDFS: raw source data (compressed)
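To see these components at work from the Hive shell, something like the following can help. It is only a sketch and assumes a `customers` table, as in the slide's query, has already been created.

```sql
-- Ask the metastore for the schema info it holds for the table.
DESCRIBE customers;

-- Show the plan that would be compiled for the query,
-- including any MapReduce stages, without running it.
EXPLAIN
SELECT *
FROM customers
WHERE gender = 'M';
```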
11. Built in functions
• Text mining:
– ngrams()
– context_ngrams()
– sentences()
• Statistics + mathematics:
– stddev()
– histogram_numeric()
– log()
– radians()
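As a rough illustration of how these built-ins are called, the queries below assume `ratings` and `movies` tables shaped like the MovieLens slide (a numeric `rating` column and a string `title` column). The bin count and n-gram parameters are arbitrary choices for the example.

```sql
-- Statistics: mean, standard deviation and a 5-bin histogram of ratings.
SELECT avg(rating),
       stddev(rating),
       histogram_numeric(rating, 5)
FROM ratings;

-- Text mining: the 10 most frequent bigrams across movie titles.
-- sentences() tokenises each title into arrays of words for ngrams().
SELECT ngrams(sentences(lower(title)), 2, 10)
FROM movies;

-- Mathematics: base-10 logarithm and degrees-to-radians conversion.
SELECT log(10, rating), radians(180)
FROM ratings
LIMIT 1;
```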
12. User Defined Functions
• Written in Java
• User Defined Functions (UDFs):
– Single row → Single row
– e.g. mathematical and string functions
• User Defined Aggregate Functions (UDAFs):
– Multiple rows → Single row
– e.g. AVG
• User Defined Table Functions (UDTFs):
– Single row → Multiple rows
– e.g. “explode”
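The three kinds of function differ in how many rows go in and come out, which built-in functions can illustrate without writing any Java. The sketch below assumes `movies` and `ratings` tables as on the MovieLens slide. The jar path, function name and class name in the last part are hypothetical placeholders showing how a custom Java UDF would typically be registered.

```sql
-- UDF: one row in, one row out.
SELECT upper(title)
FROM movies;

-- UDAF: many rows in, one row out (per group).
SELECT movie_id, avg(rating)
FROM ratings
GROUP BY movie_id;

-- UDTF: one row in, many rows out (one row per word of each title).
SELECT explode(split(title, ' ')) AS word
FROM movies;

-- Registering a custom UDF written in Java.
-- Hypothetical jar path, function name and class name.
ADD JAR /tmp/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION my_normalise AS 'com.example.hive.udf.Normalise';
SELECT my_normalise(title)
FROM movies;
```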
17. Conclusion
• Scales to handle much more data than traditional
systems:
– Leverages Hadoop HDFS and MapReduce
– Relational/structured data
– Schema on read vs schema on write
• Supports rapid iteration of ad-hoc queries
– SQL-like querying language
– Complex queries (joins, etc) with minimal code
• Is not a database replacement:
– Treats data as immutable
– No indexing
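As an example of a complex query with minimal code, the sketch below joins ratings to movies and ranks titles by average rating. It assumes the tables and column names from the MovieLens slide, and the threshold of 100 ratings is an arbitrary cut-off chosen for illustration.

```sql
-- Top 10 movies by average rating, ignoring rarely rated titles.
SELECT m.title,
       avg(r.rating) AS avg_rating,
       count(*)      AS num_ratings
FROM ratings r
JOIN movies m ON (r.movie_id = m.id)
GROUP BY m.title
HAVING count(*) >= 100
ORDER BY avg_rating DESC
LIMIT 10;
```

Hive compiles the join, aggregation and sort into MapReduce jobs, which would otherwise have to be written by hand.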
Editor's notes
# of users = 943
# of movies = 1682
# of ratings = 100,000