1. An Introduction to
Apache HIVE
Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi
2. Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not
• Brief About Data Warehouse
An Introduction to Apache HIVE
2 of 31
3. Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References
4. Starting Note
• What is Apache Hive?
– Open source (very important!), so it is free
– A data warehouse system on Hadoop
– Provides HiveQL (an SQL-like query interface)
– Suitable for structured and semi-structured data
– Can work with different storage systems and file
formats
5. Starting Note- Contd.
• What is cool about Hive
– Lets users leverage MapReduce without writing
MapReduce, through the HiveQL interface.
• Some history
– Hive was created by Facebook!
– Netflix also contributes to its development.
– Amazon uses it in Amazon Elastic MapReduce
6. Starting Note- Contd.
• What Hive is not
– It does not use complex indexes, so it does not
respond within seconds!
– But it scales very well and works with data on
the petabyte scale
– It is not independent; its performance is tied to
Hadoop
7. Brief About Data Warehouse
• OLAP vs OLTP
– A DW is needed for OLAP
– We want reports and summaries, not the live
transaction data used to keep operations running
– We need reports to improve operations, not to
conduct them!
– We use ETL to populate data into the DW.
8. Brief About Data Warehouse
Inmon approach
vs
Kimball approach
10. Brief About Data Warehouse
• Other keywords
– ODS- Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs
11. Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server
13. Hive Architecture
• Architecture diagram (components recovered from the figure):
– Interfaces: Web UI, Hive CLI, JDBC/ODBC (via the Thrift API)
– Driver: Parser, Planner, Optimizer, Execution
– MetaStore
– Hive QL operations: Browse, Query, DDL
– UDF/UDAF: substr, sum, average
– SerDe: CSV, Thrift, Regex
– FileFormats: TextFile, SequenceFile, RCFile
– Execution layer: Map Reduce, user-defined map-reduce scripts, HDFS
14. Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an
execution plan.
• Optimizer
– Optimizes the execution plan before it runs.
• Execution Engine
– Runs the execution plan, which is guaranteed to be a
DAG.
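The DAG guarantee above can be checked mechanically with a topological sort; a minimal Python sketch (the plan stages are hypothetical, not Hive's actual operator names):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical execution plan: each stage maps to its prerequisites.
plan = {
    "scan": [],
    "filter": ["scan"],
    "aggregate": ["filter"],
    "write": ["aggregate"],
}

# static_order() raises CycleError if the graph is not a DAG,
# otherwise it yields the stages in a valid execution order.
order = list(TopologicalSorter(plan).static_order())
print(order)  # → ['scan', 'filter', 'aggregate', 'write']
```

Because the plan is acyclic, every stage can be scheduled after all of its inputs are ready.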
15. Hive Architecture- Contd.
• Hive Data Model
– Data in Hive is organized into:
• Databases
– The first level of abstraction.
• Tables
– Ordinary tables.
• Partitions
– Control how data is transferred to MR jobs.
• Buckets
– Facilitate data access within partitions.
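Partitioning as described above can be pictured as routing rows into per-key directories; a minimal Python sketch (table layout and directory naming are hypothetical, modeled on Hive's `key=value` path convention):

```python
from collections import defaultdict

# Hypothetical rows of a table (id, view_date, name);
# view_date acts as the partition key.
rows = [
    (1, "2013-10-01", "a"),
    (2, "2013-10-01", "b"),
    (3, "2013-10-02", "c"),
]

# Each distinct partition-key value gets its own directory,
# e.g. /warehouse/table_name/view_date=2013-10-01/
partitions = defaultdict(list)
for row in rows:
    partitions[f"view_date={row[1]}"].append(row)

print(sorted(partitions))  # → ['view_date=2013-10-01', 'view_date=2013-10-02']
```

A query filtered on the partition key then only has to read the matching directory instead of the whole table.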
16. Hive in Production
• Log processing
– Daily Report
– User Activity Measurement
• Data/Text mining
– Machine learning (Training Data)
• Business intelligence
– Advertising Delivery
– Spam Detection
17. Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By
– Data Insertion/Aggregation
18. HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string,
tags string)
• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
19. HQL- Samples
• Partition
create table table_name (
id int,
name string)
partitioned by (date string);
20. HQL- Samples
• SerDe
– User Table with
“id::gender::age::occupation::zipcode” format.
CREATE TABLE USER (id INT, gender STRING, age INT,
occupation STRING, zipcode INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
21. HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
– SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize data while
sending it to the reducers.
22. Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend - Community version
– Sqoop (SQL-to-Hadoop, Apache license)
– SyncSort – Not Free!
23. Hive Process- Contd.
– STP (Straight-Through Processing)
• Flume – Apache licensed
• Chukwa – part of the Apache Hadoop distribution
• Scribe – Facebook's solution for log processing
and aggregation.
24. Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count Errors per session
– Count Streams per day
– Ad-hoc queries like summaries (sum, max, min, …)
26. Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into
Hive every hour.
– The same job also runs every 24 hours to produce
summaries
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa provides non-stop log collection.
30. Further Reading
• Apache Drill
– A software framework that supports data-intensive,
distributed applications for interactive analysis of
large-scale datasets
• PIG
– A platform for creating and running MapReduce
programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
Hive is built on top of Hadoop so that queries can be run over big data. Hive was created at Facebook. The problem Facebook faced later became the problem of many other companies, and the performance and capabilities of RDBMSs and NoSQL systems for big data gradually faded. Reports started taking several minutes and sometimes hours; at times, running two reports concurrently caused a serious problem; systems gradually slowed down, got stuck, or went offline. Even after solving that, the need to access the data without getting entangled in MR became apparent: people had to be able to retrieve and use the data without mastering the complexities of MapReduce. Hadoop had no schema and was hard to work with. Not reusable; for complex jobs, multiple stages of Map/Reduce functions were required. Example: the problem of the Tehran province telecom company in publishing outage lists or changes from its database. Example: the 36-hour versus 24-second query. Example: Tavanir.
What is Hadoop? Free and open source: there is a difference between being open source and being free, and this is both. It is a data warehouse for Hadoop; it is an abstraction, an abstract system.
What is interesting about Hive is that it lets us use Hadoop and the big-data facilities without knowing map-reduce: we benefit from the scalable facilities while using a Query Language interface similar to classic SQL. Hive was open-sourced by Facebook in 2008 and placed under the Apache license.
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own drivers to communicate with the Hadoop world.
CLI: The Hive CLI is the console for firing Hive queries; it is used for operating on our data.
Web interface: Hive also provides a web interface to monitor/administer Hive jobs.
MetaStore: The Metastore stores all the structure information of the various tables/partitions in Hive (a database catalog).
Thrift Server: We can expose Hive as a service, which can then be connected to via JDBC/ODBC, etc.
UDF: User-Defined Functions (UDAF: User-Defined Aggregate Functions).
Directed acyclic graph: a directed graph with no directed cycles.
Partition: each table can have one or more partition keys, and the data is stored in files according to the partition key. Without partitions, the entire dataset is sent to MR; with partitions, what is sent to MR is managed. Bucket: the data within each partition is further grouped by hash values; this data is kept inside the same partition directory.
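The hash-based bucket assignment described above can be sketched in Python (a simplification: Hive uses its own hash function and writes one file per bucket, not Python's built-in `hash`):

```python
NUM_BUCKETS = 4

def bucket_for(key: int) -> int:
    # A row goes to the bucket given by hashing the bucketing
    # column and taking the remainder modulo the bucket count.
    return hash(key) % NUM_BUCKETS

# Hypothetical values of the bucketing column.
user_ids = [101, 102, 103, 104, 105]
for user_id in user_ids:
    print(user_id, "->", bucket_for(user_id))
```

Because the assignment is deterministic, rows with the same key always land in the same bucket file, which speeds up sampling and joins on that column.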
For working with complex data and complex, multi-character delimiters. Use case: log processing.
DISTRIBUTE BY + SORT BY = CLUSTER BY; similar to GROUP BY.
These are like log4j, except that they perform pre- and post-processing on the logs.
Drill: the design goal is for Drill to be able to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.