1. An Introduction to
Apache HIVE
Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi
2. Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not
• Brief About Data Warehouse
An Introduction to Apache HIVE
2 of 31
3. Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References
4. Starting Note
• What is Apache Hive?
– Open source (very important!), so it is free
– A data warehouse system on Hadoop
– Provides HiveQL (an SQL-like query interface)
– Suitable for structured and semi-structured data
– Can work with different storage systems and file
formats
5. Starting Note- Contd.
• What is cool about Hive
– Lets users leverage MapReduce without writing
MapReduce, through the HiveQL interface.
• Some history
– Hive was created by Facebook!
– Netflix also contributes to its development.
– Amazon uses it in Amazon Elastic MapReduce
6. Starting Note- Contd.
• What Hive is not
– It does not use complex indexes, so it does not
respond within seconds!
– But it scales very well and works with data on
the petabyte scale
– It is not independent; its performance is tied to
Hadoop
7. Brief About Data Warehouse
• OLAP vs OLTP
– A DW is needed for OLAP
– We want reports and summaries, not the live
transaction data used to keep operations running
– We need reports to improve operations, not to
conduct them!
– We use ETL to populate data into the DW.
8. Brief About Data Warehouse
Inmon approach
vs
Kimball approach
10. Brief About Data Warehouse
• Other keywords
– ODS- Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs
11. Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server
13. Hive Architecture
• Architecture diagram (components recovered from the figure):
– Interfaces: Web UI, Hive CLI, JDBC/ODBC (via the Thrift API)
– Driver: Parser, Planner, Optimizer, Execution
– MetaStore
– Hive QL operations: Browse, Query, DDL
– UDF/UDAF: substr, sum, average
– SerDe: CSV, Thrift, Regex
– FileFormats: TextFile, SequenceFile, RCFile
– Execution layer: Map Reduce, user-defined map-reduce scripts, HDFS
14. Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an
execution plan.
• Optimizer
– Optimizes the execution plan before it runs.
• Execution Engine
– Runs the execution plan, which is guaranteed to be a
DAG.
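The DAG guarantee above can be checked mechanically with a topological sort; a minimal Python sketch (the plan stages are hypothetical, not Hive's actual operator names):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical execution plan: each stage maps to its prerequisites.
plan = {
    "scan": [],
    "filter": ["scan"],
    "aggregate": ["filter"],
    "write": ["aggregate"],
}

# static_order() raises CycleError if the graph is not a DAG,
# otherwise it yields the stages in a valid execution order.
order = list(TopologicalSorter(plan).static_order())
print(order)  # → ['scan', 'filter', 'aggregate', 'write']
```

Because the plan is acyclic, every stage can be scheduled after all of its inputs are ready.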
15. Hive Architecture- Contd.
• Hive Data Model
– Data in Hive is organized into:
• Databases
– The first level of abstraction.
• Tables
– Ordinary tables.
• Partitions
– Control how data is transferred to MR jobs.
• Buckets
– Facilitate data access within partitions.
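Partitioning as described above can be pictured as routing rows into per-key directories; a minimal Python sketch (table layout and directory naming are hypothetical, modeled on Hive's `key=value` path convention):

```python
from collections import defaultdict

# Hypothetical rows of a table (id, view_date, name);
# view_date acts as the partition key.
rows = [
    (1, "2013-10-01", "a"),
    (2, "2013-10-01", "b"),
    (3, "2013-10-02", "c"),
]

# Each distinct partition-key value gets its own directory,
# e.g. /warehouse/table_name/view_date=2013-10-01/
partitions = defaultdict(list)
for row in rows:
    partitions[f"view_date={row[1]}"].append(row)

print(sorted(partitions))  # → ['view_date=2013-10-01', 'view_date=2013-10-02']
```

A query filtered on the partition key then only has to read the matching directory instead of the whole table.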
16. Hive in Production
• Log processing
– Daily Report
– User Activity Measurement
• Data/Text mining
– Machine learning (Training Data)
• Business intelligence
– Advertising Delivery
– Spam Detection
17. Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By
– Data Insertion/Aggregation
18. HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string,
tags string)
• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
19. HQL- Samples
• Partition
create table table_name (
id int,
name string)
partitioned by (date string);
20. HQL- Samples
• SerDe
– User Table with
“id::gender::age::occupation::zipcode” format.
CREATE TABLE USER (id INT, gender STRING, age INT,
occupation STRING, zipcode INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
21. HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
– SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize data while
sending it to the reducers.
22. Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend - Community version
– Sqoop (SQL-to-Hadoop, Apache license)
– SyncSort – Not Free!
23. Hive Process- Contd.
– STP (Straight-Through Processing)
• Flume – Apache licensed
• Chukwa – part of the Apache Hadoop distribution
• Scribe – Facebook's solution for log processing
and aggregation.
24. Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count Errors per session
– Count Streams per day
– Ad-hoc queries like summaries (sum, max, min, …)
26. Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into
Hive every hour.
– The same job also runs every 24 hours to produce
summaries
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa provides non-stop log collection.
30. Further Reading
• Apache Drill
– A software framework that supports data-intensive,
distributed applications for interactive analysis of
large-scale datasets
• PIG
– A platform for creating and running MapReduce
programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
Hive is built on top of Hadoop so that queries can be run over big data. Hive was created at Facebook. The problem Facebook faced later became the problem of many other companies, and the performance and capabilities of RDBMSs and NoSQL systems for big data gradually faded. Reports started taking several minutes and sometimes hours; at times, running two reports concurrently caused a serious problem; systems gradually slowed down, got stuck, or went offline. Even after solving that, the need to access the data without getting entangled in MR became apparent: people had to be able to retrieve and use the data without mastering the complexities of MapReduce. Hadoop had no schema and was hard to work with. Not reusable; for complex jobs, multiple stages of Map/Reduce functions were required. Example: the problem of the Tehran province telecom company in publishing outage lists or changes from its database. Example: the 36-hour versus 24-second query. Example: Tavanir.
What is Hadoop? Free and open source: there is a difference between being open source and being free, and this is both. It is a data warehouse for Hadoop; it is an abstraction, an abstract system.
What is interesting about Hive is that it lets us use Hadoop and the big-data facilities without knowing map-reduce: we benefit from the scalable facilities while using a Query Language interface similar to classic SQL. Hive was open-sourced by Facebook in 2008 and placed under the Apache license.
Hadoop: Hive needs Hadoop as a base framework to operate.
Driver: Hive has its own drivers to communicate with the Hadoop world.
CLI: The Hive CLI is the console for firing Hive queries; it is used for operating on our data.
Web interface: Hive also provides a web interface to monitor/administer Hive jobs.
MetaStore: The Metastore stores all the structure information of the various tables/partitions in Hive (a database catalog).
Thrift Server: We can expose Hive as a service, which can then be connected to via JDBC/ODBC, etc.
UDF: User-Defined Functions (UDAF: User-Defined Aggregate Functions).
Directed acyclic graph: a directed graph with no directed cycles.
Partition: each table can have one or more partition keys, and the data is stored in files according to the partition key. Without partitions, the entire dataset is sent to MR; with partitions, what is sent to MR is managed. Bucket: the data within each partition is further grouped by hash values; this data is kept inside the same partition directory.
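The hash-based bucket assignment described above can be sketched in Python (a simplification: Hive uses its own hash function and writes one file per bucket, not Python's built-in `hash`):

```python
NUM_BUCKETS = 4

def bucket_for(key: int) -> int:
    # A row goes to the bucket given by hashing the bucketing
    # column and taking the remainder modulo the bucket count.
    return hash(key) % NUM_BUCKETS

# Hypothetical values of the bucketing column.
user_ids = [101, 102, 103, 104, 105]
for user_id in user_ids:
    print(user_id, "->", bucket_for(user_id))
```

Because the assignment is deterministic, rows with the same key always land in the same bucket file, which speeds up sampling and joins on that column.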
For working with complex data and complex, multi-character delimiters. Use case: log processing.
DISTRIBUTE BY + SORT BY = CLUSTER BY; similar to GROUP BY.
These are like log4j, except that they perform pre- and post-processing on the logs.
Drill: the design goal is for Drill to be able to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.