An Introduction to
Apache HIVE

Credits
By: Reza Ameri

Semester: Fall 2013

Course: DDB

Prof: Dr. Naderi
Agenda
• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not

• Brief About Data Warehouse

Agenda- Contd.
• Hive Architecture
– Components
– Architecture Diagram

• Hive in Production
– HQL
– Data Insertion/Aggregation

• Performance
• Further Reading
• References
Starting Note
• What is Apache Hive?
– Open source (very important!), so it is free
– Data warehouse system on Hadoop
– Provides HQL (HiveQL), a SQL-like query interface
– Suitable for structured and semi-structured data
– Can work with different storage systems and file
formats
Starting Note- Contd.
• What is cool about Hive
– Lets users run MapReduce (MR) jobs through the HiveQL
interface without having to think in MR terms.

• Some history
– Hive was created at Facebook.
– Netflix also contributes to its development.
– Amazon uses it in Amazon Elastic MapReduce.
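
To make the "MR without thinking in MR" point concrete, here is a minimal Python sketch of the map, shuffle, and reduce steps that a query such as SELECT tags, COUNT(*) FROM movies GROUP BY tags stands for. The sample rows and function names are illustrative assumptions; this is not how Hive itself is implemented.

```python
from collections import defaultdict

# Hypothetical sample rows: (movie_id, movie_name, tags)
MOVIES = [
    (1, "Toy Story", "animation"),
    (2, "Heat", "crime"),
    (3, "Up", "animation"),
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _movie_id, _name, tag in rows:
        yield tag, 1

def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values; COUNT(*) is a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}

def count_by_tag(rows):
    """Run the three phases end to end."""
    return reduce_phase(shuffle(map_phase(rows)))
```

With HiveQL the user writes only the one-line query; the three functions above are the kind of work that would otherwise have to be written by hand.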
Starting Note- Contd.
• What Hive is not
– It does not use complex indexes, so it does not respond
within seconds.
– But it scales very well and works with data on the
order of petabytes.
– It is not independent: its performance is tied to
Hadoop.

Brief About Data Warehouse
• OLAP vs OLTP
– A data warehouse (DW) is needed for OLAP.
– We want reports and summaries, not the live transaction
data needed to keep operations running.
– We need reports to improve operations, not to
conduct an operation!
– We use ETL (extract, transform, load) to populate the DW.
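
As a rough illustration of the ETL step, the Python sketch below extracts transaction rows, transforms them into one summary per day, and loads the result into an in-memory "warehouse". The row layout, the sample data, and the dictionary standing in for the DW are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical OLTP rows: one record per transaction (day, customer, amount).
TRANSACTIONS = [
    ("2013-10-01", "alice", 30.0),
    ("2013-10-01", "bob", 12.5),
    ("2013-10-02", "alice", 7.5),
]

def extract(source):
    """Extract: read the raw transaction rows from the operational system."""
    return list(source)

def transform(rows):
    """Transform: summarize the transactions into one total per day."""
    totals = defaultdict(float)
    for day, _customer, amount in rows:
        totals[day] += amount
    return dict(totals)

def load(warehouse, daily_totals):
    """Load: write the summaries into the (in-memory) warehouse table."""
    warehouse.update(daily_totals)
    return warehouse

def run_etl(source, warehouse):
    """Chain the three steps."""
    return load(warehouse, transform(extract(source)))
```

The warehouse ends up holding the daily summaries that reports need, while the individual transactions stay in the operational system.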

Brief About Data Warehouse

Inmon approach
vs
Kimball approach

Brief About Data Warehouse
• Other keywords
– ODS- Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs

Hive Architecture
• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server

Hive Architecture
• Web UI + Hive CLI + JDBC/ODBC
– Browse, Query, DDL
• Hive QL
– Parser
– Planner
– Optimizer
– Execution
• MetaStore
• Thrift API
• UDF/UDAF
– substr, sum, average
• SerDe
– CSV, Thrift, Regex
• FileFormats
– TextFile, SequenceFile, RCFile
• User-defined Map-reduce Scripts
• Map Reduce
• HDFS

Hive Architecture- Contd.
– Internal Components
• Compiler and Planner
– Compiles and checks the input query and creates an
execution plan.

• Optimizer
– Optimizes the execution plan before it runs.

• Execution Engine
– Runs the execution plan. The execution plan is guaranteed
to be a DAG (directed acyclic graph).
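
Because the plan is a DAG, its stages can always be put in an order where each stage runs after the stages it depends on. The Python sketch below shows this with the standard-library graphlib module (Python 3.9+). The stage names and the PLAN dictionary are hypothetical and do not come from Hive.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each stage maps to the set of stages it depends on.
PLAN = {
    "scan_movies": set(),
    "scan_ratings": set(),
    "join": {"scan_movies", "scan_ratings"},
    "aggregate": {"join"},
}

def execution_order(plan):
    """Return the stages in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the plan contains
    a cycle, which is why the DAG guarantee matters.
    """
    return list(TopologicalSorter(plan).static_order())
```

Any order this returns places both scans before the join and the join before the aggregation.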

Hive Architecture- Contd.
• Hive Data Model
– All data in Hive is organized into
• Databases
– The first level of abstraction.

• Tables
– Ordinary tables.

• Partitions
– Manage which data is sent to MR.

• Buckets
– Facilitate data access within partitions.
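
The following Python sketch models the partition and bucket levels of this data model: rows are placed in a directory per partition key value, and within a partition they are grouped into buckets by a hash of a column. The dt=... directory naming, the bucket count, and the value-mod-n hash are simplifying assumptions of the sketch.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative assumption

def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Pick a bucket from the bucketing column (simplified hash: value mod n)."""
    return user_id % num_buckets

def store(rows):
    """Lay rows out as partition directory -> bucket number -> rows."""
    layout = defaultdict(lambda: defaultdict(list))
    for dt, user_id, name in rows:
        layout[f"dt={dt}"][bucket_for(user_id)].append((user_id, name))
    return layout

def scan_partition(layout, dt):
    """Read only the requested partition; other partitions are never touched."""
    buckets = layout.get(f"dt={dt}", {})
    return sorted(row for rows in buckets.values() for row in rows)

# Hypothetical rows: (partition key, bucketing column, payload)
ROWS = [
    ("2013-10-01", 1, "ann"),
    ("2013-10-01", 6, "bob"),
    ("2013-10-02", 5, "cyd"),
]
```

A query that filters on the partition key only needs scan_partition for the matching directory, which is how partitioning limits the data handed to MR.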

Hive in Production
• Log processing
– Daily Report
– User Activity Measurement

• Data/Text mining
– Machine learning (Training Data)

• Business intelligence
– Advertising Delivery
– Spam Detection
Hive in Production
– HQL
• Create
• Row Format
• SerDe
• Select
• Cluster By/Distribute By

– Data Insertion/Aggregation

HQL- Samples
• CREATE TABLE
CREATE TABLE movies (movie_id int, movie_name string, tags string)

• ROW FORMAT
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
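
To show what a delimited row format means for a line of text, this Python sketch splits a line on the ':' delimiter and casts each field to the column types of the movies table. The sample line and the strict field-count check are assumptions of the sketch, not a description of Hive's behavior.

```python
# Column names and Python types mirroring the movies table definition.
SCHEMA = [("movie_id", int), ("movie_name", str), ("tags", str)]

def parse_row(line, delimiter=":"):
    """Split one text line on the field delimiter and cast each field."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(SCHEMA):
        # Strictness is a choice of this sketch.
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}
```

For example, parse_row("1:Toy Story:animation") yields one record with an integer movie_id and two string columns.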

HQL- Samples
• Partition
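create table table_name (
id int,
name string)
partitioned by (dt string);
-- The partition column is declared only in PARTITIONED BY; Hive rejects
-- a column that is repeated in both lists. It is named dt here because
-- date is also the name of a Hive data type.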

HQL- Samples
• SerDe
– User Table with
“id::gender::age::occupation::zipcode” format.
CREATE TABLE USER (id INT, gender STRING, age INT,
occupation STRING, zipcode INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
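
The idea behind a regex-based SerDe can be sketched in Python: each capture group of the pattern becomes one column of the row. The pattern is the one from the table definition above; the column list, the sample line, and the choice to return None for non-matching lines are assumptions of this sketch, and all values are left as strings.

```python
import re

# Same pattern as the "input.regex" property: five groups separated by "::".
INPUT_REGEX = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")
COLUMNS = ["id", "gender", "age", "occupation", "zipcode"]

def deserialize(line):
    """Map the regex capture groups of one line onto the table's columns."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return None  # this sketch returns None for lines that do not match
    return dict(zip(COLUMNS, match.groups()))
```

Because the separator is two characters long, a single-character field delimiter would not be enough here, which is the case a regex pattern handles.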
HQL- Samples
• Select
SELECT * FROM movies LIMIT 10;

• Distribute By
– SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize the data while it is
sent to the reducers: rows with the same value of that
column go to the same reducer.
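
A small Python sketch of the routing idea behind DISTRIBUTE BY: every row is sent to a reducer chosen from a hash of the selected column, so equal values always land together. Sorting each reducer's rows on the same column afterwards gives the combined effect of DISTRIBUTE BY plus SORT BY. The CRC32 hash, the reducer count, and the function names are assumptions of this sketch.

```python
import zlib
from collections import defaultdict

def reducer_for(key, num_reducers):
    """Deterministically map a key to a reducer (CRC32 is an assumption here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def distribute_by(rows, key_index, num_reducers):
    """Route every row to a reducer chosen from its key column."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[reducer_for(row[key_index], num_reducers)].append(row)
    return reducers

def cluster_by(rows, key_index, num_reducers):
    """DISTRIBUTE BY followed by a per-reducer sort on the same column."""
    reducers = distribute_by(rows, key_index, num_reducers)
    return {
        r: sorted(part, key=lambda row: row[key_index])
        for r, part in reducers.items()
    }
```

Which reducer a given tag ends up on depends on the hash, but all rows sharing a tag are guaranteed to share a reducer.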

Hive Process
• Data Insertion/Aggregation
– Bulk
• ETL
– Talend - Community version
– Sqoop (SQL-to-Hadoop, Apache licensed)
– Syncsort – Not free!

Hive Process- Contd.
– STP (Straight-Through Processing)
• Flume – Apache licensed
• Chukwa – part of the Apache Hadoop distribution
• Scribe – Facebook's solution for log processing
and aggregation.

Hive Process- Contd.
• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count errors per session
– Count streams per day
– Ad-hoc queries such as summaries (sum, max, min, …)
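
The two counts in this case study can be sketched in Python over already-parsed log events. The event layout (day, session id, event type), the event type names, and the sample data are hypothetical; the slides do not describe the actual log schema.

```python
from collections import Counter

# Hypothetical parsed log events: (day, session_id, event_type).
EVENTS = [
    ("2013-10-01", "s1", "stream_start"),
    ("2013-10-01", "s1", "error"),
    ("2013-10-01", "s2", "stream_start"),
    ("2013-10-02", "s3", "error"),
    ("2013-10-02", "s3", "error"),
]

def errors_per_session(events):
    """Count error events for each session."""
    return Counter(session for _day, session, kind in events if kind == "error")

def streams_per_day(events):
    """Count started streams for each day."""
    return Counter(day for day, _session, kind in events if kind == "stream_start")
```

Each function corresponds to a grouped count, the kind of aggregation that would be expressed as a single query in Hive.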

Hive Process- Contd.
• Phase 1
– A Hadoop job parses the logs and loads them into Hive
every hour.
– The same job should also run every 24 hours to produce
a summary.

• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa collects logs continuously.

Performance
• According to Globant's investigations
• Tables: [shown as images on the original slides]

Further Reading
• Apache Drill
– A software framework that supports data-intensive, distributed
applications for interactive analysis of large-scale datasets

• Pig
– A platform for creating and running MR programs on Hadoop

• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE
References
• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import For Hadoop, Cloudera, Oct. 2009
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft SQL Server 2012 Programming, Paul Atkinson and
Robert Vieira, Wiley, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook
team, 2009
Thanks…



Editor's Notes

  1. Hive was built on Hadoop so that Big Data could be queried. Hive was created at Facebook. The problem Facebook faced later became a problem for many other companies as well, and the performance and capabilities of RDBMSs and NoSQL stores gradually faded on big data. Reports began to take several minutes, and sometimes hours. Sometimes two reports running at the same time caused a major problem. Systems gradually slowed down, hung, or became unavailable. Once that problem was solved, the need to access data without getting involved in MR also became apparent: people had to retrieve and use data without mastering the complexities of MapReduce. Hadoop had no schema and was hard to work with. Not reusable. For complex jobs: multiple stages of Map/Reduce functions. Example: the problem of the Tehran Province Telecommunication Company in announcing its list of outages or changes in its database. Example: the 36-hour query versus the 24-second query. Example: Tavanir.
  2. What is Hive? Free and open source. There is a difference between open source and free; Hive is both. It is a data warehouse for Hadoop. It is an abstraction, an abstract system.
  3. What is interesting about Hive is that it lets us use Hadoop and big data capabilities without knowing MapReduce. We benefit from scalability while using a query language interface similar to traditional SQL. Hive was open-sourced by Facebook in 2008 and released under the Apache license.
  4. OLAP: online analytical processing. OLTP: online transaction processing.
  5. Hadoop: Hive needs Hadoop as a base framework to operate. Driver: Hive has its own drivers to communicate with the Hadoop world. CLI: the Hive CLI is the console for running Hive queries and operating on our data. Web interface: Hive also provides a web interface to monitor and administer Hive jobs. Metastore: stores all the structure information of the various tables and partitions in Hive (the database catalog). Thrift Server: exposes Hive as a service that clients can connect to via JDBC/ODBC, etc.
  6. UDF: user-defined function.
  7. Directed acyclic graph (DAG): a directed graph with no directed cycles.
  8. Partition: each table can have one or more partition keys. Data is stored in files according to the partition key. Without partitions, all of the data is sent to MR; with partitions, the data sent to MR is managed. Bucket: the data of each partition is further grouped by hash values. This data is kept in the same partition directory.
  9. For working with complex data and complex, multi-character delimiters. Use case: log processing.
  10. DISTRIBUTE BY + SORT BY = CLUSTER BY. Similar to GROUP BY.
  11. These are like log4j, except that they pre-process and post-process the logs.
  12. Drill: the design goal is for Drill to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.