Hive
Prepared by Vetri.V
What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop
compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data
using a SQL-like language called HiveQL.
At the same time this language also allows traditional map/reduce programmers to
plug in their custom mappers and reducers when it is inconvenient or inefficient to
express this logic in HiveQL.
Hive is an abstraction on top of MapReduce: it allows users to query data in the
Hadoop cluster without knowing Java or MapReduce. It uses the HiveQL language,
which is very similar to SQL.
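For instance, a simple aggregation that would otherwise require a hand-written MapReduce job can be expressed in a few lines of HiveQL (the page_views table here is hypothetical, shown only to illustrate the idea):

```sql
-- page_views is a hypothetical table; Hive compiles this GROUP BY
-- into a MapReduce job behind the scenes, with no Java required.
SELECT userid, COUNT(*) AS visits
FROM page_views
GROUP BY userid;
```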
Some points about Hive:
Hive was originally developed at Facebook
Provides a very SQL-like language
Can be used by people who know SQL
Enabling Hive requires almost no extra work by the system administrator
Hive ‘layers’ table definitions on top of data in HDFS
Hive tables are stored in Hive’s ‘warehouse’ directory in HDFS; by default,
/user/hive/warehouse
Tables are stored in subdirectories of the warehouse directory
Actual data is stored in flat files: control-character-delimited text, or SequenceFiles
Hive is a data warehousing tool on top of Hadoop, and its queries look much like SQL:
SHOW TABLES, DESCRIBE, DROP TABLE
CREATE TABLE, ALTER TABLE
SELECT, INSERT
Hive Limitations:
Not all ‘standard’ SQL is supported
No support for UPDATE or DELETE
No support for INSERTing single rows
Relatively limited number of built-in functions
No datatypes for date or time; use the STRING datatype instead. (Newer Hive
versions add DATE and TIMESTAMP types.)
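Because single-row INSERT, UPDATE, and DELETE are unsupported, data is typically rewritten in bulk instead. A hedged sketch, assuming hypothetical logs and logs_clean tables with the date kept as a STRING:

```sql
-- No single-row INSERT or UPDATE: rewrite the whole table (or a
-- partition) with INSERT OVERWRITE ... SELECT instead.
INSERT OVERWRITE TABLE logs_clean
SELECT id, msg, log_date          -- log_date stored as STRING, e.g. '2013-05-01'
FROM logs
WHERE log_date >= '2013-01-01';   -- string comparison works for ISO-format dates
```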
Hive Architecture
Metastore:
Stores system catalog.
Driver:
Manages the life cycle of a HiveQL query as it moves through Hive; also manages the
session handle and session statistics
Query compiler:
Compiles HiveQL into a directed acyclic graph of map/reduce tasks
Execution engine:
Executes the tasks in proper dependency order; interacts with Hadoop
Hive Server:
Provides Thrift interface and JDBC/ODBC for integrating other applications.
Client components:
CLI, web interface, and JDBC/ODBC interface. Extensibility interfaces include SerDe,
User-Defined Functions (UDFs), and User-Defined Aggregate Functions (UDAFs).
Hive Installation and Metastore with MySQL:
Store Hive Metadata into RDBMS.
The Metastore stores all the information about the tables, their partitions, the
schemas, the columns and their types, the table locations etc.
This information can be queried or modified using a thrift interface and as a result
it can be called from clients in different programming languages.
As this information needs to be served fast to the compiler, they have chosen to
store this information on a traditional RDBMS.
By default, Hive uses an embedded Derby database to store metadata information.
The metastore is the "glue" between Hive and HDFS.
It tells Hive where your data files live in HDFS, what type of data they contain, what
tables they belong to, etc.
The Metastore is an application that runs on an RDBMS and uses an open source
ORM layer called DataNucleus, to convert object representations into a relational
schema and vice versa.
They chose this approach, as opposed to storing this information in HDFS, because
they need the Metastore to be very low latency. The DataNucleus layer allows them to
plug in many different RDBMS technologies.
Facebook uses MySQL to store metadata information.
Store Hive metadata into MySQL:
Add the following to /etc/profile:
export JAVA_HOME=/usr/java/jdk1.7.0_17
export PATH=$PATH:/usr/java/jdk1.7.0_17/bin
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:/opt/hadoop/bin
export HIVE_HOME=/opt/hive
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
Go to the Hadoop home directory and create the Hive working directories in HDFS:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
We need to perform the following steps to set up a metastore in a MySQL server.
1. Install mysql-server.
sudo apt-get install mysql-server
(or, on CentOS)
yum install mysql-server
2. Use the su command. At the password prompt, type your root user password.
3. If your password is correct, you'll see a prompt like root@yourdesktopname#.
4. Now, start the MySQL server:
/etc/init.d/mysql start
5. When the MySQL server has started, type mysql -u root mysql.
6. Next, create a new MySQL user for hadoop/hive:
mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'localhost' WITH
GRANT OPTION;
mysql> exit;
7. su - userName // In my case userName is hadoop, so switch to that user.
8. mysql -h localhost -u userName -p
9. Now we need to change the Hive configuration so it can use MySQL.
10. Go to the hive/conf directory, then open hive-default.xml and make the
following changes.
<!-- In local metastore setup, each Hive Client will open a connection to
the datastore and make SQL queries against it.-->
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>
jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>Username</value><!-- In my case UserName is hadoop-->
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>UserPassword</value><!-- In my case password is hadoop-->
</property>
(OR)
create hive-site.xml and paste the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Hive Execution Parameters -->
<!-- In local metastore setup, each Hive Client will open a connection to
the datastore and make SQL queries against it.-->
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value><!-- In my case UserName is hadoop-->
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value><!-- In my case password is hadoop-->
</property>
</configuration>
11. Hive needs the MySQL JDBC driver, so we need to download it and copy it to
the hive/lib folder.
12. If all settings are done correctly, we can do this:
impetus@ubuntu:~$ cd $HIVE_HOME
impetus@ubuntu:~/hive$ bin/hive
Hive history file=/tmp/ankit
/hive_job_log_ankit_201102211937_456962737.txt
hive>
hive> create table temp(info INT);
OK
Time taken: 4.745 seconds
hive> show tables;
OK
temp
Time taken: 0.137 seconds
hive>
13. Connect to the MySQL server again:
vetri@ubuntu:~$mysql -h localhost -u hadoop -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 84
Server version: 5.1.41-3ubuntu12.8 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
14. Use the following command to view all existing databases:
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| hive |
| mysql |
+--------------------+
3 rows in set (0.05 sec)
mysql> use hive; //Select database.
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql>
15. Use the following command to view all existing tables:
mysql> show tables;
+--------------------+
| Tables_in_hive |
+--------------------+
| BUCKETING_COLS |
| COLUMNS |
| DBS |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |
+--------------------+
16 rows in set (0.00 sec)
mysql>
16. Use the following command to view the metadata information of Hive table
columns:
mysql> select * from COLUMNS;
+-------+---------+-------------+-----------+-------------+
| SD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|     1 | NULL    | info        | int       |           0 |
+-------+---------+-------------+-----------+-------------+
1 row in set (0.00 sec)
mysql>
// where info is the column of the temp table
17. Similarly, you can view the information of partitions, buckets, tables, etc.
Hive Applications, Components, Model and Layout
Hive Applications:
Log processing
Text mining
Document indexing
Customer-facing business intelligence (e.g., Google Analytics)
Predictive modeling, hypothesis testing
Hive Components:
Shell: allows interactive queries like MySQL shell connected to database – Also
supports web and JDBC clients
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (M/R, HDFS, or metadata)
Metastore: schema, location in HDFS, SerDe
Data Model:
Tables – Typed columns (int, float, string, date, boolean) – Also, list and map (for
JSON-like data)
Partitions – e.g., to range-partition tables by date
Buckets – Hash partitions within ranges (useful for sampling, join optimization)
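A sketch of how partitions and buckets appear in DDL (the table and column names are hypothetical):

```sql
-- Partition by date string, hash-bucket each partition by userid.
CREATE TABLE page_views (userid INT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Buckets make sampling cheap, e.g. reading only 1 of the 32 buckets:
SELECT * FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```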
Metastore:
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Partition data
Uses the JPOX (DataNucleus) ORM for implementation; can be stored in Derby, MySQL,
and many other relational databases
Physical Layout:
Warehouse directory in HDFS – e.g., /user/hive/warehouse
Tables stored in subdirectories of warehouse – Partitions, buckets form
subdirectories of tables
Actual data stored in flat files – control-char-delimited text, or SequenceFiles – With a
custom SerDe, can use arbitrary formats
Hive Command Line Interface:
To launch the Hive shell, start a terminal and run $ hive
Note: example is the table name used in all the queries below.
hive>
Hive : Creating Tables
hive> CREATE TABLE example (id INT, name STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' STORED AS TEXTFILE;
hive> DESCRIBE example;
hive> SHOW TABLES;
Hive : Loading Data Into Hive
Data is loaded into Hive with the LOAD DATA INPATH statement – Assumes that the data is
already in HDFS
hive> LOAD DATA INPATH "file_txtdata.txt" INTO TABLE example;
If the data is on the local filesystem, use LOAD DATA LOCAL INPATH – Automatically loads
it into HDFS
hive> LOAD DATA LOCAL INPATH "file_txtdata.txt" INTO TABLE example;
Hive : SELECT Queries
Hive supports most familiar SELECT syntax
hive> SELECT * FROM example LIMIT 10;
hive> SELECT * FROM example WHERE id > 100 ORDER BY name ASC LIMIT 10;
Joining Tables
SELECT e.name, e.dep, s.id FROM example e JOIN sample s ON (e.dep = s.dep) WHERE e.id
>= 20;
Creating User-Defined Functions
INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
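The script named in the USING clause can be any executable that reads tab-separated rows on stdin and writes tab-separated rows to stdout. A minimal sketch of what weekday_mapper.py might look like (the column handling is an assumption based on the query above):

```python
import datetime
import sys

def map_line(line):
    # Split one tab-separated record and replace the unix timestamp
    # with the ISO weekday number (1 = Monday ... 7 = Sunday).
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

if __name__ == '__main__' and not sys.stdin.isatty():
    # Hive streams input rows over stdin and reads output rows from
    # stdout, one record per line.
    for line in sys.stdin:
        print(map_line(line))
```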
Join Query: sample
1. Create table
CREATE TABLE example (ID int, SUBJECT string, PRODUCT string, PERIOD int,
START_TIME int, OPERATION string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Load data (save the file in the related folder)
hive> LOAD DATA LOCAL INPATH "file_txtdata.txt" INTO TABLE example;
3. Join query
select A.*
from example A
join (
select id, max(start_time) as start_time
from example B
where start_time < 25
group by id ) MAXSP
ON A.id=MAXSP.id and A.start_time = MAXSP.start_time;
Using NOT IN / IN in a Hive query
SELECT * FROM example WHERE NOT array_contains(array(7,6,5,4,2,12), id)
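The same array_contains trick covers the positive IN case as well; older Hive versions lacked a native IN operator, so this idiom replaced both (the values shown are arbitrary):

```sql
-- Equivalent of: WHERE id IN (7, 6, 5, 4, 2, 12)
SELECT * FROM example WHERE array_contains(array(7, 6, 5, 4, 2, 12), id);
```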
---THANK YOU---