Apache Hadoop Hive
1. The Apache Hadoop Hive
Omoyayi Ibrahim Omodamilola
Student No.: 20174831
PhD Biomedical Engineering
2. Outline
• Big Data
• History of Databases (NoSQL vs SQL)
• NewSQL databases
• SQL
• NoSQL
• Factors Affecting the Selection of a Database
• Hadoop Hive
– Functions of Hive on Hadoop
– Hive vs Java vs Pig
• Hadoop Distributed File System
• Hive Architecture
• Workflow of Hive
• List of References
4. INTRODUCTION
• Apache Hive was initiated by Facebook in 2007 in response to its rapid data growth.
• Facebook's existing ETL system began to fail within a few years as more people joined Facebook.
• In August 2008, Facebook decided to move to a more scalable open-source Hadoop environment, on top of which Hive was built.
• Facebook, Netflix, and Amazon now support Apache Hive and its SQL dialect, known as HiveQL.
6. NEW STRUCTURED QUERY LANGUAGE
NewSQL
• Relational + NoSQL
• Designed for web-scale applications
• Provides many of the traditional SQL operations
A class of modern relational database management systems that seeks to provide the scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
7. RELATIONAL DATABASES SQL
• Structured Query Language (SQL)
• A relational database consists of two or more tables with columns and rows
• The relationship between tables and field types is called a schema
• SQL is a programming language used by database architects to design relational databases (e.g. MySQL, Sybase, Oracle, or IBM DB2)
• These databases are well understood and widely supported
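The table-and-schema idea above can be sketched with Python's standard-library sqlite3 module; the `users` and `posts` tables and their contents are invented for illustration:

```python
import sqlite3

# An in-memory relational database: tables with columns and rows,
# related through a foreign key -- the relationships form the schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'ada')")
cur.execute("INSERT INTO posts VALUES (1, 1, 'hello')")

# A join follows the relationship defined by the schema.
row = cur.execute(
    "SELECT u.name, p.title FROM users u JOIN posts p ON p.user_id = u.id"
).fetchone()
print(row)  # ('ada', 'hello')
```

The same CREATE/INSERT/SELECT statements would run, with minor dialect differences, on any of the relational systems listed on the next slide.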
8. Popular SQL databases and RDBMS’s
• MySQL—the most popular open-source database
• Oracle—an object-relational DBMS written in the C++ language.
• IBM DB2—a family of database server products from IBM that are
built to handle advanced “big data” analytics.
• Sybase—a relational model database server product for
businesses, primarily used on Unix and Linux
• MS SQL Server—a Microsoft-developed RDBMS for enterprise-
level databases that supports both SQL and NoSQL architectures.
• Microsoft Azure—a cloud computing platform that supports any
operating system, and lets you store, compute, and scale data
• MariaDB—an enhanced, drop-in version of MySQL.
• PostgreSQL—an enterprise-level, object-relational DBMS that uses
procedural languages like Perl and Python.
9. NOSQL DATABASES
• Easy to access
• Greater flexibility
• Document-oriented data
• Massive amounts of data
• Unclear data requirements
• Data Includes: sensor data, social sharing, personal
settings, photos, location-based information, online
activity, usage metrics, etc.
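The schema-free, document-oriented storage described above can be illustrated with plain JSON in Python; the documents below are invented examples:

```python
import json

# "Documents" in a NoSQL store: each record can carry different
# fields, with no fixed table schema to design or migrate.
docs = [
    {"user": "ada", "photos": ["a.jpg"], "location": "Berlin"},
    {"user": "bob", "sensor": {"temp_c": 21.5}},  # different shape, same store
]

# Stored and exchanged as JSON, as many document databases do.
stored = [json.dumps(d) for d in docs]
loaded = [json.loads(s) for s in stored]
print(loaded[1]["sensor"]["temp_c"])  # 21.5
```

This flexibility is what makes such stores a fit for sensor data, social sharing, and other data whose shape is not known up front.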
11. POPULAR NOSQL DATABASES
• MongoDB—the most popular NoSQL system
• Apache CouchDB—a true DB for the web; it uses the JSON data-exchange format to store its documents
• HBase—another Apache project; an open-source, non-relational “column store” developed as part of Hadoop
• Oracle NoSQL—Oracle’s entry into the NoSQL category.
• Apache Cassandra—born at Facebook, built to handle massive amounts of structured data. Used by Instagram, Comcast, Apple, and Spotify.
• Riak—it has fault-tolerant replication and automatic data distribution built in for excellent performance.
13. SQL
Pros
• Relational databases work with structured data.
• They support ACID (Atomicity, Consistency, Isolation, Durability) transactional consistency.
• They come with built-in data integrity and a large ecosystem.
• Relationships in this system have constraints.
• There is limitless indexing.
• Strong SQL.
Cons
• Relational databases do not scale out horizontally very well (concurrency and data size), only vertically.
• Data is normalized, meaning lots of joins, which affects speed.
• They have problems working with semi-structured data.
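The ACID guarantee listed among the pros can be seen with Python's sqlite3; the `accounts` table and the simulated failure are invented for this sketch:

```python
import sqlite3

# Atomicity: either the whole transaction commits or none of it does.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
db.commit()

try:
    with db:  # the connection context manager wraps one transaction
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass  # the with-block rolled the partial update back

# 'a' still holds 100: the half-finished transfer never became visible.
print(db.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0])  # 100
```

The "weaker or eventual consistency" noted on the next slide is exactly what a system gives up when it trades this guarantee for horizontal scale.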
14. NoSQL
Pros
• They scale out horizontally and work with unstructured and semi-structured data.
• Some support ACID transactional consistency.
• Schema-free or schema-on-read options.
• High availability.
• Databases are open source and so “free”.
• Numerous commercial products available.
Cons
• Data is denormalized, requiring mass updates (e.g. a product name change).
• Weaker or eventual consistency instead of ACID.
• No built-in data integrity (must be handled in code).
• Language training, setup, and development costs.
• Limited support.
15. Hadoop
• Facebook, Google, Yahoo, Amazon, and Microsoft
• Exponential growth of data
• Doug Cutting developed Hadoop, an open-source implementation of the MapReduce system
• Hadoop is a software ecosystem that allows for massively parallel computing
• A large data-processing job that might take 20 hours on a relational database may take only about 3 minutes with Hadoop
• Hive’s query language, HQL, looks like familiar SQL
17. Hive is not
• A relational database
• Designed for online transaction processing (OLTP)
• A language for real-time queries and row-level updates
18. FUNCTIONS OF HIVE ON HADOOP
• Data Warehouse system built on top of Hadoop
• Takes advantage of Hadoop’s processing power
• Facilitates data summarization, ad-hoc queries, and analysis of large datasets stored in Hadoop
• Provides a SQL interface (known as Hive QL, or HQL) that is widely familiar to most programmers
• Saves time compared with writing Hadoop MapReduce programmes directly
• Provides a mechanism to project structure onto Hadoop datasets
• Loads fast and allows flexibility, at the cost of query time
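As a rough analogy for “projecting structure onto datasets”, the Python sketch below (with invented data and column names) applies a schema only at read time: the raw file is loaded as-is, and the cost of interpreting it is paid when querying.

```python
import csv
import io

# Raw bytes on "disk": just delimited text, no schema attached.
raw = "2023-01-01,click,42\n2023-01-02,view,7\n"

# The schema is projected onto the data only when it is read,
# which is why loading is fast and querying carries the cost.
schema = ["date", "event", "count"]
rows = [dict(zip(schema, rec)) for rec in csv.reader(io.StringIO(raw))]

# A query over the freshly structured rows.
total = sum(int(r["count"]) for r in rows if r["event"] in {"click", "view"})
print(total)  # 49
```

Hive does the analogous thing at much larger scale: table definitions describe how to interpret files already sitting in HDFS, rather than forcing data through a load-time schema.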
19. Apache frameworks
• Sqoop: used to import and export data between HDFS and an RDBMS.
• Pig: a procedural language platform used to develop scripts for MapReduce operations.
• Hive: a platform used to develop SQL-type scripts to perform MapReduce operations
20. Hive vs Java and Pig
Java
• Word-count MapReduce example: list the words and the number of occurrences in a document
• Java takes 63 lines of code to write this; Hive takes only 7 easy lines.
Pig
• High-level programming language
• Good for ETL
• Powerful transformation capabilities
• Often used in combination with Hive.
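The word-count job above can be sketched in plain Python, mirroring the map → shuffle/sort → reduce stages that Hadoop runs in parallel across a cluster (the input document is invented):

```python
from itertools import groupby

doc = "to be or not to be"

# Map stage: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in doc.split()]

# Shuffle/sort stage: group identical keys together.
mapped.sort(key=lambda kv: kv[0])

# Reduce stage: sum the counts for each word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts["to"], counts["be"])  # 2 2
```

In Hive the same computation is a short GROUP BY query over a words table, which is the brevity advantage the slide's 63-vs-7-lines comparison is pointing at.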
22. HIVE DIRECTORY STRUCTURE
• Lib directory
– $HIVE_HOME/lib
– Location of the Hive JAR files
– Contains the actual Java code that implements the Hive functionality
• Bin directory
– $HIVE_HOME/bin
– Location of Hive scripts/services
• Conf directory
– $HIVE_HOME/conf
– Location of configuration files
23. Summary & Conclusion
• Hive is a data warehouse infrastructure tool to process
structured data in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes querying and analysis easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.