2. Agenda
• Introduction
• HBase vs RDBMS
• HBase vs HDFS
• HBase Architecture
• HBase with Hive
• HBase with Java
• HBase with MapReduce
3. Introduction to HBase
HBase is a NoSQL, non-relational, distributed, column-oriented database built on top of
Hadoop.
NoSQL - NoSQL databases are databases that do not use SQL as their query language.
HBase Daemons
Daemons are services that run on individual machines and communicate with each other.
HMaster - the master server of HBase; holds all the metadata.
HRegionServer - the slave server of HBase; holds the actual data.
HQuorumPeer - the ZooKeeper daemon that provides the coordination service.
Advantages of using HBase
Provides a highly scalable database with native integration with Hadoop.
Nodes can be added on the fly.
4. HBase vs RDBMS
Relational Database
• Is based on a fixed schema
• Is a row-oriented datastore
• Is designed to store normalized data
• Contains thin tables
• Has no built-in support for partitioning
HBase
• Is schema-less
• Is a column-oriented datastore
• Is designed to store denormalized data
• Contains wide and sparsely populated tables
• Supports automatic partitioning
5. HBase vs HDFS
HDFS
• Is suited for high-latency batch processing operations
• Data is primarily accessed through MapReduce
• Is designed for batch processing and hence has no concept of random reads/writes
HBase
• Is built for low-latency operations
• Provides access to single rows from billions of records
• Data is accessed through shell commands, client APIs in Java, REST, Avro, or Thrift (see the Java sketch below)
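
As a rough illustration of the low-latency, single-row access mentioned above, here is a minimal sketch using the HBase Java client API. The table name "customers", the rowkey, and the column names are hypothetical; configuration and error handling are left at the defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) { // hypothetical table
            Get get = new Get(Bytes.toBytes("cust-00042"));                 // random read by rowkey
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
            System.out.println("Name: " + (name == null ? "<not found>" : Bytes.toString(name)));
        }
    }
}
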
7. RDBMS (B+ Tree)
• An RDBMS uses a B+ tree to organize its indexes, as shown in the figure.
• These B+ trees are often 3-level n-way balanced trees. The nodes of a B+ tree are
blocks on disk, so an update in an RDBMS typically needs five disk operations
(three to walk the B+ tree and find the block of the target row, one to read the
target block, and one to write the updated data).
• In an RDBMS, data is written randomly as a heap file on disk, but random data
blocks decrease read performance.
That is why we need the B+ tree index. The B+ tree fits data reads well but is not
efficient for data updates. For large distributed data sets, the B+ tree is no match
for the LSM-tree (used in HBase).
9. HBase (LSM Tree)
LSM-trees can be viewed as n-level merge trees. They transform random writes into
sequential writes using a log file and an in-memory store.
Data Write (insert, update): Data is first written sequentially to the log file, then to the
in-memory store, where it is organized as a sorted tree, similar to a B+ tree. When the
in-memory store fills up, the tree in memory is flushed to a store file on disk. The store
files on disk are arranged like B+ trees, but they are optimized for sequential disk access.
Data Read: The in-memory store is searched first, then the store files on disk.
Data Delete: The data record is given a "delete marker"; the system does housekeeping
work in the background by merging store files into larger ones to reduce disk seeks. A
marked record is deleted permanently during this housekeeping.
LSM-tree updates are applied in memory rather than through random disk access, so they
are faster than B+ tree updates. When reads mostly target recently written data, LSM-trees
reduce disk seeks and improve performance. When disk I/O is the dominant cost, the
LSM-tree is more suitable than the B+ tree. A toy sketch of this write path follows.
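
To make the write path concrete, here is a toy, self-contained Java sketch of the LSM idea: writes are appended to a log sequentially, updates land in a sorted in-memory tree, and the tree is flushed to an immutable store file when it fills up. This only illustrates the concept, not HBase's actual implementation; the class and file names are made up.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM-style store: sequential log append + sorted in-memory store + flush to store files.
public class ToyLsmStore {
    private static final int FLUSH_THRESHOLD = 4;                     // tiny limit, for illustration
    private final TreeMap<String, String> memStore = new TreeMap<>(); // sorted in-memory tree
    private int flushCount = 0;

    public void put(String key, String value) throws IOException {
        appendToLog(key, value);                 // 1. sequential write to the log file
        memStore.put(key, value);                // 2. update the sorted in-memory store
        if (memStore.size() >= FLUSH_THRESHOLD) {
            flush();                             // 3. flush the full tree to a store file on disk
        }
    }

    public String get(String key) {
        // Reads check the in-memory store first; a real system would then
        // search the on-disk store files (omitted here).
        return memStore.get(key);
    }

    private void appendToLog(String key, String value) throws IOException {
        try (FileWriter log = new FileWriter("toy-wal.log", true)) {  // append-only log
            log.write(key + "=" + value + "\n");
        }
    }

    private void flush() throws IOException {
        try (FileWriter storeFile = new FileWriter("storefile-" + (flushCount++) + ".txt")) {
            for (Map.Entry<String, String> e : memStore.entrySet()) { // keys come out sorted
                storeFile.write(e.getKey() + "=" + e.getValue() + "\n");
            }
        }
        memStore.clear();                        // in-memory store starts empty again
    }
}
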
13. HBase Data Model
Tables – HBase tables are more like logical collections of rows stored in separate
partitions called Regions.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique within a table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and the Columns in a family are stored together in a
low-level storage file known as an HFile.
The table above shows the Customer and Sales Column Families. The Customer Column
Family is made up of 2 columns – Name and City, whereas the Sales Column Family is made
up of 2 columns – Product and Amount.
14. HBase Data Model
Columns – A Column Family is made up of one or more columns. A Column is identified by a
Column Qualifier that consists of the Column Family name concatenated with the Column
name using a colon – for example: columnfamily:columnname. There can be multiple Columns
within a Column Family, and Rows within a table can have a varying number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family,
and Column (Column Qualifier). The data stored in a Cell is called its value, and the data
type is always treated as byte[].
Version – The data stored in a cell is versioned, and versions of data are identified by their
timestamp. The number of versions of data retained in a column family is configurable;
the default value is 3.
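
The model above maps directly onto the Java client API. The following minimal sketch (table name, rowkey, and values are hypothetical, and it assumes the HBase 2.x builder API) creates a table with the Customer and Sales column families, keeps up to 3 versions per cell in the Customer family, and writes one row.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("customer_sales");        // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("Customer"))
                    .setMaxVersions(3)                                    // keep up to 3 versions per cell
                    .build())
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("Sales"))
                .build();
            admin.createTable(desc);

            try (Table table = conn.getTable(name)) {
                Put put = new Put(Bytes.toBytes("cust-001"));             // rowkey
                // Cells are addressed as columnfamily:columnname; all values are byte[].
                put.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("John"));
                put.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("City"), Bytes.toBytes("Pune"));
                put.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Product"), Bytes.toBytes("Chairs"));
                put.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Amount"), Bytes.toBytes("500"));
                table.put(put);
            }
        }
    }
}
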
19. Region Server Architecture
It contains several components:
1. One Block Cache, an LRU cache used for data reads.
2. One WAL (Write-Ahead Log): HBase uses a Log-Structured Merge-Tree (LSM tree) to
process data writes. Each data update or delete is written to the WAL first and then to
the MemStore. The WAL is persisted on HDFS.
3. Multiple HRegions: each HRegion is a partition of a table, as described in the Data Model
section above.
4. In an HRegion: multiple HStores, where each HStore corresponds to a Column Family.
5. In an HStore: one MemStore, which holds updates or deletes before they are flushed to
disk, and multiple StoreFiles, each of which corresponds to an HFile.
6. An HFile is immutable, flushed from the MemStore, and persisted on HDFS.
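
The sizes of the Block Cache and the MemStore flush threshold described above are configurable. As a hedged illustration, the sketch below sets the stock HBase property names hfile.block.cache.size and hbase.hregion.memstore.flush.size on a Configuration object; the values are examples only, and in practice these settings live in hbase-site.xml and are picked up by the region servers at startup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionServerTuning {
    public static void main(String[] args) {
        // Illustrative values only; normally set in hbase-site.xml, not in code.
        Configuration conf = HBaseConfiguration.create();
        conf.setFloat("hfile.block.cache.size", 0.4f);                          // fraction of heap for the Block Cache
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);  // flush a MemStore at ~128 MB
        System.out.println("Block cache fraction = " + conf.getFloat("hfile.block.cache.size", 0.25f));
    }
}
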
The classical data pipelines bring in a data feed, then clean and transform it. A common example of such a feed is the logs from Yahoo!'s web servers. These logs undergo a cleaning step in which bots, company-internal views, and clicks are removed. We also apply transformations such as, for each click, finding the page view that preceded that click.
Pig-SQL
Pig Latin is procedural, whereas SQL is declarative.
Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
Pig Latin supports splits in the pipeline.
Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.