2. Since 1970, the RDBMS has been the standard solution for data storage and
maintenance problems.
After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system to store big data, and
MapReduce to process it.
Hadoop excels at storing and processing huge volumes of data in various
formats: structured, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data is accessed
only in a sequential manner. That means one has to scan the
entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset,
which must also be processed sequentially. At this point, a
new solution is needed to access any point of data in a single unit
of time (random access).
3. What is HBase?
HBase is an open-source, sorted-map data store built
on Hadoop. It is column oriented and horizontally
scalable.
It is based on Google's Bigtable.
It has a set of tables which keep data in key-value
format.
HBase is well suited for sparse data sets, which are
very common in big data use cases.
HBase provides APIs enabling development in
practically any programming language.
It is a part of the Hadoop ecosystem that provides random,
real-time read/write access to data in the Hadoop File System.
4. Why HBase?
•RDBMSs get exponentially slow as the data becomes
large.
•They expect data to be highly structured, i.e. able to fit
in a well-defined schema.
•Any change in schema might require downtime.
•For sparse datasets, there is too much overhead in
maintaining NULL values.
5. Features of HBase:
•Horizontally scalable: capacity grows by adding nodes, and you can add any number of columns at any time.
•Automatic failover: automatic failover allows the system to switch data handling to a
standby server automatically in the event of a node failure or
compromise.
•Integration with the MapReduce framework: all the commands and Java APIs internally
use MapReduce to do their work, and HBase is built over the Hadoop Distributed File
System.
•It is a sparse, distributed, persistent, multidimensional sorted map, indexed by row key,
column key, and timestamp.
•It is often referred to as a key-value store, a column-family-oriented database, or a store of
versioned maps of maps.
•Fundamentally, it is a platform for storing and retrieving data with random access.
•It doesn't care about datatypes (you can store an integer in one row and a string in another for
the same column).
•It doesn't enforce relationships within your data.
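The "sparse, multidimensional sorted map indexed by row key, column key, and timestamp" description can be sketched in plain Python. This is a conceptual model of the data layout only, not the real HBase API; the class and method names here are made up for illustration:

```python
from collections import defaultdict

class SortedCellMap:
    """Conceptual sketch of HBase's data model: a sparse map of
    row key -> column key -> timestamp -> value."""

    def __init__(self):
        # Only cells that actually exist are stored,
        # so empty columns cost nothing (sparseness).
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.cells[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells[row][column]
        if not versions:
            return None
        # With no timestamp given, the newest version is returned,
        # mirroring HBase's default behaviour.
        ts = timestamp if timestamp is not None else max(versions)
        return versions.get(ts)

    def scan(self):
        # Rows come back in sorted row-key order, as in an HBase scan.
        for row in sorted(self.cells):
            yield row, dict(self.cells[row])

m = SortedCellMap()
m.put("row1", "cf:name", 100, "Paul Walker")
m.put("row1", "cf:name", 200, "P. Walker")   # a newer version of the same cell
print(m.get("row1", "cf:name"))              # newest version: P. Walker
print(m.get("row1", "cf:name", 100))         # explicit old version: Paul Walker
```

Note how both versions of the cell coexist, keyed by timestamp; this is the "versioned maps of maps" idea from the bullet list above.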
6. It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase.
Data consumers read/access the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.
7. Where to Use HBase?
•Apache HBase is used to have random, real-time read/write
access to Big Data.
•It hosts very large tables on top of clusters of commodity
hardware.
•Apache HBase is a non-relational database modelled after
Google's Bigtable.
•Just as Bigtable acts upon the Google File System, Apache
HBase works on top of Hadoop and HDFS.
11. HBase is a column-oriented NoSQL database.
Although it looks similar to a relational database, which contains rows
and columns, it is not a relational database.
Relational databases are row oriented, while HBase is column
oriented.
So, let us first understand the difference between Column-oriented
and Row-oriented databases:
12. 1. Row-Oriented NoSQL:
•Row-oriented databases store table records in a sequence of rows.
•To better understand it, let us take an example and consider the table below.
•If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases, data is stored on the basis of rows or tuples, as you can see above.
13. Column-Oriented NoSQL:
Column-oriented databases, by contrast, store table records in a sequence of
columns, i.e. the entries in a column are stored in contiguous locations on
disk.
In a column-oriented database, all the values of a column are stored together:
the first column's values are stored together, then the second column's values
are stored together, and data in the other columns is stored in a similar
manner.
A column-oriented database stores this data as:
1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
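The two layouts can be contrasted with a short Python sketch using the same two sample records. This is only an illustration of how the fields would be serialized, not how any particular database lays out its pages:

```python
records = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented layout: each record's fields are contiguous.
row_layout = [field for record in records for field in record]

# Column-oriented layout: each column's values are contiguous.
column_layout = [record[i] for i in range(len(records[0])) for record in records]

print(row_layout)
# [1, 'Paul Walker', 'US', 231, 'Gallardo', 2, 'Vin Diesel', 'Brazil', 520, 'Mustang']
print(column_layout)
# [1, 2, 'Paul Walker', 'Vin Diesel', 'US', 'Brazil', 231, 520, 'Gallardo', 'Mustang']
```

Scanning a single column (say, all the point totals) touches one contiguous slice of `column_layout`, which is why analytic workloads favor the column-oriented layout.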
14. When the amount of data is very huge, in terms of petabytes or exabytes, we
use the column-oriented approach, because the data of a single column is stored
together and can be accessed faster.
The row-oriented approach comparatively handles smaller numbers of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
When we need to process and analyze a large set of semi-structured or
unstructured data, we use the column-oriented approach: applications dealing
with Online Analytical Processing (OLAP), like data mining, data warehousing,
and other analytics applications.
Whereas Online Transaction Processing (OLTP) applications, such as banking and finance
domains, which handle structured data and require transactional (ACID)
properties, use the row-oriented approach.
15. HBase tables have the following components, shown in
the image below:
16. •Tables: Data is stored in a table format in HBase. But here tables
are in column-oriented format.
•Row Key: Row keys are used to search records, which
makes searches fast.
•Column Families: Various columns are combined in a column
family. These column families are stored together, which makes
reads faster, because data belonging to the same column family can be
accessed in a single seek.
•Column Qualifiers: Each column's name is known as its column
qualifier.
•Cell: Data is stored in cells. The data is dumped into cells, which
are specifically identified by row key and column qualifier.
•Timestamp: A timestamp is a combination of date and time.
Whenever data is stored, it is stored with its timestamp, which
identifies a particular version of the data.
17. In a simpler and more understandable way, we
can say HBase consists of:
•A set of tables
•Each table with column families and rows
•A row key, which acts as a primary key in HBase
•Any access to HBase tables uses this primary key
•Each column qualifier present in HBase denotes an attribute
corresponding to the object which resides in the cell
20. HBase architecture consists mainly of five
components:
•1) HMaster
•2) HRegionserver
•3) HRegions
•4) Zookeeper
•5) HDFS
21. 1) HMaster:
HMaster in HBase is the implementation of a Master server in
HBase architecture.
It acts as a monitoring agent to monitor all Region Server
instances present in the cluster and acts as an interface for all the
metadata changes.
In a distributed cluster environment, Master runs on NameNode.
Master runs several background threads.
22. The following are important roles performed by HMaster
in HBase.
1. HMaster plays a vital role in terms of performance and in
maintaining nodes in the cluster.
2. HMaster provides admin services and distributes them
to different region servers.
3. HMaster assigns regions to region servers.
4. HMaster has features like controlling load balancing and
failover to handle the load over the nodes present in the cluster.
5. When a client wants to change any schema or any
metadata operations, HMaster takes responsibility for these
operations.
23. Some of the methods exposed by the HMaster interface are
primarily metadata-oriented methods:
•Table (createTable, removeTable, enable, disable)
•ColumnFamily (addColumn, modifyColumn)
•Region (move, assign)
The client communicates in a bidirectional way with both HMaster
and ZooKeeper.
For read and write operations, it directly contacts the HRegion
servers.
HMaster assigns regions to region servers and, in turn,
checks the health status of region servers.
In the entire architecture, we have multiple region servers.
An HLog is present in each region server, which stores all the log
files.
24. 2) HBase Region Servers
When the HBase Region Server receives write and read requests from the client, it
assigns the request to a specific region, where the actual column family resides.
The client can directly contact HRegion servers; there is no need for
mandatory HMaster permission for the client to communicate with
HRegion servers.
The client requires HMaster's help when operations related to metadata and schema
changes are required.
HRegionServer is the Region Server implementation.
It is responsible for serving and managing the regions, or data, present in the
distributed cluster. The region servers run on the Data Nodes present in the Hadoop
cluster.
HMaster can get in contact with multiple HRegion servers, and a Region Server
performs the following functions:
•Hosting and managing regions
•Splitting regions automatically
•Handling read and write requests
•Communicating with the client directly
26. A Region Server maintains various regions running on top of HDFS.
Components of a Region Server are:
•WAL: As you can conclude from the above image, the Write Ahead Log (WAL) is a
file attached to every Region Server inside the distributed environment. The WAL
stores the new data that hasn't been persisted or committed to permanent storage. It is used
in case of failure to recover the data sets.
•Block Cache: From the above image, it is clearly visible that the Block Cache
resides at the top of the Region Server. It stores the frequently read data in memory.
If the data in the BlockCache is least recently used, then that data is removed from
the BlockCache.
•MemStore: It is the write cache. It stores all the incoming data before
committing it to the disk or permanent memory. There is one MemStore for each
column family in a region. As you can see in the image, there are multiple MemStores for a
region because each region contains multiple column families. The data is sorted in lexicographical order
before being committed to the disk.
•HFile: From the above figure you can see that HFiles are stored on HDFS. Thus HBase
stores the actual cells on the disk. The MemStore commits its data to an HFile when
its size exceeds the threshold.
27. 3) HBase Regions
HRegions are the basic building elements of an HBase cluster; they consist of the
distribution of tables and are comprised of column families.
A region contains multiple stores, one for each column family.
It consists of mainly two components: the MemStore and the HFile.
So, concluding in a simpler way:
•A table can be divided into a number of regions. A region is a sorted range of rows
storing data between a start key and an end key.
•A region has a default size of 256 MB, which can be configured according to need.
•A group of regions is served to the clients by a Region Server.
•A Region Server can serve approximately 1000 regions to the client.
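Because each region is a sorted range of rows, locating the region (and therefore the Region Server) for a given row key is a binary search over the region start keys. The sketch below illustrates that lookup; the region boundaries and server names are made up, and this is not how the HBase client library is actually invoked:

```python
import bisect

# Hypothetical regions: each serves a sorted range [start_key, end_key).
# An empty end key marks the end of the table.
regions = [
    ("", "g", "regionserver-1"),
    ("g", "p", "regionserver-2"),
    ("p", "", "regionserver-3"),
]

start_keys = [r[0] for r in regions]

def locate_region(row_key):
    """Find the region whose [start, end) range contains row_key,
    roughly the way a client resolves a row via the META table."""
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[idx]

print(locate_region("apple"))   # falls in ["", "g") -> regionserver-1
print(locate_region("mango"))   # falls in ["g", "p") -> regionserver-2
print(locate_region("zebra"))   # falls in ["p", "") -> regionserver-3
```

In real HBase the client caches these lookups, so most requests go straight to the right Region Server without consulting META again.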
28. 4) ZooKeeper:
HBase's ZooKeeper is a centralized monitoring server
which maintains configuration information and
provides distributed synchronization.
Distributed synchronization means accessing the
distributed applications running across the cluster,
with the responsibility of providing coordination
services between nodes.
If the client wants to communicate with regions, the
client has to approach ZooKeeper first.
It is an open-source project, and it provides many
important services.
30. •ZooKeeper acts like a coordinator inside the HBase distributed
environment. It helps in maintaining server state inside the
cluster by communicating through sessions.
•Every Region Server, along with the HMaster server, sends a
continuous heartbeat at regular intervals to ZooKeeper, and ZooKeeper
checks which servers are alive and available, as mentioned in the
above image. It also provides server failure notifications so
that recovery measures can be executed.
•Referring to the above image you can see there is an
inactive server, which acts as a backup for the active server. If
the active server fails, it comes to the rescue.
31. •The active HMaster sends heartbeats to ZooKeeper, while
the inactive HMaster listens for the notifications sent by the active
HMaster. If the active HMaster fails to send a heartbeat, its
session is deleted and the inactive HMaster becomes active.
•If a Region Server fails to send a heartbeat, its session
expires and all listeners are notified about it. Then HMaster
performs suitable recovery actions, which we will discuss later
in this blog.
•ZooKeeper also maintains the .META server's path, which
helps any client in searching for any region. The client first has
to check with the .META server to find which Region Server a region
belongs to, and it gets the path of that Region Server.
33. •The META table is a special HBase catalog table. It maintains a list of
all the Region Servers in the HBase storage system, as you can see
in the above image.
•Looking at the figure you can see that the .META file maintains the table in the
form of keys and values. A key represents the start key of a region and
its id, whereas the value contains the path of the Region Server.
34. Services provided by ZooKeeper:
•Maintains configuration information
•Provides distributed synchronization
•Establishes client communication with region servers
•Provides ephemeral nodes, which represent different
region servers
•Master servers use the ephemeral nodes to discover available
servers in the cluster
•Tracks server failures and network partitions
35. 5) HDFS
HDFS is the Hadoop Distributed File System; as the name implies, it provides a
distributed environment for storage, and it is a file system designed
to run on commodity hardware.
It stores each file in multiple blocks and, to maintain fault tolerance, the blocks
are replicated across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity
hardware.
By adding nodes to the cluster and performing processing and storage on
cheap commodity hardware, it gives the client better results compared
to the existing system.
HDFS is in contact with the HBase components and stores a large
amount of data in a distributed manner.
38. HBase is a column-oriented database, and data is stored in tables.
The tables are sorted by RowId. As shown above, HBase has a RowId, which is the
collection of several column families that are present in the table.
The column families that are present in the schema are key-value pairs. If we
observe in detail, each column family has multiple columns. The
column values are stored on disk.
Each cell of the table has its own metadata, like timestamp and other
information.
Coming to HBase, the following are the key terms representing the table
schema:
•Table: Collection of rows present.
•Row: Collection of column families.
•Column Family: Collection of columns.
•Column: Collection of key-value pairs.
•Namespace: Logical grouping of tables.
•Cell: A {row, column, version} tuple exactly specifies a cell definition in HBase.
41. Three mechanisms are followed to handle requests in the HBase
architecture:
1. Commence the Search in HBase Architecture
2. Write Mechanism in HBase Architecture
3. Read Mechanism in HBase Architecture
42. 1. Commence the Search in HBase
Architecture
The steps to initialize the search are:
1. The user retrieves the META table's location from ZooKeeper
and then requests the location of the relevant
Region Server.
2. Then the user requests the exact data from the
Region Server with the help of the RowKey.
44. The write mechanism goes through the following process
sequentially (refer to the above image):
Step 1:
Whenever the client has a write request, the client writes the data to the
WAL (Write Ahead Log).
•The edits are then appended at the end of the WAL file.
•This WAL file is maintained on every Region Server, and the Region Server
uses it to recover data which is not committed to the disk.
Step 2:
Once data is written to the WAL, it is copied to the MemStore.
Step 3:
Once the data is placed in the MemStore, the client receives the
acknowledgment.
Step 4:
When the MemStore reaches the threshold, it dumps or commits the data
into an HFile.
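The four steps above can be sketched as a toy Python class. This is a simplified simulation of the write path (WAL append, then MemStore, then flush), not real HBase code; the class name and the flush threshold are invented for illustration:

```python
class RegionSketch:
    """Toy sketch of the HBase write path:
    WAL append -> MemStore -> flush to an HFile."""

    def __init__(self, flush_threshold=3):
        self.wal = []                 # Write Ahead Log: edits land here first
        self.memstore = {}            # write cache, sorted only when flushed
        self.hfiles = []              # immutable sorted files on "disk"
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        # Step 1: append the edit to the WAL for crash recovery.
        self.wal.append((row_key, value))
        # Step 2: copy the edit into the MemStore.
        self.memstore[row_key] = value
        # Step 3: the client would receive its acknowledgment here.
        # Step 4: flush when the MemStore reaches its threshold.
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The MemStore is dumped as a new HFile in sorted row-key order.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

region = RegionSketch()
for key in ["c", "a", "b"]:
    region.put(key, f"value-{key}")

print(region.hfiles)    # one HFile, sorted: [('a', ...), ('b', ...), ('c', ...)]
print(region.memstore)  # {} — emptied by the flush
```

Note that the keys were written out of order but the flushed HFile is sorted, which is what makes the later read-side index lookups cheap.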
45. HBase Write Mechanism: MemStore
•The MemStore always stores its data in
lexicographical order (sequentially, in a dictionary manner) as sorted
Key-Values. There is one MemStore for each column family, and thus
the updates are stored in a sorted manner for each column family.
•When the MemStore reaches the threshold, it dumps all the data into
a new HFile in a sorted manner. This HFile is stored in HDFS. HBase
contains multiple HFiles for each column family.
•Over time, the number of HFiles grows as the MemStore dumps its data.
•The MemStore also saves the last written sequence number, so the Master
Server and the MemStore both know what is committed so far and
where to start from. When a region starts up, the last sequence number
is read, and from that number, new edits start.
46. HBase Architecture: HBase Write Mechanism: HFile
•The writes are placed sequentially on the disk. Therefore, the
movement of the disk's read-write head is very small. This makes the
write and search mechanism very fast.
•The HFile indexes are loaded into memory whenever an HFile is
opened. This helps in finding a record in a single seek.
•The trailer is a pointer which points to the HFile's meta block. It
is written at the end of the committed file. It contains
information about the timestamps and bloom filters.
•A Bloom Filter helps in searching for key-value pairs: it skips any file
which does not contain the required rowkey. The timestamp also
helps in finding a version of the file; it helps in skipping irrelevant
data.
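The Bloom filter idea can be illustrated with a minimal standalone sketch. A Bloom filter can answer "definitely absent" or "possibly present", which lets a reader skip HFiles that cannot contain a row key. This is an illustrative toy, not HBase's actual implementation (HBase uses its own tuned filters inside HFiles):

```python
import hashlib

class BloomSketch:
    """Minimal Bloom filter: set k hash-derived bits per key on add;
    a key is 'possibly present' only if all k bits are set."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key is definitely absent (never a false negative);
        # True means the key is only *possibly* present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomSketch()
bf.add("row-123")
print(bf.might_contain("row-123"))  # True: keys that were added always match
```

When `might_contain` returns False for a row key, the whole HFile can be skipped without touching the disk, which is exactly the saving the bullet above describes.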
47. 3. Read Mechanism in HBase
Architecture
To read any data, the user first has to access the
relevant Region Server. Once the Region Server is known,
the process includes:
1.The first scan is made at the read cache, which is the
Block cache.
2.The next scan location is MemStore, which is the write
cache.
3.If the data is not found in block cache or MemStore, the
scanner will retrieve the data from HFile.
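The three-step read order can be sketched as a small Python function. The dictionaries standing in for the BlockCache, MemStore, and HFiles are placeholders for illustration, not real HBase structures:

```python
def read(row_key, block_cache, memstore, hfiles):
    """Sketch of the read order: BlockCache first,
    then MemStore, then the HFiles on disk."""
    if row_key in block_cache:          # 1. scan the read cache
        return block_cache[row_key]
    if row_key in memstore:             # 2. scan the write cache
        return memstore[row_key]
    for hfile in hfiles:                # 3. fall back to the HFiles
        if row_key in hfile:
            return hfile[row_key]
    return None                         # key not present anywhere

block_cache = {"r1": "cached"}
memstore = {"r2": "in-memory"}
hfiles = [{"r3": "on-disk"}]

print(read("r1", block_cache, memstore, hfiles))  # cached
print(read("r2", block_cache, memstore, hfiles))  # in-memory
print(read("r3", block_cache, memstore, hfiles))  # on-disk
```

In real HBase the HFile step is where the Bloom filters and in-memory indexes from slide 46 come in, so most files are skipped rather than scanned.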