HBase Overview
By
Anuja G. Gunale
Since the 1970s, the RDBMS has been the standard solution for data
storage and maintenance problems.
After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses a distributed file system for storing big data, and
MapReduce to process it.
Hadoop excels at storing and processing huge data of various
formats: arbitrary, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data is
accessed only in a sequential manner. That means one has to
search the entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge
dataset, which must also be processed sequentially. At this point, a
new solution is needed to access any point of data in a single unit
of time (random access).
What is HBase?
HBase is an open-source, sorted-map datastore built
on Hadoop. It is column-oriented and horizontally
scalable.
It is based on Google's Bigtable.
It has a set of tables which keep data in key-value
format.
HBase is well suited for sparse data sets, which are
very common in big data use cases.
HBase provides APIs enabling development in
practically any programming language.
It is a part of the Hadoop ecosystem that provides
random, real-time read/write access to data in HDFS.
Why HBase?
•An RDBMS gets exponentially slower as the data becomes
large.
•It expects data to be highly structured, i.e. able to fit
in a well-defined schema.
•Any change in the schema might require downtime.
•For sparse datasets, there is too much overhead in
maintaining NULL values.
Features of HBase:
•Horizontally scalable: You can add any number of columns anytime.
•Automatic failover: Automatic failover allows a system administrator to
automatically switch data handling to a standby system in the event of a
system failure.
•Integration with the MapReduce framework: All the commands and Java code internally
use MapReduce to do the task, and HBase is built over the Hadoop Distributed File
System.
•A sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key,
column key, and timestamp.
•Often referred to as a key-value store, a column-family-oriented database, or a store of
versioned maps of maps.
•Fundamentally, it is a platform for storing and retrieving data with random access.
•It doesn't care about datatypes (storing an integer in one row and a string in another for
the same column), as the sketch below shows.
•It doesn't enforce relationships within your data.
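A minimal sketch of that type-agnostic behavior with the HBase Java client. Everything in a cell is just a byte[], so nothing stops the same column from holding different types per row; the family cf and qualifier value are illustrative names, not from the slides.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class MixedTypes {
    // The same qualifier "value" holds an int in row1 and a string in row2:
    // HBase cells are uninterpreted byte arrays, so no type is enforced.
    static void writeMixedTypes(Table table) throws IOException {
        Put p1 = new Put(Bytes.toBytes("row1"));
        p1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(42));
        Put p2 = new Put(Bytes.toBytes("row2"));
        p2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes("forty-two"));
        table.put(Arrays.asList(p1, p2)); // HBase stores both without complaint
    }
}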
It is a part of the Hadoop ecosystem that provides random, real-time read/write
access to data in the Hadoop File System.
One can store data in HDFS either directly or through HBase.
A data consumer reads/accesses the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.
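As a concrete illustration, here is a hedged sketch of that random read/write access with the HBase 2.x Java client; the employee table and personal column family are invented names, and the configuration is assumed to come from an hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Random write: a cell is addressed by row key + column family + qualifier
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Paul Walker"));
            table.put(put);
            // Random read: fetch the row directly by key, no scan over the dataset
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}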
Where to Use HBase?
•Apache HBase is used to have random, real-time read/write
access to big data.
•It hosts very large tables on top of clusters of commodity
hardware.
•Apache HBase is a non-relational database modelled after
Google's Bigtable.
•Just as Bigtable works on top of the Google File System, Apache
HBase works on top of Hadoop and HDFS.
HBase v/s RDBMS: (comparison table in the original slides)
HBase v/s HDFS: (comparison table in the original slides)
HBase Data Model:
HBase is a column-oriented NoSQL database.
Although it looks similar to a relational database, which contains rows
and columns, it is not a relational database.
Relational databases are row-oriented while HBase is column-oriented.
So, let us first understand the difference between column-oriented
and row-oriented databases:
1. Row-Oriented NoSQL:
•Row-oriented databases store table records in a sequence of rows.
•To better understand this, let us take an example and consider the table below.
•If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases, data is stored on the basis of rows or tuples, as you can see above.
2. Column-Oriented NoSQL:
Column-oriented databases, by contrast, store table records in a sequence of
columns, i.e. the entries in a column are stored in contiguous locations on
disk.
In a column-oriented database, all the values of a column are stored together:
the first column's values are stored together, then the second column's values
are stored together, and the data in the other columns is stored in a similar
manner.
A column-oriented database stores this data as:
1, 2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
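The difference is easy to see in plain Java. This small, self-contained demo serializes the example table both ways and prints the two layouts shown above; it is only an illustration of the on-disk ordering, not HBase code.

public class LayoutDemo {
    public static void main(String[] args) {
        String[][] records = {
            {"1", "Paul Walker", "US", "231", "Gallardo"},
            {"2", "Vin Diesel", "Brazil", "520", "Mustang"},
        };
        // Row-oriented: each record's fields sit next to each other on disk
        StringBuilder rowWise = new StringBuilder();
        for (String[] record : records)
            rowWise.append(String.join(", ", record)).append("; ");
        // Column-oriented: each column's values sit next to each other on disk
        StringBuilder colWise = new StringBuilder();
        for (int c = 0; c < records[0].length; c++)
            for (String[] record : records)
                colWise.append(record[c]).append(", ");
        System.out.println("row-wise:    " + rowWise);
        System.out.println("column-wise: " + colWise);
    }
}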
When the amount of data is very huge, as in petabytes or exabytes, we
use the column-oriented approach, because the data of a single column is stored
together and can be accessed faster.
The row-oriented approach, by comparison, handles a smaller number of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
When we need to process and analyze a large set of semi-structured or
unstructured data, we use the column-oriented approach: applications dealing
with Online Analytical Processing, such as data mining, data warehousing, and
applications including analytics.
Whereas Online Transactional Processing domains, such as banking and finance,
which handle structured data and require transactional (ACID)
properties, use the row-oriented approach.
An HBase table has the following components:
•Tables: Data is stored in a table format in HBase, but here tables
are in a column-oriented format.
•Row Key: Row keys are used to search records, which
makes searches fast; how this works is covered in the
architecture part ahead.
•Column Families: Various columns are combined into a column
family. These column families are stored together, which makes reading
faster, because data belonging to the same column family can be accessed in a
single seek.
•Column Qualifiers: Each column's name is known as its column
qualifier.
•Cell: Data is stored in cells. The data is dumped into cells, which
are specifically identified by row key and column qualifier.
•Timestamp: A timestamp is a combination of date and time.
Whenever data is stored, it is stored with its timestamp. This makes it
easy to retrieve a particular version of the data (see the sketch below).
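A hedged sketch of reading timestamped versions with the HBase Java client; readVersions is the 2.x call (older clients use setMaxVersions), and the table handle plus the personal:city column are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class VersionsDemo {
    // Print up to three timestamped versions of the cell at {row1, personal:city}.
    static void printVersions(Table table) throws IOException {
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
        get.readVersions(3); // HBase 2.x; use setMaxVersions(3) on 1.x clients
        Result result = table.get(get);
        for (Cell cell : result.getColumnCells(Bytes.toBytes("personal"), Bytes.toBytes("city"))) {
            System.out.println(cell.getTimestamp() + " -> "
                + Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}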
To put it more simply, we can say HBase consists of:
•A set of tables
•Each table with column families and rows
•A row key, which acts as a primary key in HBase;
any access to HBase tables uses this primary key.
•Column qualifiers: each column qualifier present in HBase denotes
an attribute corresponding to the object which resides in the cell.
HBase Architecture and its Important
Components:
HBase Architecture:
HBase architecture consists mainly of five
components:
1) HMaster
2) HRegionServer
3) HRegions
4) ZooKeeper
5) HDFS
1) HMaster:
HMaster in HBase is the implementation of the Master server in
the HBase architecture.
It acts as a monitoring agent that monitors all Region Server
instances present in the cluster, and acts as an interface for all
metadata changes.
In a distributed cluster environment, the Master runs on the NameNode.
The Master runs several background threads.
The following are important roles performed by HMaster
in HBase:
1. HMaster plays a vital role in terms of performance and in
maintaining the nodes in the cluster.
2. HMaster provides admin services and distributes them
to the different region servers.
3. HMaster assigns regions to region servers.
4. HMaster has features like controlling load balancing and
failover to handle the load over the nodes present in the cluster.
5. When a client wants to change a schema or any
metadata operations, HMaster takes responsibility for these
operations.
Some of the methods exposed by the HMaster interface are
primarily metadata-oriented methods:
•Table (createTable, removeTable, enable, disable)
•ColumnFamily (addColumn, modifyColumn)
•Region (move, assign)
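A sketch of these table- and column-family-level operations using the client-side Admin API in HBase 2.x builder style; the employee table and family names are illustrative, and such requests are ultimately served by HMaster.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class AdminDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("employee");
            // Table-level operation: create a table with one column family
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .build());
            // ColumnFamily-level operation: add a second family to the table
            admin.addColumnFamily(name, ColumnFamilyDescriptorBuilder.of("professional"));
            // A table must be disabled before it can be removed
            admin.disableTable(name);
            admin.deleteTable(name);
        }
    }
}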
The client communicates in a bi-directional way with both HMaster
and ZooKeeper.
For read and write operations, it directly contacts the HRegion
servers.
HMaster assigns regions to region servers and, in turn, checks the
health status of the region servers.
In the entire architecture, we have multiple region servers.
HLog is present in the region servers and stores all the log
files.
2) HBase Region Servers
When an HBase Region Server receives write and read requests from the client, it
assigns the request to the specific region where the actual column family resides.
However, the client can directly contact HRegion servers; no mandatory
permission from HMaster is needed for the client to communicate with
HRegion servers.
The client requires HMaster's help only when operations related to metadata and schema
changes are required.
HRegionServer is the Region Server implementation.
It is responsible for serving and managing regions, i.e. the data present in a
distributed cluster. The region servers run on the Data Nodes present in the Hadoop
cluster.
HMaster can get into contact with multiple HRegion servers, and a region server
performs the following functions:
•Hosting and managing regions
•Splitting regions automatically
•Handling read and write requests
•Communicating with the client directly
Components of Region Server:
A Region Server maintains various regions running on top of HDFS.
Components of a Region Server are:
•WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside
the distributed environment. The WAL stores the new data that hasn't yet been
persisted or committed to permanent storage. It is used in case of failure to
recover the data sets.
•Block Cache: The Block Cache resides at the top of a Region Server. It stores the
frequently read data in memory. When the data in the Block Cache is least recently
used, that data is removed from the Block Cache.
•MemStore: This is the write cache. It stores all the incoming data before
committing it to the disk or permanent memory. There is one MemStore for each
column family in a region, so there are multiple MemStores for a region, since a
region contains multiple column families. The data is sorted in lexicographical
order before being committed to disk.
•HFile: HFiles are stored on HDFS; they hold the actual cells on the disk.
The MemStore commits its data to an HFile when its size exceeds a threshold.
3) HBase Regions
HRegions are the basic building elements of an HBase cluster; they hold the
distribution of tables and are comprised of column families.
A region contains multiple stores, one for each column family.
Each store consists of mainly two components: the MemStore and HFiles.
So, concluding in a simpler way:
•A table can be divided into a number of regions. A region is a sorted range of rows
storing data between a start key and an end key.
•A region has a default size of 256 MB, which can be configured according to need
(see the sketch below).
•A group of regions is served to the clients by a Region Server.
•A Region Server can serve approximately 1000 regions to the client.
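The region size and the MemStore flush threshold are normally configured in hbase-site.xml; below is a hedged sketch of the equivalent client-side overrides. The property names are the standard HBase keys, while the values are only illustrative (recent HBase releases default to a larger region size than the 256 MB quoted above).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

class RegionTuning {
    static Configuration tunedConf() {
        Configuration conf = HBaseConfiguration.create();
        // Region split threshold: a region splits once a store grows past this size
        conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024); // 10 GB
        // MemStore flush threshold: the write cache is flushed to an HFile past this size
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024); // 128 MB
        return conf;
    }
}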
4) ZooKeeper:
In HBase, ZooKeeper is a centralized monitoring server
which maintains configuration information and
provides distributed synchronization.
Distributed synchronization means accessing the
distributed applications running across the cluster,
with the responsibility of providing coordination
services between nodes.
If a client wants to communicate with region servers, the
client has to approach ZooKeeper first.
ZooKeeper is an open-source project, and it provides
many important services.
•ZooKeeper acts like a coordinator inside the HBase distributed
environment. It helps in maintaining server state inside the
cluster by communicating through sessions.
•Every Region Server, along with the HMaster server, sends a
continuous heartbeat at a regular interval to ZooKeeper, which
checks which servers are alive and available. It also provides
server-failure notifications so that recovery measures can be
executed.
•There is also an inactive HMaster server, which acts as a backup
for the active server. If the active server fails, it comes to the
rescue.
•The active HMaster sends heartbeats to ZooKeeper, while the
inactive HMaster listens for the notifications sent by the active
HMaster. If the active HMaster fails to send a heartbeat, its
session is deleted and the inactive HMaster becomes active.
•Likewise, if a Region Server fails to send a heartbeat, its session
is expired and all listeners are notified about it. Then HMaster
performs suitable recovery actions, which we will discuss later.
•ZooKeeper also maintains the path of the .META server, which
helps any client in searching for any region. The client first has
to check with the .META server to find which Region Server a region
belongs to, and it gets the path of that Region Server.
Meta table:
•The META table is a special HBase catalog table. It maintains a list of
all the Region Servers in the HBase storage system.
•The .META file maintains the table in the form of keys and values. A key
represents the start key of a region and its id, whereas the value contains
the path of the Region Server.
Services provided by ZooKeeper:
•Maintains configuration information
•Provides distributed synchronization
•Establishes client communication with region servers
•Provides ephemeral nodes, which represent different region
servers
•Lets master servers discover available servers in the cluster
via those ephemeral nodes
•Tracks server failures and network partitions
5) HDFS
HDFS, the Hadoop Distributed File System, as the name implies, provides a
distributed environment for storage; it is a file system designed to run on
commodity hardware.
It stores each file in multiple blocks, and to maintain fault tolerance, the blocks
are replicated across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity
hardware.
By adding nodes to the cluster, and by performing processing and storage using
cheap commodity hardware, it gives the client better results than the existing
setup.
HDFS is in contact with the HBase components and stores a large amount of
data in a distributed manner.
Storage Mechanism in HBase
HBase is a column-oriented database and data is stored in tables.
The tables are sorted by row key. HBase has a row key under which the
several column families present in the table are collected.
The column families present in the schema hold key-value pairs. If we
observe in detail, each column family has multiple columns. The
column values are stored sequentially on the disk.
Each cell of the table has its own metadata, like a timestamp and other
information.
In HBase, the following are the key terms representing the table
schema:
•Table: Collection of rows.
•Row: Collection of column families.
•Column Family: Collection of columns.
•Column: Collection of key-value pairs.
•Namespace: Logical grouping of tables (see the sketch below).
•Cell: A {row, column, version} tuple exactly specifies a cell in HBase.
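A hedged sketch mapping these terms onto the HBase 2.x Admin API; the analytics namespace and events table are invented for illustration. Columns (qualifiers) need no declaration up front, since they are created implicitly on write, and a cell is then addressed by {row, column, version}.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SchemaTermsDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Namespace: a logical grouping of tables
            admin.createNamespace(NamespaceDescriptor.create("analytics").build());
            // Table "analytics:events" with one column family "metrics"
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("analytics", "events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("metrics"))
                .build());
        }
    }
}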
Row v/s Column-oriented Database: (comparison table in the original slides)
How are Requests Handled in HBase Architecture?
Three mechanisms are followed to handle requests in the HBase
architecture:
1. Commence the Search in HBase Architecture
2. Write Mechanism in HBase Architecture
3. Read Mechanism in HBase Architecture
1. Commence the Search in HBase
Architecture
The steps to initialize a search are:
1. The user retrieves the location of the META table from ZooKeeper
and then requests the location of the relevant
Region Server.
2. Then the user requests the exact data from the
Region Server with the help of the row key.
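A hedged sketch of this lookup from the client side using RegionLocator, which performs the ZooKeeper and META resolution internally; the employee table and row1 key are illustrative names.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegion {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("employee"))) {
            // The client library consults ZooKeeper and the META table under the hood
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row1"));
            System.out.println("row1 is served by " + loc.getHostname()
                + " in region " + loc.getRegion().getRegionNameAsString());
        }
    }
}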
2. Write Mechanism in HBase Architecture
The write mechanism goes through the following process
sequentially:
Step 1:
Whenever the client has a write request, the client writes the data to the
WAL (Write Ahead Log).
•The edits are then appended at the end of the WAL file.
•This WAL file is maintained on every Region Server, and the Region Server
uses it to recover data which is not committed to the disk.
Step 2:
Once data is written to the WAL, it is then copied to the MemStore.
Step 3:
Once the data is placed in the MemStore, the client receives an
acknowledgment.
Step 4:
When the MemStore reaches its threshold, it dumps or commits the data
into an HFile.
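A hedged sketch of how a client can influence Step 1 on a per-write basis via the Durability hint; the table handle and cell names are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class DurableWrite {
    // Ask the server to sync this edit to the WAL before acknowledging.
    static void durablePut(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("row3"));
        put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("Mumbai"));
        put.setDurability(Durability.SYNC_WAL); // SKIP_WAL would trade safety for speed
        table.put(put);
    }
}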
HBase Write Mechanism: MemStore
•The MemStore always keeps the data stored in it in
lexicographical order (sequentially, in a dictionary manner) as sorted
key-values. There is one MemStore for each column family, so
the updates are stored in a sorted manner per column family.
•When the MemStore reaches its threshold, it dumps all the data into
a new HFile in a sorted manner. This HFile is stored in HDFS; HBase
contains multiple HFiles for each column family.
•Over time, the number of HFiles grows as the MemStore dumps its data.
•The MemStore also saves the last written sequence number, so the
Region Server and the MemStore both know what has been committed so
far and where to start from. When a region starts up, the last sequence
number is read, and from that number, new edits start.
HBase Architecture: HBase Write Mechanism - HFile
•The writes are placed sequentially on the disk. Therefore, the
movement of the disk's read-write head is very small. This makes the
write and search mechanism very fast.
•The HFile indexes are loaded into memory whenever an HFile is
opened. This helps in finding a record in a single seek.
•The trailer is a pointer which points to the HFile's meta block; it
is written at the end of the committed file. It contains
information such as timestamps and Bloom filters.
•A Bloom filter helps in searching key-value pairs: it skips any file
which does not contain the required row key. The timestamp also
helps in searching for a version of the file; it helps in skipping
unneeded data.
3. Read Mechanism in HBase
Architecture
To read any data, the user first has to access the
relevant Region Server. Once the Region Server is known,
the rest of the process includes:
1. The first scan is made of the read cache, which is the
Block Cache.
2. The next scan location is the MemStore, which is the write
cache.
3. If the data is not found in the Block Cache or the MemStore, the
scanner retrieves the data from the HFiles.
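A hedged sketch of the read path from the client side, scanning a sorted row-key range with the HBase 2.x Java client; the employee table and row keys are illustrative, and each row returned is assembled from Block Cache, MemStore, or HFiles as described above.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Scan the sorted key range [row1, row9)
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("row1"))
                .withStopRow(Bytes.toBytes("row9"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}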
Thank You….