2. Agenda
• Introduction
• HBase vs RDBMS
• HBase vs HDFS
• HBase Architecture
• HBase with Hive
• HBase with Java
• HBase with MapReduce
3. Introduction to HBase
HBase is a NoSQL, non-relational, distributed, column-oriented database built on top of
Hadoop.
NoSQL - NoSQL databases are databases that do not use SQL as their query language.
HBase Daemons
Daemons are services that run on individual machines and communicate with each other.
HMaster - the master server of HBase; holds all the metadata.
HRegionServer - the slave server of HBase; holds the actual data.
HQuorumPeer - the ZooKeeper daemon that provides the coordination service.
Advantages of using HBase
Provides a highly scalable database with native integration with Hadoop.
Nodes can be added on the fly.
4. HBase vs RDBMS
Relational Database
• Is based on a fixed schema
• Is a row-oriented datastore
• Is designed to store normalized data
• Contains thin tables
• Has no built-in support for partitioning
HBase
• Is schema-less
• Is a column-oriented datastore
• Is designed to store denormalized data
• Contains wide and sparsely populated tables
• Supports automatic partitioning
5. HBase vs HDFS
HDFS
• Is suited for high-latency batch processing operations
• Data is primarily accessed through MapReduce
• Is designed for batch processing and hence has no concept of random reads/writes
HBase
• Is built for low-latency operations
• Provides access to single rows from billions of records
• Data is accessed through shell commands, client APIs in Java, REST, Avro, or Thrift (see the Java sketch below)
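
As a rough illustration of the low-latency, single-row access mentioned above, here is a minimal sketch using the HBase Java client API. The table name "customers", the rowkey, and the column names are hypothetical; configuration and error handling are left at the defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) { // hypothetical table
            Get get = new Get(Bytes.toBytes("cust-00042"));                 // random read by rowkey
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
            System.out.println("Name: " + (name == null ? "<not found>" : Bytes.toString(name)));
        }
    }
}
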
7. RDBMS (B+ Tree)
• An RDBMS uses a B+ tree to organize its indexes, as shown in the figure.
• These B+ trees are often 3-level n-way balanced trees. The nodes of a B+ tree are
blocks on disk, so an update in an RDBMS typically needs five disk operations
(three to walk the B+ tree and find the block of the target row, one to read the
target block, and one to write the updated data).
• In an RDBMS, data is written randomly as a heap file on disk, but random data
blocks decrease read performance.
That is why we need the B+ tree index. The B+ tree fits data reads well but is not
efficient for data updates. For large distributed data sets, the B+ tree is no match
for the LSM-tree (used in HBase).
9. HBase (LSM Tree)
LSM-trees can be viewed as n-level merge trees. They transform random writes into
sequential writes using a log file and an in-memory store.
Data Write (insert, update): Data is first written sequentially to the log file, then to the
in-memory store, where it is organized as a sorted tree, similar to a B+ tree. When the
in-memory store fills up, the tree in memory is flushed to a store file on disk. The store
files on disk are arranged like B+ trees, but they are optimized for sequential disk access.
Data Read: The in-memory store is searched first, then the store files on disk.
Data Delete: The data record is given a "delete marker"; the system does housekeeping
work in the background by merging store files into larger ones to reduce disk seeks. A
marked record is deleted permanently during this housekeeping.
LSM-tree updates are applied in memory rather than through random disk access, so they
are faster than B+ tree updates. When reads mostly target recently written data, LSM-trees
reduce disk seeks and improve performance. When disk I/O is the dominant cost, the
LSM-tree is more suitable than the B+ tree. A toy sketch of this write path follows.
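
To make the write path concrete, here is a toy, self-contained Java sketch of the LSM idea: writes are appended to a log sequentially, updates land in a sorted in-memory tree, and the tree is flushed to an immutable store file when it fills up. This only illustrates the concept, not HBase's actual implementation; the class and file names are made up.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM-style store: sequential log append + sorted in-memory store + flush to store files.
public class ToyLsmStore {
    private static final int FLUSH_THRESHOLD = 4;                     // tiny limit, for illustration
    private final TreeMap<String, String> memStore = new TreeMap<>(); // sorted in-memory tree
    private int flushCount = 0;

    public void put(String key, String value) throws IOException {
        appendToLog(key, value);                 // 1. sequential write to the log file
        memStore.put(key, value);                // 2. update the sorted in-memory store
        if (memStore.size() >= FLUSH_THRESHOLD) {
            flush();                             // 3. flush the full tree to a store file on disk
        }
    }

    public String get(String key) {
        // Reads check the in-memory store first; a real system would then
        // search the on-disk store files (omitted here).
        return memStore.get(key);
    }

    private void appendToLog(String key, String value) throws IOException {
        try (FileWriter log = new FileWriter("toy-wal.log", true)) {  // append-only log
            log.write(key + "=" + value + "\n");
        }
    }

    private void flush() throws IOException {
        try (FileWriter storeFile = new FileWriter("storefile-" + (flushCount++) + ".txt")) {
            for (Map.Entry<String, String> e : memStore.entrySet()) { // keys come out sorted
                storeFile.write(e.getKey() + "=" + e.getValue() + "\n");
            }
        }
        memStore.clear();                        // in-memory store starts empty again
    }
}
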
13. HBase Data Model
Tables – HBase tables are more like logical collections of rows stored in separate
partitions called Regions.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique within a table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and the Columns in a family are stored together in a
low-level storage file known as an HFile.
The table above shows the Customer and Sales Column Families. The Customer Column
Family is made up of 2 columns – Name and City, whereas the Sales Column Family is made
up of 2 columns – Product and Amount.
14. HBase Data Model
Columns – A Column Family is made up of one or more columns. A Column is identified by a
Column Qualifier that consists of the Column Family name concatenated with the Column
name using a colon – for example: columnfamily:columnname. There can be multiple Columns
within a Column Family, and Rows within a table can have a varying number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family,
and Column (Column Qualifier). The data stored in a Cell is called its value, and the data
type is always treated as byte[].
Version – The data stored in a cell is versioned, and versions of data are identified by their
timestamp. The number of versions of data retained in a column family is configurable;
the default value is 3.
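
The model above maps directly onto the Java client API. The following minimal sketch (table name, rowkey, and values are hypothetical, and it assumes the HBase 2.x builder API) creates a table with the Customer and Sales column families, keeps up to 3 versions per cell in the Customer family, and writes one row.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("customer_sales");        // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("Customer"))
                    .setMaxVersions(3)                                    // keep up to 3 versions per cell
                    .build())
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("Sales"))
                .build();
            admin.createTable(desc);

            try (Table table = conn.getTable(name)) {
                Put put = new Put(Bytes.toBytes("cust-001"));             // rowkey
                // Cells are addressed as columnfamily:columnname; all values are byte[].
                put.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("John"));
                put.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("City"), Bytes.toBytes("Pune"));
                put.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Product"), Bytes.toBytes("Chairs"));
                put.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Amount"), Bytes.toBytes("500"));
                table.put(put);
            }
        }
    }
}
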
19. Region Server Architecture
It contains several components:
1. One Block Cache, an LRU cache used for data reads.
2. One WAL (Write-Ahead Log): HBase uses a Log-Structured Merge-Tree (LSM tree) to
process data writes. Each data update or delete is written to the WAL first and then to
the MemStore. The WAL is persisted on HDFS.
3. Multiple HRegions: each HRegion is a partition of a table, as described in the Data Model
section above.
4. In an HRegion: multiple HStores, where each HStore corresponds to a Column Family.
5. In an HStore: one MemStore, which holds updates or deletes before they are flushed to
disk, and multiple StoreFiles, each of which corresponds to an HFile.
6. An HFile is immutable, flushed from the MemStore, and persisted on HDFS.
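
The sizes of the Block Cache and the MemStore flush threshold described above are configurable. As a hedged illustration, the sketch below sets the stock HBase property names hfile.block.cache.size and hbase.hregion.memstore.flush.size on a Configuration object; the values are examples only, and in practice these settings live in hbase-site.xml and are picked up by the region servers at startup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionServerTuning {
    public static void main(String[] args) {
        // Illustrative values only; normally set in hbase-site.xml, not in code.
        Configuration conf = HBaseConfiguration.create();
        conf.setFloat("hfile.block.cache.size", 0.4f);                          // fraction of heap for the Block Cache
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);  // flush a MemStore at ~128 MB
        System.out.println("Block cache fraction = " + conf.getFloat("hfile.block.cache.size", 0.25f));
    }
}
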
The classical data pipelines bring in a data feed, then clean and transform it. A common example of such a feed is the logs from Yahoo!'s web servers. These logs undergo a cleaning step in which bots, company-internal views, and clicks are removed. We also apply transformations such as, for each click, finding the page view that preceded that click.
Pig-SQL
Pig Latin is procedural, whereas SQL is declarative.
Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
Pig Latin supports splits in the pipeline.
Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.