HBase
Prepared by Vetri.V
WHAT IS HBASE?
HBase is a database: the Hadoop database. It is indexed by rowkey, column key, and
timestamp.
HBase stores structured and semi-structured data naturally, so you can load it with
tweets, parsed log files, and a catalog of all your products right along with their
customer reviews.
It can store unstructured data too, as long as it's not too large.
HBase is designed to run on a cluster of computers instead of a single computer. The
cluster can be built using commodity hardware; HBase scales horizontally as you
add more machines to the cluster.
Each node in the cluster provides a bit of storage, a bit of cache, and a bit of
computation as well. This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply replace it with another.
This adds up to a powerful, scalable approach to data that, until now, hasn't been
commonly available to mere mortals.
HBASE DATA MODEL:
HBase data model - these six concepts (table, row, column family, column qualifier,
cell, and version) form the foundation of HBase.
Table:
HBase organizes data into tables. Table names are Strings and composed of
characters that are safe for use in a file system path.
Row:
Within a table, data is stored according to its row. Rows are identified uniquely by
their rowkey. Rowkeys don’t have a data type and are always treated as a
byte[].
Column family:
Data within a row is grouped by column family. Column families also impact the
physical arrangement of data stored in HBase.
For this reason, they must be defined up front and aren’t easily modified. Every row
in a table has the same column families, although a row need not store data in all its
families. Column family names are Strings and composed of characters that are safe
for use in a file system path.
Column qualifier:
Data within a column family is addressed via its column qualifier, or column. Column
qualifiers need not be specified in advance, and need not be consistent between
rows.
Like rowkeys, column qualifiers don’t have a data type and are always treated as a
byte[].
Cell:
A combination of rowkey, column family, and column qualifier uniquely identifies a
cell. The data stored in a cell is referred to as that cell’s value. Values
also don’t have a data type and are always treated as a byte[].
Version:
Values within a cell are versioned. Versions are identified by their timestamp, a long.
When a version isn't specified, the current timestamp is used as the
basis for the operation. The number of cell value versions retained by HBase is
configured via the column family. The default number of cell versions is three.
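As a rough mental model (not the real HBase API), the six concepts above can be sketched as a map from full coordinates to values, with a bounded number of versions per cell. `MiniTable` and `MAX_VERSIONS` below are invented names for illustration only:

```python
import time

MAX_VERSIONS = 3  # mirrors HBase's default of three versions per cell

class MiniTable:
    def __init__(self):
        # (rowkey, family, qualifier) -> {timestamp: value}
        self.cells = {}

    def put(self, row, family, qualifier, value, ts=None):
        # when no version is specified, the current timestamp is used
        ts = ts if ts is not None else int(time.time() * 1000)
        versions = self.cells.setdefault((row, family, qualifier), {})
        versions[ts] = value
        # retain only the newest MAX_VERSIONS timestamps
        for old in sorted(versions)[:-MAX_VERSIONS]:
            del versions[old]

    def get(self, row, family, qualifier):
        # with no version given, the newest timestamp wins
        versions = self.cells.get((row, family, qualifier), {})
        return versions[max(versions)] if versions else None

t = MiniTable()
t.put(b"first", b"cf", b"message", b"hello", ts=1)
t.put(b"first", b"cf", b"message", b"hello again", ts=2)
print(t.get(b"first", b"cf", b"message"))  # b'hello again'
```

Note that rowkeys, qualifiers, and values are all bytes, matching the "everything is a byte[]" rule above.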
HBase Architecture
HBase Tables and Regions
A table is made up of any number of regions.
A region is specified by its startKey and endKey.
Empty table: (Table, NULL, NULL)
Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”,
NULL)
Each region may live on a different node and is made up of several HDFS files and blocks,
each of which is replicated by Hadoop
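A minimal sketch of how a rowkey maps to a region, assuming the half-open (Table, startKey, endKey) convention from the example above, with None standing in for NULL (the names `regions` and `find_region` are invented for illustration):

```python
# A region owns rowkeys in [start_key, end_key); None means unbounded,
# mirroring the two-region example (Table, NULL, "com.ABC.www") and
# (Table, "com.ABC.www", NULL).
regions = [
    ("usertable", None, b"com.ABC.www"),   # first region
    ("usertable", b"com.ABC.www", None),   # second region
]

def find_region(regions, rowkey):
    for table, start, end in regions:
        after_start = start is None or rowkey >= start
        before_end = end is None or rowkey < end
        if after_start and before_end:
            return (table, start, end)
    return None

print(find_region(regions, b"com.ABB.zzz"))  # first region (key sorts before the split)
print(find_region(regions, b"com.XYZ.www"))  # second region
```

Because rowkeys are compared as bytes in lexicographic order, a key equal to the split point lands in the second region (the start key is inclusive, the end key exclusive).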
HBase Tables:-
Tables are sorted by Row in lexicographical order
Table schema only defines its column families
Each family consists of any number of columns
Each column consists of any number of versions
Columns only exist when inserted, NULLs are free
Columns within a family are sorted and stored together
Everything except table names is byte[]
HBase table format: (Row, Family:Column, Timestamp) -> Value
HBase uses HDFS as its reliable storage layer. It handles checksums, replication, and failover.
HBase consists of:
Java API, Gateway for REST, Thrift, Avro
Master manages the cluster
RegionServers manage data
ZooKeeper acts as the "neural network" and coordinates the cluster
Data is stored in memory and flushed to disk on regular intervals or based on size
Small flushes are merged in the background to keep number of files small
Reads read memory stores first and then disk based files second
Deletes are handled with “tombstone” markers
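The read path and tombstone behavior described above can be sketched as follows; this is a hypothetical simplification (a real RegionServer merges sorted files), with `memstore`, `store_files`, and `TOMBSTONE` as invented names:

```python
# Reads check the in-memory store first, then disk-based store files
# from newest to oldest; a tombstone marker masks older values of a key.
TOMBSTONE = object()

memstore = {b"row1": b"newest"}
store_files = [                       # newest file first
    {b"row2": TOMBSTONE},             # row2 was deleted
    {b"row2": b"old", b"row3": b"v"},
]

def read(key):
    for layer in [memstore] + store_files:
        if key in layer:
            value = layer[key]
            # a tombstone means "deleted": stop looking at older files
            return None if value is TOMBSTONE else value
    return None

print(read(b"row1"))  # b'newest' from the memstore
print(read(b"row2"))  # None: masked by a tombstone
print(read(b"row3"))  # b'v' from an older store file
```

This is also why the old value for row2 survives on disk until a major compaction physically drops it.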
MemStores:-
After data is written to the WAL, the RegionServer saves KeyValues in the memory store
Flush to disk is based on size, set by hbase.hregion.memstore.flush.size
Default size is 64MB
Uses a snapshot mechanism to write the flush to disk while still serving from it and
accepting new data at the same time
Compactions:-
Two types: Minor and Major Compactions
Minor Compactions
Combine last “few” flushes
Triggered by number of storage files
Major Compactions
Rewrite all storage files
Drop deleted data and those values exceeding TTL and/or number of versions
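What a major compaction keeps can be sketched for a single cell as below; `major_compact` and its parameters are invented names for illustration, not HBase internals:

```python
# A major compaction merges all versions of a cell and drops:
# values past the family TTL, values covered by a delete marker,
# and versions beyond the configured maximum.
def major_compact(versions, now_ms, ttl_ms, max_versions, deleted_before=None):
    """versions: {timestamp_ms: value}; returns the surviving versions."""
    kept = {
        ts: v for ts, v in versions.items()
        if now_ms - ts <= ttl_ms                             # TTL not exceeded
        and (deleted_before is None or ts > deleted_before)  # not tombstoned
    }
    # retain only the newest max_versions timestamps
    return {ts: kept[ts] for ts in sorted(kept)[-max_versions:]}

now = 1_000_000
versions = {now - 50: b"a", now - 40: b"b", now - 30: b"c", now - 20: b"d"}
# with a 45ms TTL, the oldest value is dropped; three versions survive
print(major_compact(versions, now, ttl_ms=45, max_versions=3))
```

A minor compaction, by contrast, only merges the last few flush files and does not apply these drop rules.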
Key Cardinality:-
The best performance is gained from using row keys
Time range bound reads can skip store files
So can Bloom Filters
Selecting column families reduces the amount of data to be scanned
Fold, Store, and Shift:-
All values are stored with the full coordinates, including: Row Key, Column Family, Column
Qualifier, and Timestamp
Folds columns into “row per column”
NULLs are cost free as nothing is stored
Versions are multiple “rows” in folded table
DDI:-
Stands for Denormalization, Duplication and Intelligent Keys
Block Cache
Region Splits
HBase Shell and Commands
HBase Install
$ mkdir hbase-install
$ cd hbase-install
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
$ tar xvfz hbase-0.92.1.tar.gz
$ $HBASE_HOME/bin/start-hbase.sh
Configuration changes in HBase
Go to hbase-env.sh
Edit JAVA_HOME
Next go to hbase-site.xml and edit the following:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://eattributes:54310/hbase</value>
<description>The directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
</description>
</property>
<!--
<property>
<name>hbase.master</name>
<value>master:60000</value>
<description>The host and port that the HBase master runs at.
</description>
</property>
-->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in: true for a
fully-distributed cluster.
</description>
</property>
</configuration>
Starting the HBase shell:
$ hbase shell
hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds
General HBase shell commands:
Show cluster status. Can be 'summary', 'simple', or 'detailed'. The
default is 'summary'.
hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'
hbase> version
hbase> whoami
Tables Management commands:
Create a table
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
WRITING DATA
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
READING DATA
hbase(main):007:0> get 'mytable', 'first'
hbase(main):008:0> scan 'mytable'
Describe a table:
hbase(main):003:0> describe 'users'
DESCRIPTION                                                  ENABLED
{NAME => 'users', FAMILIES => [{NAME => 'info',              true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds
Disable:
hbase> disable 'users'
disable_all:
Disable all tables matching the given regex
hbase> disable_all 'users.*'
is_disabled:
Verifies whether the named table is disabled
hbase> is_disabled 'users'
drop:
Drop the named table. The table must first be disabled
hbase> drop 'users'
drop_all:
Drop all tables matching the given regex
hbase> drop_all 'users.*'
enable:
hbase> enable 'users'
enable_all:
hbase> enable_all 'users.*'
is_enabled:
hbase> is_enabled 'users'
exists:
hbase> exists 'users'
list:
hbase> list
hbase> list 'abc.*'
show_filters:
Show all the filters in HBase.
count:
Count the number of rows in a table. The return value is the number of rows.
This operation may take a LONG time (run '$HADOOP_HOME/bin/hadoop jar
hbase.jar rowcount' to run a counting mapreduce job).
The current count is shown every 1000 rows by default. The count interval may be
optionally specified. Scan caching is enabled on count scans by default. The default cache
size is 10 rows.
If your rows are small in size, you may want to increase this
parameter. Examples:
hbase> count 'users'
hbase> count 'users', INTERVAL => 100000
hbase> count 'users', CACHE => 1000
hbase> count 'users', INTERVAL => 10, CACHE => 1000
put:
hbase> put 'users', 'r1', 'c1', 'value', ts1
Configurable block size
hbase(main):002:0> create 'mytable',{NAME => 'colfam1', BLOCKSIZE => '65536'}
Block cache:
Some workloads don't benefit from putting data into a read cache: for instance, if a
certain table or column family in a table is only accessed for sequential scans or
isn't accessed a lot and you don't care if Gets or Scans take a little longer.
By default, the block cache is enabled. You can disable it at the time of table
creation or by altering the table:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE => 'false'}
Aggressive caching:
You can choose some column families to have a higher priority in the block
cache (LRU cache).
This comes in handy if you expect more random reads on one column
family compared to another. This configuration is also done at
table-instantiation time:
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', IN_MEMORY => 'true'}
The default value for the IN_MEMORY parameter is false.
Bloom filters:
hbase(main):007:0> create 'mytable',{NAME => 'colfam1', BLOOMFILTER =>
'ROWCOL'}
The default value for the BLOOMFILTER parameter is NONE.
A row-level bloom filter is enabled with ROW, and a qualifier-level bloom filter is
enabled with ROWCOL.
The row-level bloom filter checks for the non-existence of the particular rowkey in
the block, and the qualifier-level bloom filter checks for the non-existence of the row
and column qualifier combination.
The overhead of the ROWCOL bloom filter is higher than that of the ROW bloom
filter.
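The idea behind a bloom filter can be sketched as below: a compact bit set that can say "definitely not in this store file" with no false negatives, at the cost of occasional false positives. This is a conceptual toy (the `BloomFilter` class, bit count, and hash scheme are invented for illustration), not HBase's implementation:

```python
import hashlib

BITS = 1024  # size of the bit array (arbitrary for this sketch)

def _positions(key, k=3):
    # derive k bit positions from a cryptographic hash of the key
    digest = hashlib.sha256(key).digest()
    return [int.from_bytes(digest[i * 4:i * 4 + 4], "big") % BITS for i in range(k)]

class BloomFilter:
    def __init__(self):
        self.bits = [False] * BITS

    def add(self, key):
        for p in _positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means the key is definitely absent; True means "maybe"
        return all(self.bits[p] for p in _positions(key))

bf = BloomFilter()
bf.add(b"row-0001")
print(bf.might_contain(b"row-0001"))  # True: added keys always match
```

This is why a Get can skip a whole store file when the filter answers False, and why ROWCOL (which indexes row+qualifier combinations) costs more space than ROW.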
TTL (Time To Live):
You can set the TTL while creating the table like this:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', TTL => '18000'}
This command sets the TTL on the column family colfam1 as 18,000 seconds = 5
hours. Data in colfam1 that is older than 5 hours is deleted during the next major
compaction.
Compression:
You can enable compression on a column family when creating tables like this:
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', COMPRESSION => 'SNAPPY'}
Note that data is compressed only on disk. It's kept uncompressed in memory
(MemStore or block cache) and while transferring over the network.
Cell versioning:
Versions are also configurable at a column family level and can be specified at
the time of table instantiation:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 5,
MIN_VERSIONS => '1'}
Description of a table: