2. RDBMS Scaling
• Cannot scale for large distributed data sets
• Vendors offer replication and partitioning solutions to
grow the database beyond the confines of a single node,
but these are generally complicated to install and maintain
• Such techniques compromise RDBMS features such as
– Joins, Complex queries, Views, Triggers and foreign key
constraints
– These queries become expensive
3. Why BigTable?
• Performance of RDBMS system is good for transaction
processing but for very large scale analytic processing, the
solutions are expensive, and specialized.
• Very large scale analytic processing
– Big queries – typically range or table scans.
– Big databases (100s of TB)
• MapReduce on Bigtable, optionally with Cascading on top to
support some relational algebra, may be a cost-effective
solution.
• Sharding (Shared nothing horizontal partitioning) is not a
solution to scale open source RDBMS platforms
• Application specific
• Labor-intensive (re)partitioning
4. Key concept
HBase is a distributed column-oriented database built
on top of HDFS.
• At its core, HBase / BigTable is a map.
• It is a persistent storage.
• HBase and BigTable are built upon distributed file-systems.
• Unlike most map implementations, in
HBase/BigTable the key/value pairs are kept in strict
alphabetical (byte-lexicographic) order.
• Multidimensional map.
• Sparse.
5. Map
• A map is "an abstract data type composed of a
collection of keys and a collection of values, where
each key is associated with one value."
{
"Name" : "Subhas",
"Mail" : "subhas.ghosh@siemens.com",
"Location" : "9F-TA-WS-21",
"Phone" : "+918025113529",
"Sal" : ************
}
In this example "Name" is a key, and "Subhas" is the
corresponding value.
6. Persistent
• Persistence merely means that the data you put
in this special map "persists" after the program
that created or accessed it is finished.
• This is no different in concept than any other
kind of persistent storage such as a file on a file-system.
• Each value can be versioned in HBase
7. Distributed
• Built upon distributed file-systems
– file storage can be spread out among an array of
independent machines.
– HBase sits atop either Hadoop's Distributed File System
(HDFS) or Amazon's Simple Storage Service (S3),
– BigTable makes use of the Google File System (GFS).
• Data is replicated across a number of participating
nodes in an analogous manner to how data is striped
across discs in a RAID system.
8. Sorted
Continuing our example, the sorted version looks like this:
{
"Location" : "9F-TA-WS-21",
"Mail" : "subhas.ghosh@siemens.com",
"Name" : "Subhas",
"Phone" : "+918025113529",
"Sal" : ************
}
Sorting can ensure that items of greatest interest to you are
near each other
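The sorted-map idea above can be sketched in plain Java. This is only an illustration: java.util.TreeMap stands in for HBase's sorted key space, String keys stand in for byte arrays, and the class name SortedMapSketch is hypothetical. The "Sal" entry is left out because its value is redacted on the slide.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class SortedMapSketch {
    // Builds the example record. TreeMap keeps keys sorted
    // regardless of the order in which they are inserted.
    static SortedMap<String, String> record() {
        SortedMap<String, String> m = new TreeMap<>();
        m.put("Name", "Subhas");
        m.put("Mail", "subhas.ghosh@siemens.com");
        m.put("Location", "9F-TA-WS-21");
        m.put("Phone", "+918025113529");
        return m;
    }

    public static void main(String[] args) {
        // Iteration follows key order: Location, Mail, Name, Phone.
        record().forEach((k, v) -> System.out.println(k + " : " + v));
    }
}
```

Because neighbouring keys are stored together, a scan over a key prefix touches one contiguous range instead of the whole map.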
9. Multidimensional
A map of maps
{
"Location" :
{
"FL" : "9F",
"TOWER" : "A",
"WS" : "21"
},
"Mail" : "subhas@xyz.com",
"Name" :
{
"FIRST": "Subhas",
"MID" : "Kumar",
"LAST" : "Ghosh"
},
"Phone" : "+918025113529",
"Sal" : ************
}
Each key points to a map with one or more keys, e.g. "FL", "TOWER", "WS".
A top-level key/map pair is a "row".
Also, in BigTable/HBase nomenclature, the "FL" and "TOWER" mappings would
be called "Column Families".
10. Multidimensional
• A table's column families are specified when
the table is created, and are difficult or
impossible to modify later.
• It can also be expensive to add new column
families, so it's a good idea to specify all the
ones you'll need up front.
• Fortunately, a column family may have any
number of columns, denoted by a column
"qualifier" or "label".
11. Multidimensional
…
"aaaaa" : {
"A" : {
"foo" : "y",
"bar" : "d"
},
"B" : {
"" : "w" }
},
"aaaab" : {
"A" : {
"foo" : "world",
"bar" : "domination"
},
"B" : {
"" : "ocean" }
},
…
The "A" column family has two columns: "foo" and "bar".
The "B" column family has just one column, whose qualifier
is the empty string ("").
When asking HBase/BigTable for data, provide the full column name
in the form "<family>:<qualifier>",
e.g. "A:foo", "A:bar" and "B:".
12. Multidimensional
• Labeled tables of rows X columns X timestamp
– Cells addressed by row/column/timestamp
– As a (perverse) Java declaration:
SortedMap<byte[], SortedMap<byte[],
List<Cell>>> hbase = new TreeMap<ditto>(new RawByteComparator());
• Row keys are uninterpreted byte arrays, e.g. a URL
– Rows are ordered by a Comparator (default: byte order)
– Row updates are atomic, even across hundreds of columns
• Columns grouped into column-families
– Columns have column-family prefix and then qualifier
• E.g. webpage:mimetype, webpage:language
– Column-family names are 'printable'; qualifiers are arbitrary bytes
– Column families are part of the table schema; qualifiers are not
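A minimal sketch of the row/column/timestamp addressing described above, written with nested TreeMaps. It is an illustration under assumptions, not HBase's implementation: String keys replace byte arrays, the column is passed in "family:qualifier" form, and the class and method names (CellMapSketch, getLatest, getAsOf) are hypothetical.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CellMapSketch {
    // row key -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    private NavigableMap<Long, String> versions(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    // Newest version of a cell, or null if the cell does not exist.
    String getLatest(String row, String column) {
        NavigableMap<Long, String> v = versions(row, column);
        return v == null ? null : v.firstEntry().getValue();
    }

    // Newest version written at or before the given timestamp, or null.
    String getAsOf(String row, String column, long timestamp) {
        NavigableMap<Long, String> v = versions(row, column);
        if (v == null) return null;
        // With a reversed comparator, ceilingEntry finds the largest timestamp <= the argument.
        Map.Entry<Long, String> e = v.ceilingEntry(timestamp);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        CellMapSketch table = new CellMapSketch();
        table.put("aaaaa", "A:foo", 1L, "x"); // timestamps are made up for the example
        table.put("aaaaa", "A:foo", 2L, "y");
        System.out.println(table.getLatest("aaaaa", "A:foo"));
    }
}
```

Keeping timestamps in descending order means the most recent version is always the first entry, which is the common read.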
13. Multidimensional
• A cell is an uninterpreted byte array and a
timestamp
– E.g. webpage content
• Tables partitioned into Regions
– Region defined by start & end row
– Regions are the 'atoms' of distribution
deployed around the cluster.
– start < end - in lexicographic sense
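Since regions are contiguous row ranges, finding the region for a row reduces to finding the greatest start row that is not larger than the row key. The sketch below shows that lookup with a TreeMap. The region names, split points, and the class RegionLookupSketch are hypothetical, and String keys stand in for byte arrays.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionLookupSketch {
    // start row (inclusive) -> region name; the first region starts at the empty key.
    private final TreeMap<String, String> regionsByStartRow = new TreeMap<>();

    void addRegion(String startRow, String regionName) {
        regionsByStartRow.put(startRow, regionName);
    }

    // The region holding a row is the one with the greatest start row <= the row key.
    String regionFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartRow.floorEntry(rowKey);
        if (e == null) throw new IllegalStateException("no region covers " + rowKey);
        return e.getValue();
    }

    static RegionLookupSketch example() {
        RegionLookupSketch t = new RegionLookupSketch();
        t.addRegion("", "region-1");  // rows before "g"
        t.addRegion("g", "region-2"); // rows from "g" up to, but not including, "p"
        t.addRegion("p", "region-3"); // rows from "p" onward
        return t;
    }

    public static void main(String[] args) {
        System.out.println(example().regionFor("hbase"));
    }
}
```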
16. What HBase Is Not
• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available columns,
perhaps by using a wildcard.
• There are three types of lookups:
– Fast lookup using row key and optional timestamp.
– Full table scan
– Range scan from region start to end.
• Limited atomicity and transaction support.
– HBase supports multiple batched mutations of single rows only.
– Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
– Programmatic access via Java, REST, or Thrift APIs.
– Scripting via JRuby.
– No JOIN, No sophisticated query engine, No column typing, no
ODBC/JDBC, No Crystal Reports, No transactions, No secondary indices
17. Map-Reduce With HBase
• When we use a map-reduce framework with HBase
table, a map function is executed for each region
independently in parallel.
• Within each map, the query is answered by scanning the
rows in order, from the lowest row key to the
highest.
• Optionally, certain rows and columns (column families)
can be filtered out for better performance.
19. Elements
– Table : a list of tuples sorted by row key ascending, column
name ascending and timestamp descending.
– Regions: A Table is broken up into row ranges called regions.
Each row range contains rows from start-key to end-key. (A set
of regions, sorted appropriately, forms an entire table.)
– HStore: Each column family in a region is managed by an
HStore.
– HFile: Each HStore may have one or more HFiles (a Hadoop
HDFS file type).
20. Components
• Master
o Responsible for monitoring region servers
o Load balancing for regions
o Redirect client to correct region servers
o The current SPOF (single point of failure)
• Regionserver slaves
o Serving requests(Write/Read/Scan) of Client
o Send HeartBeat to Master
o Throughput and the number of regions scale with the number of
region servers
21. Components
• ZooKeeper
– centralized service for maintaining
• configuration information,
• naming,
• providing distributed synchronization, and
• providing group services.
– ZooKeeper allows distributed processes to coordinate with each other
through a shared hierarchical namespace
• organized similarly to a standard file system.
• The namespace consists of data registers, called znodes in ZooKeeper
parlance, and these are similar to files and directories.
• Unlike a typical file system, which is designed for storage, ZooKeeper
data is kept in memory, which means ZooKeeper can achieve high
throughput and low latency numbers.
23. Distributed Coordination
• The replicated database is in-memory.
• Updates are logged to disk for recoverability.
• Writes are serialized to disk before they are applied to the in-memory
database.
• Clients connect to exactly one server to submit requests.
• Read requests are serviced from the local replica of each server database.
• Requests that change the state of the service, write requests, are
processed by an agreement protocol.
24. Distributed Coordination
• As part of the agreement protocol all write requests from
clients are forwarded to a single server, called the leader.
• The rest of the ZooKeeper servers, called followers, receive
message proposals from the leader and agree upon message
delivery.
• The messaging layer takes care of replacing leaders on failures
and syncing followers with leaders.
• ZooKeeper uses a custom atomic messaging protocol.
– ZooKeeper can guarantee that the local replicas never diverge.
– When the leader receives a write request, it calculates what the state of
the system is when the write is to be applied and transforms this into a
transaction that captures this new state.
26. The general protocol flow
1. The client contacts ZooKeeper to find out where to put the data.
2. For this purpose, HBase maintains two catalog tables: the -ROOT- table and the
.META. table.
3. First, the location of the .META. table is looked up in the -ROOT-
table.
4. Then the server hosting the assigned region of the user table is looked up
in the .META. table.
5. The client caches this information and contacts the HRegionServer.
6. Next, the HRegionServer creates an HRegion object corresponding to the
opened region.
1. When the HRegion is "opened" it sets up an HStore instance for each
HColumnFamily of the table, as defined by the user beforehand.
2. Each Store instance has one or more StoreFile instances.
3. StoreFiles are lightweight wrappers around the actual storage file, called HFile.
27. Where is my data?
[Figure: Client → ZooKeeper → -ROOT- (one row per .META. region) → .META. (one row per table region) → MyTable → MyRow]
28. The general protocol flow
7. The client issues a HTable.put(Put) request to the HRegionServer which hands
the details to the matching HRegion instance.
8. The first step is to decide whether the data should first be written to the write-ahead
log (WAL), represented by the HLog class. The WAL is a standard Hadoop
SequenceFile and it stores HLogKey instances.
9. These keys contain a sequence number as well as the actual data and are used
to replay not-yet-persisted data after a server crash.
10. Once the data is written (or not) to the WAL it is placed in the MemStore. At the
same time it is checked if the MemStore is full and in that case a flush to disk is
requested.
11. The store files created on disk are immutable. Sometimes the store files are
merged together; this is done by a process called compaction. This buffer-flush-merge
strategy is a common pattern described in Log-Structured Merge-Tree.
12. After a compaction, if a newly written store file size is greater than the size
specified in hbase.hregion.max.filesize (default 256 MB), the region is split into
two new regions.
[Figure: timeline of flushes and compactions: Flush, Flush, Flush, Compact, Flush, Flush, Compact, Flush, Flush, Flush, Compact]
29. Log Structured Merge Trees
• Random IO for writes is bad in HDFS.
• LSM Trees convert random writes to sequential writes.
• Writes go to a commit log and in-memory storage
(MemStore)
• The MemStore is occasionally flushed to disk
(StoreFile)
• The disk stores are periodically compacted to HFile (on
HDFS)
• Use Bloom Filters with merge.
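The buffer-flush-merge pattern listed above can be sketched in a few dozen lines of plain Java. This is a toy model under stated assumptions, not HBase code: the commit log is an in-memory list instead of a file on HDFS, store files are immutable in-memory sorted maps, the flush threshold counts entries rather than bytes, and LsmSketch and its methods are hypothetical names. Bloom filters are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class LsmSketch {
    private final List<String> commitLog = new ArrayList<>();                       // append-only, sequential
    private TreeMap<String, String> memStore = new TreeMap<>();                     // sorted in-memory buffer
    private final Deque<SortedMap<String, String>> storeFiles = new ArrayDeque<>(); // newest first
    private final int flushThreshold;

    LsmSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String key, String value) {
        commitLog.add(key + "=" + value); // log first, so the edit could be replayed
        memStore.put(key, value);
        if (memStore.size() >= flushThreshold) flush();
    }

    // Writes the buffer out as one immutable, sorted store file.
    void flush() {
        if (memStore.isEmpty()) return;
        storeFiles.addFirst(Collections.unmodifiableSortedMap(memStore));
        memStore = new TreeMap<>();
    }

    // Reads check the buffer first, then store files from newest to oldest.
    String get(String key) {
        if (memStore.containsKey(key)) return memStore.get(key);
        for (SortedMap<String, String> file : storeFiles) {
            if (file.containsKey(key)) return file.get(key);
        }
        return null;
    }

    // Merges all store files into one; newer values overwrite older ones.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<SortedMap<String, String>> oldestFirst = storeFiles.descendingIterator();
        while (oldestFirst.hasNext()) merged.putAll(oldestFirst.next());
        storeFiles.clear();
        if (!merged.isEmpty()) storeFiles.addFirst(Collections.unmodifiableSortedMap(merged));
    }

    int storeFileCount() { return storeFiles.size(); }

    public static void main(String[] args) {
        LsmSketch store = new LsmSketch(2);
        store.put("a", "1");
        store.put("b", "2"); // triggers a flush
        store.put("a", "3");
        System.out.println(store.get("a") + " in " + store.storeFileCount() + " store file(s)");
    }
}
```

Every write is an append or an in-memory insert, and every file is written once in sorted order, which is how random writes turn into sequential ones.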
30. Buffer-Flush-Compact (minor)
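[Figure: a Region with its MemStore (buffer), its HLog (append-only WAL on HDFS, a sequence file, labelled "one per region" on the slide), and StoreFiles (each an HFile on HDFS); arrows labelled Buffer, Read, Flush, and Compact, where Compact merges several HFiles into one]
HFile: immutable sorted map (byte[] → byte[]),
i.e. (row, column, timestamp) → cell value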
31. Compaction
• Major compaction:
– The most important difference between minor and major compactions is
that major compactions process delete markers, max versions, etc.,
while minor compactions don't.
– This is because delete markers might also affect data in the non-merged
files, so it is only possible to do this when merging all files.
• When a delete is performed in HBase table, nothing gets
deleted immediately, rather a delete marker (a.k.a. tombstone)
is written.
– This is because HBase does not modify files once they are written.
– The deletes are processed during the major compaction process; at
which point the data they hide and the delete marker itself will not be
present in the merged file.
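The reason only a major compaction may drop delete markers can be shown with a small merge sketch. It assumes store files are modelled as sorted maps, uses a sentinel string as the tombstone, and ignores timestamps and max versions; CompactionSketch, TOMBSTONE, and merge are hypothetical names.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class CompactionSketch {
    // Sentinel standing in for a delete marker (hypothetical representation).
    static final String TOMBSTONE = "<deleted>";

    // Merges store files given oldest to newest; newer entries overwrite older ones.
    // Only a major compaction (all files merged) may drop tombstones, because a
    // tombstone kept out of a partial merge could still hide data in other files.
    static SortedMap<String, String> merge(List<SortedMap<String, String>> files, boolean major) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> file : files) {
            merged.putAll(file);
        }
        if (major) {
            merged.values().removeIf(TOMBSTONE::equals);
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, String> older = new TreeMap<>();
        older.put("row1", "v1");
        older.put("row2", "v2");
        SortedMap<String, String> newer = new TreeMap<>();
        newer.put("row1", TOMBSTONE); // delete of row1, written as a marker
        newer.put("row3", "v3");
        System.out.println(merge(List.of(older, newer), true).keySet());
    }
}
```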
33. Java Example
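// Older, string-based client API: fetch one cell by row key and "<family>:<qualifier>" column name.
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, "myTable");
Cell cell = table.get("myRow",
"myColumnFamily:columnQualifier1");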
34. Java Example: A Table Mapper
Scan scan = new Scan();
scan.addColumns(COLUMN_FAMILIY_NAME);
// add more filters to the scan here, e.g. scan.setFilter(...);
TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, Mapper.class,
    ImmutableBytesWritable.class, IntWritable.class, job);

// Mapper is the job's own TableMapper subclass, not org.apache.hadoop.mapreduce.Mapper.
static class Mapper extends TableMapper<ImmutableBytesWritable, IntWritable>
{
  @Override
  public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException
  {
    ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get());
    for (KeyValue value : values.list())
    {
      ByteBuffer b = ByteBuffer.wrap(value.getValue());
      String column = Bytes.toString(value.getColumn());
      IntWritable res = new IntWritable();
      // compute something from b and column and store it in res
      try { context.write(userKey, res); }
      catch (InterruptedException e) { throw new IOException(e); }
    }
  }
}
KeyValue in the HFile is a low-level byte array that allows for "zero-copy" access to the data,
even with lazy or custom parsing if necessary.
37. InputFormat
• InputFormat class is responsible for the actual splitting of the input data as
well as returning a RecordReader instance that defines the classes of
the key and value objects as well as providing a next() method that is used to
iterate over each input record.
• In HBase the implementation is TableInputFormatBase, together with its
subclass TableInputFormat.
• TableInputFormat is a light-weight concrete version.
• You can provide the name of the table to scan and the columns you want to
process during the Map phase.
• It splits the table into proper pieces for you and hands them over to the
subsequent classes.
38. Mapper
• The Mapper class(es) are for the next stage of the MapReduce.
• In this step each record read using the RecordReader is processed using
the map() method.
• HBase provides a TableMap class that is specific to iterating over an HBase table.
• One specific implementation is the IdentityTableMap, which is also a good
example of how to add your own functionality to the supplied classes.
• The TableMap class itself does not implement anything but only adds the
signatures of what the actual key/value pair classes are.
• The IdentityTableMap is simply passing on the records to the next stage of
the processing.
39. Reducer
• The Reduce stage and class layout is very similar to the Mapper
one explained above.
• This time we get the output of a Mapper class and process it
after the data was shuffled and sorted.
40. OutputFormat
• The final stage is the OutputFormat class, whose job is to persist the data in
various locations.
• There are specific implementations that allow output to files or to HBase
tables in case of the TableOutputFormat.
• It uses a RecordWriter to write the data into the specific HBase output
table.
• It is important to note the cardinality as well.
• While there are many Mappers handing records to many Reducers, there is
only one OutputFormat that takes each output record from its Reducer
subsequently.
• It is the final class handling the key/value pairs and writes them to their final
destination, this being a file or a table.
• The name of the output table is specified when the job is created.
41. Map-reduce options with HBase
Input \ Output | Raw data                 | Table-A                  | Table-B
Raw data       | Map + Reduce (Hadoop)    | Map only or Map + Reduce | Map only or Map + Reduce
Table-A        | Map only or Map + Reduce | Map + Reduce             | Map
Table-B        | Map only or Map + Reduce | Map                      | Map + Reduce
Reading and writing into the same table: this can hinder the proper distribution of regions
across the servers (open scanners block region splits), and the scan may or may not see the
new data. Writes must be done in TableReduce.reduce().
Reading from one table and writing to another: updates can be written directly in
TableMap.map().
In the Map + Reduce case, the Map stage completely reads a table and then passes the data on in
intermediate files to the Reduce stage.
The Reducer reads from DFS and writes into the now idle HBase table.
43. Classes
• HBaseAdmin
• HBaseConfiguration
• HTable
• HTableDescriptor
• Put
• Get
• Scanner
• Filters
[Diagram labels: Database Admin, Table, Family, Column Qualifier]
44. Using HBase API
HBaseConfiguration: Adds HBase configuration files to a Configuration
new HBaseConfiguration ( )
new HBaseConfiguration (Configuration c)
<property>
<name>name</name>
<value>value</value>
</property>
HBaseAdmin: new HBaseAdmin( HBaseConfiguration conf )
• Ex:
HBaseAdmin admin = new HBaseAdmin(config);
admin.disableTable("tablename");
45. Using HBase API
HTableDescriptor: HTableDescriptor contains the name of an HTable, and its
column families.
new HTableDescriptor()
new HTableDescriptor(String name)
• Ex: HTableDescriptor htd = new HTableDescriptor(tablename);
htd.addFamily(new HColumnDescriptor("Family"));
HColumnDescriptor: An HColumnDescriptor contains information about a column family
new HColumnDescriptor(String familyname)
• Ex:
HTableDescriptor htd = new HTableDescriptor(tablename);
HColumnDescriptor col = new HColumnDescriptor("content:");
htd.addFamily(col);
46. Using HBase API
HTable: Used for communication with a single HBase table.
new HTable(HBaseConfiguration conf, String tableName)
• Ex:
HTable table = new HTable (conf, Bytes.toBytes ( tablename ));
ResultScanner scanner = table.getScanner ( family );
Put: Used to perform Put operations for a single row.
new Put(byte[] row)
new Put(byte[] row, RowLock rowLock)
• Ex:
HTable table = new HTable (conf, Bytes.toBytes ( tablename ));
Put p = new Put ( brow );
p.add (family, qualifier, value);
table.put ( p );
47. Using HBase API
Get: Used to perform Get operations on a single row.
new Get (byte[] row)
new Get (byte[] row, RowLock rowLock)
• Ex:
HTable table = new HTable(conf, Bytes.toBytes(tablename));
Get g = new Get(Bytes.toBytes(row));
Result: Single row result of a Get or Scan query.
new Result()
• Ex:
HTable table = new HTable(conf, Bytes.toBytes(tablename));
Get g = new Get(Bytes.toBytes(row));
Result rowResult = table.get(g);
byte[] ret = rowResult.getValue(Bytes.toBytes(family + ":" + column));
48. Using HBase API
Scanner
• All operations are identical to Get
– Rather than specifying a single row, an optional startRow and stopRow
may be defined.
• If rows are not specified, the Scanner will iterate over all rows.
– = new Scan ()
– = new Scan (byte[] startRow, byte[] stopRow)
– = new Scan (byte[] startRow, Filter filter)
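The start/stop-row semantics can be illustrated without a cluster. In the sketch below a NavigableMap stands in for a table, and the scan returns rows from startRow (inclusive) up to stopRow (exclusive), which matches how Scan treats its stop row. RangeScanSketch and its scan method are hypothetical names, and String keys replace byte arrays.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

final class RangeScanSketch {
    // Returns rows in [startRow, stopRow); a null bound means "unbounded on that side".
    static SortedMap<String, String> scan(NavigableMap<String, String> table,
                                          String startRow, String stopRow) {
        if (startRow == null && stopRow == null) return table; // full table scan
        if (stopRow == null) return table.tailMap(startRow, true);
        if (startRow == null) return table.headMap(stopRow, false);
        return table.subMap(startRow, true, stopRow, false);
    }

    static NavigableMap<String, String> exampleTable() {
        NavigableMap<String, String> t = new TreeMap<>();
        t.put("user1_t1", "a.html");
        t.put("user2_t2", "b.html");
        t.put("user3_t4", "a.html");
        t.put("user1_t5", "c.html");
        t.put("user1_t6", "b.html");
        return t;
    }

    public static void main(String[] args) {
        // All rows whose key starts with "user1_".
        System.out.println(scan(exampleTable(), "user1_", "user2_").keySet());
    }
}
```

Because rows are sorted, a prefix query becomes one contiguous range, so the row key design decides which queries are cheap.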
49. HBase Shell
• Non-SQL (intentional) “DSL”
• list : List all tables in hbase
• get : Get row or cell contents; pass table name, row, and optionally a
dictionary of column(s), timestamp and versions.
• put : Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.
• create : hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
• scan : Scan a table; pass table name and optionally a dictionary of
scanner specifications.
• delete : Put a delete cell value at specified table/row/column and
optionally timestamp coordinates.
• enable : Enable the named table
• disable : Disable the named table: e.g. "hbase> disable 't1'"
• drop : Drop the named table.
50. HBase non-java access
• Languages talking to the JVM:
– Jython interface to HBase
– Groovy DSL for HBase
– Scala interface to HBase
• Languages with a custom protocol
– REST gateway specification for HBase
– Thrift gateway specification for HBase
51. Example: Frequency Counter
• HBase holds records of web_access_logs: we record each web page access by
a user.
• The schema looks like this:
userID_timestamp => {
details => {
page:
}
}
• We want to count how many times
we have seen each user
row details:page
user1_t1 a.html
user2_t2 b.html
user3_t4 a.html
user1_t5 c.html
user1_t6 b.html
user2_t7 c.html
user4_t8 a.html
user count (frequency)
user1 3
user2 2
user3 1
user4 1
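The counting logic of this example, stripped of HBase and MapReduce, is sketched below on the row keys from the slide. It assumes the user ID is everything before the first underscore in the row key; FreqCounterSketch and countByUser are hypothetical names and not the tutorial's FreqCounter class.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class FreqCounterSketch {
    // Counts how many row keys of the form userID_timestamp belong to each user.
    static SortedMap<String, Integer> countByUser(List<String> rowKeys) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String rowKey : rowKeys) {
            int sep = rowKey.indexOf('_');
            String user = sep < 0 ? rowKey : rowKey.substring(0, sep); // "map" step: emit (user, 1)
            counts.merge(user, 1, Integer::sum);                       // "reduce" step: sum per user
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rowKeys = List.of("user1_t1", "user2_t2", "user3_t4",
                "user1_t5", "user1_t6", "user2_t7", "user4_t8");
        System.out.println(countByUser(rowKeys)); // {user1=3, user2=2, user3=1, user4=1}
    }
}
```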
52. Tutorial
• hbase shell
create 'access_logs', 'details'
create 'summary_user', {NAME=>'details', VERSIONS=>1}
• Add some data using Importer
• scan 'access_logs', {LIMIT => 5}
• Run 'FreqCounter'
• scan 'summary_user', {LIMIT => 5}
• Show output with PrintUserCount
53. coprocessors
• HBase 0.92 release provides coprocessors functionality which includes
– observers (similar to triggers for certain events) and
– endpoints (similar to stored procedures to be invoked from the client)
• Observers can be at the region, master or at the WAL (Write Ahead Log)
level.
• Once a Region Observer has been created, it can be specified in
hbase-default.xml, in which case it applies to all regions and tables, or else the
Region Observer can be specified on a table, in which case it applies only to
that table.
• Arbitrary code can run at each tablet in table server
• High-level call interface for clients
– Calls are addressed to rows or ranges of rows and the coprocessor client library
resolves them to actual locations;
– Calls across multiple rows are automatically split into multiple parallelized RPCs
• Provides a very flexible model for building distributed services
• Automatic scaling, load balancing, request routing for applications
54. Three observer interfaces
• RegionObserver: Provides hooks for data manipulation events, Get, Put,
Delete, Scan, and so on. There is an instance of a RegionObserver
coprocessor for every table region and the scope of the observations they
can make is constrained to that region.
• WALObserver: Provides hooks for write-ahead log (WAL) related operations.
This is a way to observe or intercept WAL writing and reconstruction events.
A WALObserver runs in the context of WAL processing. There is one such
context per region server.
• MasterObserver: Provides hooks for DDL-type operations, i.e., create, delete,
modify table, etc. The MasterObserver runs within the context of the HBase
master.
55. Example
package org.apache.hadoop.hbase.coprocessor;

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;

// Sample access-control coprocessor. It extends BaseRegionObserver
// and intercepts the preXXX() methods to check user privileges for the given table
// and column family.
public class AccessControlCoprocessor extends BaseRegionObserver {
  @Override
  public void preGet(final ObserverContext<RegionCoprocessorEnvironment> c,
      final Get get, final List<KeyValue> result) throws IOException {
    // check permissions; permissionGranted() is a placeholder for the actual check
    if (!permissionGranted()) {
      throw new AccessDeniedException("User is not allowed to access.");
    }
  }
  // override prePut(), preDelete(), etc.
}
56. Avoiding long pause from The Garbage Collector
• Stop-the-world garbage collections are common in HBase,
especially during loading.
• There are two issues to be addressed
– concurrent mark and sweep (CMS) performance, and
– fragmentation of memstore.
• To address the first, start the CMS earlier than default by adding
-XX:CMSInitiatingOccupancyFraction and setting it down from
defaults. Start at 60 or 70 percent (The lower you bring down
the threshold, the more GCing is done, the more CPU used).
• To address the second fragmentation issue, there is an
experimental facility hbase.hregion.memstore.mslab.enabled
(memstore local allocation buffer) to be set to true in
configuration.
57. For loading data Pre-Create Regions
• Tables in HBase are initially created with one region by
default.
• For bulk imports, this means that all clients will write
to the same region until it is large enough to split and
become distributed across the cluster.
• A useful pattern to speed up the bulk import process is
to pre-create empty regions.
• Note that too many regions can actually degrade
performance.
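Pre-creating regions requires choosing split points in advance. The sketch below computes evenly spaced split keys under a strong assumption that is not from the slides: row keys start with two lowercase hex digits ("00" to "ff") and are spread uniformly over that range. PreSplitSketch and splitKeys are hypothetical names. The resulting keys would then be passed to whatever table-creation call the HBase version in use provides.

```java
import java.util.ArrayList;
import java.util.List;

final class PreSplitSketch {
    // Returns regions - 1 split keys that cut the "00".."ff" prefix space into equal parts.
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 1 and 256");
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            keys.add(String.format("%02x", i * 256 / regions));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(splitKeys(4)); // [40, 80, c0]
    }
}
```

If real keys are not uniformly distributed, evenly spaced split points will still leave some regions hot, so the split points should follow the actual key distribution.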
58. Enable Scan Caching
• When HBase is used as an input source for a MapReduce job,
set the scan's caching (Scan.setCaching) to something greater than the default (which is
1).
• Using the default value => map-task will make call back to the
region-server for every record processed.
– Setting this value to 80, for example, will transfer 80 rows at a time to
the client to be processed.
• There is a cost/benefit to have the cache value be large because
it costs more in memory for both client and RegionServer, so
bigger isn't always better.
• It appears from the experimentation that selecting a value
between 50 and 100 gives good performance in our setup.
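The effect of the caching value on the number of calls to the region server is simple arithmetic, sketched below. This is a rough model that counts only the row-fetching calls and ignores opening and closing the scanner; ScanCachingSketch and roundTrips are hypothetical names.

```java
final class ScanCachingSketch {
    // Approximate number of fetch calls needed to read `rows` rows
    // when each call returns up to `caching` rows.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(1_000_000, 1));  // 1000000
        System.out.println(roundTrips(1_000_000, 80)); // 12500
    }
}
```

Going from 1 to 80 cuts the call count by a factor of 80, while going from 80 to 800 saves far fewer calls in absolute terms and costs ten times the buffer memory, which is why moderate values tend to work well.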
59. Right Scan Attribute Selection
• Whenever a Scan is used to process large numbers of
rows (and especially when used as a MapReduce
source), we shall select the right set of attributes.
• If scan.addFamily is called then all of the attributes in
the specified ColumnFamily will be returned to the
client.
• If only a small number of the available attributes are to
be processed, then only those attributes should be
specified in the input scan, because attribute over-selection
carries a non-trivial performance penalty over
large datasets.
60. Optimize handler.count
• Count of RPC Listener instances spun up on
RegionServers. Same property is used by the Master
for count of master handlers.
– Default is 10.
• This setting in essence sets how many requests are
concurrently being processed inside the RegionServer
at any one time.
• If multiple map-reduce jobs are running in the cluster
and there is enough map capacity to handle the jobs
concurrently, then this parameter needs to be tuned.