This is the introductory presentation on HBase given by Hayden Marchant in the monthly Amobee Tech Talk.
In this session, we'll learn about HBase, a NoSQL database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns.
HBase is an open-source, non-relational, distributed, column-oriented database; it scales linearly and is designed to run on commodity hardware. HBase clusters can comprise hundreds or even thousands of nodes, serving extraordinary amounts of information. Tight integration with Hadoop enables powerful analytical processing of data residing in HBase.
2. Agenda
● What is HBase?
● Hadoop Overview
● HBase Architecture 101
● Use Cases
● Usage in Amobee
● Questions
3. Apache HBase
● Open Source
● Sparse, multi-dimensional, sorted map datastore
● Modeled after Google BigTable
● Key Features:
– Distributed Storage across cluster of machines
– Random, online read/write data access
– Schema-less datamodel (NoSQL)
– Self-managed data partitions
4. Apache Hadoop Dependencies
● Hadoop Distributed File System (HDFS)
– Distributed, fault-tolerant, throughput-optimized data storage
– The Google File System, 2003, Ghemawat et al.
● Apache Zookeeper (ZK)
– Distributed, available, reliable coordination system
– The Chubby Lock Service …, 2006, Burrows
– http://research.google.com/archive/chubby.html
● Apache Hadoop MapReduce (MR)
– Distributed, fault-tolerant, batch-oriented data processing
– MapReduce: …, 2004, Dean and Ghemawat
– http://research.google.com/archive/mapreduce.html
5. What is Hadoop?
● Solution for Big Data
– Deals with the complexities of high volume, velocity, and variety of data
● Set of Open Source Projects
● Transforms commodity hardware into a service that:
– Stores petabytes of data reliably
– Allows huge distributed computations
● Key attributes
– Redundant and reliable (no data loss)
– Extremely powerful
– Batch processing centric
– Easy to program
– Runs on commodity hardware
9. What's in a Hadoop machine
● The MapReduce server on a machine is called a TaskTracker
● The HDFS server on a machine is called a DataNode
[Diagram: a single Hadoop machine running a TaskTracker and a DataNode]
10. Hadoop Cluster
● Having multiple machines with Hadoop creates a cluster
[Diagram: three machines, each running a TaskTracker and a DataNode, coordinated by a JobTracker and a NameNode]
11. Glossary
● Region
– A subset of a table's rows, like a range partition
– Automatically partitioned
● RegionServer (slave)
– Serves data (from regions) for reads and writes
● Master
– Responsible for coordination of region servers
– Assigns regions, detects failures of region servers
– Controls admin functions
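The region/rowkey relationship in the glossary can be sketched in a few lines of plain Python. This is a toy illustration of range partitioning only — the split points and server names are invented, and it is not how HBase internally routes requests:

```python
import bisect

# Toy model of range partitioning: each region owns rowkeys from its
# start key (inclusive) up to the next region's start key (exclusive).
# Split points and server names here are made up for illustration.
region_start_keys = ["", "g", "p"]          # region 0 owns ["", "g"), etc.
region_servers = ["rs1", "rs2", "rs3"]      # one region per RegionServer

def find_region(rowkey: str) -> str:
    # bisect_right - 1 finds the last region whose start key <= rowkey
    idx = bisect.bisect_right(region_start_keys, rowkey) - 1
    return region_servers[idx]

print(find_region("apple"))   # "apple" < "g", so the first region serves it
print(find_region("zebra"))   # "zebra" >= "p", so the last region serves it
```

Because regions are ranges over the sorted rowkey space, the master can split a hot region by inserting a new start key, which is what "self-managed data partitions" refers to.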
12. HBase Distribution
● Store and access data across one to thousands of commodity servers (RegionServers)
● Automatic failover based on Apache Zookeeper
● Linear scaling of capacity and IOPS by adding servers
14. Sorted Map Datastore
● Not a relational data store (very light schema)
● Table consists of rows, each of which has a primary key ("rowkey")
● Each row can have any number of columns
– like a Map&lt;byte[],byte[]&gt;
● Rows are stored in sorted order
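The data model above can be sketched as a sorted map of rowkey → columns in plain Python. This is a conceptual illustration only (strings instead of byte[], no versioning) and not the HBase client API:

```python
import bisect

# Toy table: rowkeys kept sorted; each row holds an arbitrary set of columns.
rowkeys = []              # sorted list of rowkeys
rows = {}                 # rowkey -> {column: value}

def put(rowkey, column, value):
    if rowkey not in rows:
        bisect.insort(rowkeys, rowkey)   # keep rows in sorted order
        rows[rowkey] = {}
    rows[rowkey][column] = value         # rows are sparse: any columns at all

def scan(start, stop):
    # Range scan: all rows with start <= rowkey < stop, in sorted order
    lo = bisect.bisect_left(rowkeys, start)
    hi = bisect.bisect_left(rowkeys, stop)
    return [(k, rows[k]) for k in rowkeys[lo:hi]]

put("row3", "cf:a", "x")
put("row1", "cf:b", "y")
put("row2", "cf:a", "z")
print(scan("row1", "row3"))   # -> [('row1', {'cf:b': 'y'}), ('row2', {'cf:a': 'z'})]
```

The sorted order is what makes range scans cheap, and it is why rowkey design (covered in the use-case slides) matters so much.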
17. Column Families
● Different sets of columns may have different properties and access patterns
● Configurable by column family
– Compression (none, gzip, snappy)
– Version retention policies
– Cache priority
● CFs stored separately on disk: access one without wasting IO on the other
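A toy sketch of why separate on-disk storage per column family matters: each family lives in its own store (in HBase, separate files on HDFS), so reading one never touches the other. The family and row names below are invented for illustration:

```python
# Each column family is modeled as its own store; a read-access counter
# stands in for the IO that HBase would perform against that family's files.
stores = {
    "meta":    {"row1": {"meta:name": "alice"}},   # small, hot family
    "history": {"row1": {"history:e1": "login"}},  # large, cold family
}

reads = {"meta": 0, "history": 0}   # count store accesses (stand-in for IO)

def get(rowkey, family):
    reads[family] += 1                        # only the requested family is read
    return stores[family].get(rowkey, {})

get("row1", "meta")
print(reads)   # -> {'meta': 1, 'history': 0}: no IO against "history"
```

This is also why the per-family settings on the slide (compression, versions, cache priority) can differ: each family's files are written and read independently.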
18. Accessing HBase
● Java API (thick client)
● Shell
● REST/HTTP
● MapReduce
● Hive/Pig for analytics
● Various other SQL engines
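Of these access paths, the REST gateway is the easiest to illustrate without a running cluster. A hedged sketch of building a cell-read URL for HBase's REST server (the table, row, and column names are invented; the gateway itself must be started separately, e.g. with `hbase rest start`):

```python
from urllib.parse import quote

def cell_url(host, port, table, rowkey, column):
    # HBase REST exposes a single cell at /<table>/<row>/<family:qualifier>;
    # rowkey and column are URL-encoded since they can be arbitrary bytes.
    return "http://%s:%d/%s/%s/%s" % (
        host, port, quote(table), quote(rowkey), quote(column, safe=":"))

url = cell_url("localhost", 8080, "users", "user42", "profile:name")
print(url)   # -> http://localhost:8080/users/user42/profile:name
```

With a live gateway, an HTTP GET of that URL with `Accept: application/json` would return the cell; note that HBase's REST JSON responses base64-encode keys and values.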
27. SaaS Audit Logging
● Online service requires per-user audit logs
● Row key userid_sessionid_timestamp allows efficient range-scan lookups for per-user history fetch
● Server-side filter allows for efficient queries
● MapReduce for analytic questions about user behaviour.
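The compound rowkey above works because rows are stored in sorted order: all of one user's entries are contiguous, so a history fetch is a prefix scan. A toy sketch (user, session, and timestamp values invented for illustration):

```python
import bisect

# Audit rows keyed userid_sessionid_timestamp sort together per user.
log = sorted([
    "alice_s1_1001", "alice_s1_1002", "alice_s2_1010",
    "bob_s1_1003", "carol_s1_1004",
])

def user_history(userid):
    # Prefix scan: [prefix, prefix + '\xff') covers every key for this user
    prefix = userid + "_"
    lo = bisect.bisect_left(log, prefix)
    hi = bisect.bisect_left(log, prefix + "\xff")
    return log[lo:hi]

print(user_history("alice"))   # all of alice's entries, none of bob's
```

Putting the userid first in the key is the design choice that makes this scan touch only the rows it returns.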
28. OpenTSDB
● Scalable time-series store and metrics collector
● Thousands of machines generating hundreds of operational metrics
● Thousands of writes/second
● Web interface displays graphs per metric for a time period
● Schema:
– Row key: metricid_hourofday
– Col: {timestamp}
– Val: {metric measurement}
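A toy sketch of a write under this schema: one row per metric per hour, with each measurement stored as a column keyed by its timestamp. This is simplified relative to OpenTSDB's real binary rowkey encoding; the metric name and timestamps are invented:

```python
# One row per metric per hour bucket; each measurement becomes a column in
# that row, keyed by its timestamp, so an hour of points lives together.
table = {}   # rowkey -> {column: value}

def record(metric_id, timestamp, value):
    hour = timestamp - (timestamp % 3600)        # start of the hour bucket
    rowkey = "%s_%d" % (metric_id, hour)
    table.setdefault(rowkey, {})[str(timestamp)] = value
    return rowkey

rk = record("cpu.load", 7205, 0.42)   # 7205 falls in the bucket starting 7200
print(rk)            # -> cpu.load_7200
print(table[rk])     # -> {'7205': 0.42}
```

Grouping an hour of measurements into one row keeps the row count bounded and turns "graph this metric for a time period" into a short range scan over a few rows.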
29. Amobee
● User Profile Database
– > 1.3 billion profiles
– 10s of properties for each user
● Batch/Real-time updates of profiles
● Provide real-time access to profiles for ad-targeting
● Central backbone for DMP
30. Use HBase if...
● You need random write, random read, or both
● You need to do many thousands of operations per second on multiple TB of data
● Your access patterns are well-known and simple
31. Don't use HBase if...
● You only append to your dataset and tend to read the whole thing
● You primarily do ad-hoc analytics (i.e., ill-defined access patterns)
● Your data easily fits on one beefy node
32. Where is HBase going?
● Preparation for version 1.0
● Customized balance of Consistency/Availability/Persistence
● Namespaces
● Cell-level Security
● MapReduce on snapshots