This is the introductory presentation on HBase given by Hayden Marchant in the monthly Amobee Tech Talk.
In this session, we'll learn about HBase, a NoSQL database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns.
HBase is an open-source, non-relational, distributed, column-oriented database; it scales linearly and is designed to run on commodity hardware. HBase clusters can comprise hundreds or even thousands of nodes, serving extraordinary amounts of information. Tight integration with Hadoop enables powerful analytical processing of data residing in HBase.
2. Agenda
● What is HBase?
● Hadoop Overview
● HBase Architecture 101
● Use Cases
● Usage in Amobee
● Questions
3. Apache HBase
● Open Source
● Sparse, multi-dimensional, sorted map datastore
● Modeled after Google BigTable
● Key Features:
– Distributed Storage across cluster of machines
– Random, online read/write data access
– Schema-less datamodel (NoSQL)
– Self-managed data partitions
4. Apache Hadoop Dependencies
● Hadoop Distributed File System (HDFS)
– Distributed, fault-tolerant, throughput-optimized data storage
– The Google File System, 2003, Ghemawat et al.
● Apache Zookeeper (ZK)
– Distributed, available, reliable coordination system
– The Chubby Lock Service …, 2006, Burrows
– http://research.google.com/archive/chubby.html
● Apache Hadoop MapReduce (MR)
– Distributed, fault-tolerant, batch-oriented data processing
– MapReduce: …, 2004, Dean and Ghemawat
– http://research.google.com/archive/mapreduce.html
5. What is Hadoop?
● Solution for Big Data
– Deals with the complexities of high volume, velocity, and variety of data
● Set of Open Source Projects
● Transforms commodity hardware into a service that:
– Stores petabytes of data reliably
– Allows huge distributed computations
● Key attributes
– Redundant and reliable (no data loss)
– Extremely powerful
– Batch processing centric
– Easy to program
– Runs on commodity hardware
9. What's in a Hadoop machine
● The MapReduce server on a machine is called a TaskTracker
● The HDFS server on a machine is called a DataNode
[Diagram: a single Hadoop machine running a TaskTracker and a DataNode]
10. Hadoop Cluster
● Having multiple machines with Hadoop creates a cluster
[Diagram: three machines, each running a TaskTracker and a DataNode, coordinated by a JobTracker and a NameNode]
11. Glossary
● Region
– A subset of a table's rows, like a range partition
– Automatically partitioned
● RegionServer (slave)
– Serves data (from regions) for reads and writes
● Master
– Responsible for coordination of region servers
– Assigns regions, detects failures of region servers
– Controls admin functions
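The region/rowkey relationship in the glossary can be sketched in a few lines of plain Python. This is a toy illustration of range partitioning only — the split points and server names are invented, and it is not how HBase internally routes requests:

```python
import bisect

# Toy model of range partitioning: each region owns rowkeys from its
# start key (inclusive) up to the next region's start key (exclusive).
# Split points and server names here are made up for illustration.
region_start_keys = ["", "g", "p"]          # region 0 owns ["", "g"), etc.
region_servers = ["rs1", "rs2", "rs3"]      # one region per RegionServer

def find_region(rowkey: str) -> str:
    # bisect_right - 1 finds the last region whose start key <= rowkey
    idx = bisect.bisect_right(region_start_keys, rowkey) - 1
    return region_servers[idx]

print(find_region("apple"))   # "apple" < "g", so the first region serves it
print(find_region("zebra"))   # "zebra" >= "p", so the last region serves it
```

Because regions are ranges over the sorted rowkey space, the master can split a hot region by inserting a new start key, which is what "self-managed data partitions" refers to.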
12. HBase Distribution
● Store and access data across one to thousands of commodity servers (RegionServers)
● Automatic failover based on Apache Zookeeper
● Linear scaling of capacity and IOPS by adding servers
14. Sorted Map Datastore
● Not a relational data store (very light schema)
● Table consists of rows, each of which has a primary key ("rowkey")
● Each row can have any number of columns
– like a Map&lt;byte[],byte[]&gt;
● Rows are stored in sorted order
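The data model above can be sketched as a sorted map of rowkey → columns in plain Python. This is a conceptual illustration only (strings instead of byte[], no versioning) and not the HBase client API:

```python
import bisect

# Toy table: rowkeys kept sorted; each row holds an arbitrary set of columns.
rowkeys = []              # sorted list of rowkeys
rows = {}                 # rowkey -> {column: value}

def put(rowkey, column, value):
    if rowkey not in rows:
        bisect.insort(rowkeys, rowkey)   # keep rows in sorted order
        rows[rowkey] = {}
    rows[rowkey][column] = value         # rows are sparse: any columns at all

def scan(start, stop):
    # Range scan: all rows with start <= rowkey < stop, in sorted order
    lo = bisect.bisect_left(rowkeys, start)
    hi = bisect.bisect_left(rowkeys, stop)
    return [(k, rows[k]) for k in rowkeys[lo:hi]]

put("row3", "cf:a", "x")
put("row1", "cf:b", "y")
put("row2", "cf:a", "z")
print(scan("row1", "row3"))   # -> [('row1', {'cf:b': 'y'}), ('row2', {'cf:a': 'z'})]
```

The sorted order is what makes range scans cheap, and it is why rowkey design (covered in the use-case slides) matters so much.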
17. Column Families
● Different sets of columns may have different properties and access patterns
● Configurable by column family
– Compression (none, gzip, snappy)
– Version retention policies
– Cache priority
● CFs stored separately on disk: access one without wasting IO on the other
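A toy sketch of why separate on-disk storage per column family matters: each family lives in its own store (in HBase, separate files on HDFS), so reading one never touches the other. The family and row names below are invented for illustration:

```python
# Each column family is modeled as its own store; a read-access counter
# stands in for the IO that HBase would perform against that family's files.
stores = {
    "meta":    {"row1": {"meta:name": "alice"}},   # small, hot family
    "history": {"row1": {"history:e1": "login"}},  # large, cold family
}

reads = {"meta": 0, "history": 0}   # count store accesses (stand-in for IO)

def get(rowkey, family):
    reads[family] += 1                        # only the requested family is read
    return stores[family].get(rowkey, {})

get("row1", "meta")
print(reads)   # -> {'meta': 1, 'history': 0}: no IO against "history"
```

This is also why the per-family settings on the slide (compression, versions, cache priority) can differ: each family's files are written and read independently.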
18. Accessing HBase
● Java API (thick client)
● Shell
● REST/HTTP
● MapReduce
● Hive/Pig for analytics
● Various other SQL engines
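Of these access paths, the REST gateway is the easiest to illustrate without a running cluster. A hedged sketch of building a cell-read URL for HBase's REST server (the table, row, and column names are invented; the gateway itself must be started separately, e.g. with `hbase rest start`):

```python
from urllib.parse import quote

def cell_url(host, port, table, rowkey, column):
    # HBase REST exposes a single cell at /<table>/<row>/<family:qualifier>;
    # rowkey and column are URL-encoded since they can be arbitrary bytes.
    return "http://%s:%d/%s/%s/%s" % (
        host, port, quote(table), quote(rowkey), quote(column, safe=":"))

url = cell_url("localhost", 8080, "users", "user42", "profile:name")
print(url)   # -> http://localhost:8080/users/user42/profile:name
```

With a live gateway, an HTTP GET of that URL with `Accept: application/json` would return the cell; note that HBase's REST JSON responses base64-encode keys and values.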
27. SaaS Audit Logging
● Online service requires per-user audit logs
● Row key userid_sessionid_timestamp allows efficient range-scan lookups for per-user history fetch
● Server-side filter allows for efficient queries
● MapReduce for analytic questions about user behaviour.
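The compound rowkey above works because rows are stored in sorted order: all of one user's entries are contiguous, so a history fetch is a prefix scan. A toy sketch (user, session, and timestamp values invented for illustration):

```python
import bisect

# Audit rows keyed userid_sessionid_timestamp sort together per user.
log = sorted([
    "alice_s1_1001", "alice_s1_1002", "alice_s2_1010",
    "bob_s1_1003", "carol_s1_1004",
])

def user_history(userid):
    # Prefix scan: [prefix, prefix + '\xff') covers every key for this user
    prefix = userid + "_"
    lo = bisect.bisect_left(log, prefix)
    hi = bisect.bisect_left(log, prefix + "\xff")
    return log[lo:hi]

print(user_history("alice"))   # all of alice's entries, none of bob's
```

Putting the userid first in the key is the design choice that makes this scan touch only the rows it returns.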
28. OpenTSDB
● Scalable time-series store and metrics collector
● Thousands of machines generating hundreds of operational metrics
● Thousands of writes/second
● Web interface displays graphs per metric for a time period
● Schema:
– Row key: metricid_hourofday
– Col: {timestamp}
– Val: {metric measurement}
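A toy sketch of a write under this schema: one row per metric per hour, with each measurement stored as a column keyed by its timestamp. This is simplified relative to OpenTSDB's real binary rowkey encoding; the metric name and timestamps are invented:

```python
# One row per metric per hour bucket; each measurement becomes a column in
# that row, keyed by its timestamp, so an hour of points lives together.
table = {}   # rowkey -> {column: value}

def record(metric_id, timestamp, value):
    hour = timestamp - (timestamp % 3600)        # start of the hour bucket
    rowkey = "%s_%d" % (metric_id, hour)
    table.setdefault(rowkey, {})[str(timestamp)] = value
    return rowkey

rk = record("cpu.load", 7205, 0.42)   # 7205 falls in the bucket starting 7200
print(rk)            # -> cpu.load_7200
print(table[rk])     # -> {'7205': 0.42}
```

Grouping an hour of measurements into one row keeps the row count bounded and turns "graph this metric for a time period" into a short range scan over a few rows.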
29. Amobee
● User Profile Database
– > 1.3 billion profiles
– 10s of properties for each user
● Batch/Real-time updates of profiles
● Provide real-time access to profiles for ad-targeting
● Central backbone for DMP
30. Use HBase if...
● You need random write, random read, or both
● You need to do many thousands of operations per second on multiple TB of data
● Your access patterns are well-known and simple
31. Don't use HBase if...
● You only append to your dataset and tend to read the whole thing
● You primarily do ad-hoc analytics (i.e., ill-defined access patterns)
● Your data easily fits on one beefy node
32. Where is HBase going?
● Preparation for version 1.0
● Customized balance of Consistency/Availability/Persistence
● Namespaces
● Cell-level Security
● MapReduce on snapshots