SlideShare a Scribd company logo
1 of 125
Download to read offline
®
© 2014 MapR Technologies 1
®
© 2014 MapR Technologies
Getting Started with HBase Application development
Keys Botzum kbotzum@mapr.com
Sridhar Reddy sreddy@mapr.com
Carol McDonald cmcdonald@mapr.com
August 2015 http://answers.mapr.com/
®
© 2014 MapR Technologies 2
Download slides to follow along
•  http://goo.gl/cNZ8RH
®
© 2014 MapR Technologies 3
Objectives of this session
•  What is HBase?
–  Why do we need NoSQL / HBase?
–  Overview of HBase & HBase data model
–  HBase Architecture and data flow
•  How to get started
–  Demo/Lab using HBase Shell using MapR Sandbox
•  Design considerations when migrating from RDBMS to HBase
•  Developing Applications using HBase Java API
–  Demo/Lab to perform CRUD operations using put, get, scan, delete
–  How to work around transactions
®
© 2014 MapR Technologies 4
Why do we need NoSQL / HBase?
Relational database model
Relational Data is typed and structured before stored:
–  Entities map to tables, normalized
–  Structured Query Language
•  Joins tables to bring back data
®
© 2014 MapR Technologies 5
Why do we need NoSQL / HBase?
Relational Model
•  Pros
–  Standard persistence model
•  standard language for data manipulation
–  Transactions handle concurrency ,
consistency
–  efficient and robust structure for storing
data
®
© 2014 MapR Technologies 6
What changed to bring on NoSQL? Big data
•  Cons of the Relational Model:
–  Does not scale horizontally:
•  Sharding is difficult to manage
•  Distributed join, transactions do not scale across shards
Horizonal scale : partition or shard tables across cluster
Distributed Joins, Transactions are Expensive
bottleneck
®
© 2014 MapR Technologies 7
Key Value Store
l  Couchbase
l  Riak
l  Citrusleaf
l  Redis
l  BerkeleyDB
l  Membrain
l  ...
Document
l  MongoDB
l  CouchDB
l  RavenDB
l  Couchbase
l  ... Graph
l  OrientDB
l  DEX
l  Neo4j
l  GraphBase
l  ...Wide Column
l  HBase
l  MapR-DB
l  Hypertable
l  Cassandra
l  ...
NoSQL Landscape
®
© 2014 MapR Technologies 8© 2014 MapR Technologies
®
Hbase designed for Distribution, Scale,
Speed
For Tutorial only: Send the PDF earlier to all attendees to help setup the laptop
®
© 2014 MapR Technologies 9
HBase is a Distributed Database
Data is automatically
distributed across the cluster.
•  Table is indexed by row key
•  Key range is used for
horizontal partitioning
•  Table splits happen
automatically as the data
grows
Key
Range
xxxx
xxxx
CF1
colA colB colC
val val
val
CF2
colA colB colC
val val
val
Key
Range
xxxx
xxxx
CF1
colA colB colC
val val
val
CF2
colA colB colC
val val
val
Key
Range
xxxx
xxxx
CF1
colA colB colC
val val
val
CF2
colA colB colC
val val
val
Put, Get by Key
®
© 2014 MapR Technologies 10
HBase is a ColumnFamily oriented Database
Data is accessed and stored together by RowKey
Similar Data is grouped & stored in Column Families
–  share common properties:
•  Number of versions
•  Time to Live (TTL)
•  Compression [lz4, lzf, Zlib]
•  In memory option …
CF1
colA colB colC
Val val
val
CF2
colA colB colC
val val
val
RowKey
axxx
gxxx
Customer id Customer Address data Customer order data
®
© 2014 MapR Technologies 11
HBase designed for Distribution
•  Distributed data stored and accessed together:
–  Key range is used for horizontal partitioning
•  Pros
–  scalable handles data volume and velocity
–  Fast Writes and Reads by Key
•  Cons
–  No joins natively (tools like Drill help)
–  Need to know how data will be queried in advance to do
good schema design to achieve best performance
Key
Range
axxx
kxxx
Key
Range
axxx
kxxx
Key
Range
axxx
kxxx
®
© 2014 MapR Technologies 12
HBase Data Model
Row Keys: identify the rows in an HBase table
Columns are grouped into column families
Row
Key
CF1 CF2 …
colA colB colC colA colB colC colD
R1
axxx val val val val
…
gxxx val val val val
R2
hxxx val val val val val val val
…
jxxx val
R3
kxxx val val val val
…
rxxx val val val val val val
… sxxx val val
®
© 2014 MapR Technologies 13
HBase Data Storage - Cells
•  Data is stored in Key value format
•  Value for each cell is specified by complete coordinates:
•  (Row key, ColumnFamily, Column Qualifier, timestamp ) => Value
–  RowKey:CF:Col:Version:Value
–  smithj:data:city:1391813876369:nashville
Cell Coordinates= Key
Row key Column Family Column Qualifier Timestamp Value
Smithj data city 1391813876369 nashville
Column Key
Value
®
© 2014 MapR Technologies 14
Logical Data Model vs Physical Data Storage
•  Data is stored in Key Value format
•  Key Value is stored for each Cell
•  Column families data are stored in
separate files
RowK
ey
CF1 CF2
colA colB colA colC
ra 1 2
rxxxx
rxxx
Logical Model
Row
Key
CF1:Col version value
ra cf1:cola 1 1
row
Key
CF2:Col version value
ra cf2:cola 1 2
Physical Storage
Key Value Key Value
Physical Storage
®
© 2014 MapR Technologies 15
Sparse Data with Cell Versions
CF1:colA CF1:colB CF1:colC
Row1
Row10
Row11
Row2
@time1:
value1
@time5:
value2
@time7:
value3
@time2:
value1
@time3:
value1
@time4:
value1
@time2:
value1
@time4:
value1
@time6:
value2
®
© 2014 MapR Technologies 16
Versioned Data
•  Version
–  each put, delete adds new cell, new version
–  A long
•  by default the current time in milliseconds if no version specified
–  Last 3 versions are stored by default
•  Configurable by column family
–  You can delete specific cell versions
–  When a cell exceeds the maximum number of versions, the extra
records are removed
Key CF1:Col version value
ra cf1:cola 3 3
ra cf1:cola 2 2
ra cf1:cola 1 1
®
© 2014 MapR Technologies 17
Table Physical View
Physically data is stored per Column family as a sorted map
•  Ordered by row key, column qualifier in ascending order
•  Ordered by timestamp in descending order
Row
key
Column
qualifier
Cell
value
Timestamp
(long)
Row1 CF1:colA value3 time7
Row1 CF1:colA value2 time5
Row1 CF1:colA value1 time1
Row10 CF1:colA value1 time4
Row 10 CF1:colB value1 time4
Sorted by
Row key
and Column
Sorted in
descending
order
®
© 2014 MapR Technologies 18
Logical Data Model vs Physical Data Storage
Key CF1:Col version value
ra cf1:ca 1 1
rb cf1:cb 2 4
rb cf1:cb 1 3
rc cf1:ca 1 5
Row
Key
CF1 CF2
ca cb ca cd
ra 1 2
rb 3,4
rc 5 6,7 8
Key CF2:Col version value
ra cf2:ca 1 2
rc cf2:ca 2 7
rc cf2:ca 1 6
rc cf2:cd 1 8
Physical Storage
Logical
Model
Column families are stored
separately
Row keys, Qualifiers are sorted
lexicographically
Key Value Key Value
®
© 2014 MapR Technologies 19
HBase Table is a Sorted map of maps
SortedMap<Key, Value>
Table
Map of Rows
Map of CF
Map of columns
Map of cells
Key CF1:Col version value
ra cf1:ca v1 1
rb cf1:cb v2 4
rb cf1:cb v1 3
rc cf1:ca v1 5
Key CF2:Col version value
ra cf2:ca v1 2
rc cf2:ca v2 7
rc cf2:ca v1 6
rc cf2:cd v1 8
SortedMap<RowKey,
SortedMap< ColumnFamily,
SortedMap< ColumnName,
SortedMap < version, Value> >>>
®
© 2014 MapR Technologies 20
HBase Table SortedMap<Key, Value>
<ra,<cf1, <ca, <v1, 1>>
<cf2, <ca, <v1, 2>>>
<rb,<cf1, <cb, <v2, 4>
<v1, 3>>>
<rc,<cf1, <ca, <v1, 5>>
<cf2, <ca, <v2, 7>>
<ca, <v1, 6>>
<cd, <v1, 8>>>
Key CF1:Col version value
ra cf1:ca v1 1
rb cf1:cb v2 4
rb cf1:cb v1 3
rc cf1:ca v1 5
Key CF2:Col version value
ra cf2:ca v1 2
rc cf2:ca v2 7
rc cf2:ca v1 6
rc cf2:cd v1 8
®
© 2014 MapR Technologies 21
Basic Table Operations
•  Create Table, define Column Families before data is
imported
–  but not the rows keys or number/names of columns
•  Low level API, technically more demanding
•  Basic data access operations (CRUD):
put Inserts data into rows (both create and update)
get Accesses data from one row
scan Accesses data from a range of rows
delete Delete a row or a range of rows or columns
®
© 2014 MapR Technologies 22© 2014 MapR Technologies
®
HBase Architecture
Data flow for Writes, Reads
Designed to Scale
®
© 2014 MapR Technologies 23
What is a Region?
•  Tables are partitioned into key ranges (regions)
•  Region servers serve data for reads and writes
–  For the range of keys it is responsible for
Region Server
Client
Region Region
HMaster
zookeepe
rzookeeperzookeeper
Region Server
Region Region
Get, Put
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
®
© 2014 MapR Technologies 24
Region Server Components
•  WAL: write ahead log on disk (commit log), Used for recovery
•  BlockCache: Read Cache, Least Recently Used evicted
•  MemStore: Write Cache, sorted keyValue updates.
•  Hfile=sorted KeyValues on disk
Region
Server
HDFS Data Node
BlockCache
memstore
HFile
memstore
HFile
Region Region
HFileHFile
memstore memstore
WAL
®
© 2014 MapR Technologies 25
HBase Write Steps
Put
each incoming record written
to WAL for durability:
•  log on disk
•  updates appended sequentially
HDFS Data Node
memstore memstore
Region Server
Region
WAL
®
© 2014 MapR Technologies 26
HBase Write Steps
Put
Next updates are written to the
Memstore:
•  write cache
•  in-memory
•  sorted list of KeyValue updates HDFS Data Node
memstore memstore
Region Server
Region
WAL
Ack
Updates quickly sorted in memory are available to queries after put returns
®
© 2014 MapR Technologies 27
HBase Memstore
•  in-memory
•  sorted list of Key → Value
•  One per column family
•  Updates quickly sorted in memory
memstore memstore
Region
Key CF1:Col version value
ra cf1:ca v1 1
rb cf1:cb v2 4
rb cf1:cb v1 3
rc cf1:ca v1 5
Key CF2:Col version value
ra cf2:ca v1 2
rc cf2:ca v2 7
rc cf2:ca v1 6
rc cf2:cd v1 8
Key Value Key Value
®
© 2014 MapR Technologies 28
HBase Region Flush
When 1 Memstore is full:
•  all memstores in region flushed
to new Hfiles on disk
•  Hfile: sorted list of key → values
On disk
HDFS Data Node
memstore memstore
Region Server
Region
WAL
HFile HFile
FLUSH
HFileHFile
®
© 2014 MapR Technologies 29
HBase HFile
•  On disk sorted list of key → values
•  One per column family
•  Flushed quickly to file
•  Sequential write HDFS Data Node
HFile HFile
Sequential write
Key CF1:Col version value
ra cf1:ca v1 1
rb cf1:cb v2 4
rb cf1:cb v1 3
rc cf1:ca v1 5
Key CF2:Col version value
ra cf2:ca v1 2
rc cf2:ca v2 7
rc cf2:ca v1 6
rc cf2:cd v1 8
Key Value Key Value
®
© 2014 MapR Technologies 30
HBase HFile Structure
•  Memstore flushes to an Hfile
Key-value
pairs are
stored in
increasing
order
Index points to
row keys
location
B+-tree: leaf
index , root index
Key A Value
…
…
…
Key P Value
…..
….
…
Key T Value
…..
….
…
Key z Value
…..
….
…
Leaf Index
Leaf Index
Leaf Index
Root Index
Interm Index
Bloom
Trailer
Bloom
Bloom
Bloom Leaf Index
d
Root Index
Intermediate Index
d l z
Data block
Key a Value
. . .
Key d Value
Leaf Index
l
Leaf Index
z
Data block
Key e Value
. . .
Key l Value
Data block
Key m Value
. . .
Key z Value
®
© 2014 MapR Technologies 31
HBase Read Merge from Memory and Files
•  MemStore creates multiple small store files over time when flushing.
•  When a get/scan comes in, multiple files have to be examined
HDFS Data Node
BlockCache
memstore
Region Server Region
WAL HFile
HFile
HFile
scanner
read Get or Scan searches for
Row Cell KeyValues:
1.  Block Cache ((Memory)
2.  Memstore (Memory)
3.  Load HFiles from Disk
into Block Cache based on
indexes and bloomfilters
1
2
3
®
© 2014 MapR Technologies 32
HDFS Data Node
HBase Compaction •  minor compaction:
•  merges files into fewer larger ones.
•  Major compaction:
•  merge all Hfiles into one per column
family.
•  remove cells marked for
deletion
Region Server
Region Region
memstore memstore
WAL minor
compaction
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
HFile
updates
HFile
major
compaction
Flush to disk
®
© 2014 MapR Technologies 33
HBase Background: Log-Structured Merge Trees
•  Traditional Databases use B+ trees:
–  expensive to update
•  HBase: Log Structured Merge Trees
–  Sequential writes
•  Writes go to memory And WAL
•  Sorted memstore flushes to disk
–  Sequential Reads
•  From memory, index, sorted disk
Index Log
Index
Memory Disk
Write
Read
predictable disk seeks
®
© 2014 MapR Technologies 34
Data Model for Fast Writes, Reads
•  Predictable disk lay out
•  Minimize disk seek
•  Get, Put by row key: fast access
•  Scan by row key range: stored sorted, efficient
sequential access for key range
Region1
Key Range
ra
rx
Region
Region
Region
Server
Key CF1:Col version value
ra cf1:ca v1 1
rb cf1:cb v2 4
rb cf1:cb v1 3
rc cf1:ca v1 5
Get key
Scan start key,
Stop key
Minimize disk seek
®
© 2014 MapR Technologies 35
Region = contiguous keys
•  Regions fundamental partitioning/sharding object.
•  By default, on table creation 1 region is created that holds the
entire key range.
•  When region becomes too large, splits into two child regions.
•  Typical region size is a few GB, sometimes even 10G or 20G
Region
CF1
colA colB colC
val val
val
CF2
colA colB colC
val val
val
Key
Range
axxx
gxxx
Region
®
© 2014 MapR Technologies 36
Region Split
•  The RegionServer splits a
region
•  daughter regions
–  each with ½ of the
regions keys.
–  opened in parallel on
same server
•  reports the split to the Master
Region 1 Region 2
Region Server 1
Key colB colC
val val
val
Key colB colC
val val
val
Region Server 1
Region 1
Key
Range
axxx
kxxx
Key
Range
Lxxx
zxxx
Key colB colC
val val
val
®
© 2014 MapR Technologies 37© 2014 MapR Technologies
®
HBase Use Cases
®
© 2014 MapR Technologies 38
3 Main Use Case Categories
•  Capturing Incremental data --Time Series Data
–  Hi Volume, Velocity Writes
•  Information Exchange, Messaging
–  Hi Volume, Velocity Write/Read
•  Content Serving, Web Application Backend
–  Hi Volume, Velocity Reads
®
© 2014 MapR Technologies 39
3 Main Use Case Categories
•  Time Series Data, Stuff with a Time Stamp
–  Sensor, System Metrics, Events, log files
–  Stock Ticker, User Activity
–  Hi Volume, Velocity Writes
HBase
Put
App
Server
App
Serverread
Put
Put
Put
Event time stamped
data
sensor
OpenTSDB
Data for real-time monitoring.
®
© 2014 MapR Technologies 40
HBase
Messages
read
Put
App
Server
read
App
Server
read
App
Server Put
Put
Put
App
Server
3 Main Use Case Categories
•  Information Exchange
–  email, Chat, Inbox: Facebook
–  Hi Volume, Velocity Write/Read
https://www.facebook.com/UsingHbase
®
© 2014 MapR Technologies 41
3 Main Use Case Categories
•  Content Serving, Web Application Backend
–  Online Catalog: Gap, World Library Catalog.
–  Search Index: ebay
–  Online Pre-Computed View: Groupon, Pinterest
–  Hi Volume, Velocity Reads
Hbase
Processed
data
read
App
Server
read
App
Server
read
App
Server
Bulk Import
Pre-Computed
Materialized View
®
© 2014 MapR Technologies 42
Agenda
•  Why do we need NoSQL / HBase?
•  Overview of HBase & HBase data model
•  HBase Architecture and data flow
•  Demo/Lab using HBase Shell
–  Create tables and CRUD operations using MapR Sandbox
•  Design considerations when migrating from RDBMS to HBase
•  HBase Java API to perform CRUD operations
–  Demo / Lab using Eclipse, HBase Java API & MapR Sandbox
•  How to work around transactions
®
© 2014 MapR Technologies 43
Hands-On-Labs Info
•  Install a one-node MapR Sandbox on your laptop
•  Install and configure Eclipse to develop HBase
applications using Java API
•  MapR Client is optional
http://goo.gl/cNZ8RH
®
© 2014 MapR Technologies 44
What is MapR Sandbox and how to use it
•  MapR Sandbox is a fully functional single-node
Hadoop cluster running on a virtual machine
Host computer (your laptop) Vmware or VirtualBox
MapR Sandbox
MapR Client
cmd window
Eclipse
cmd You can directly login and
use terminal window to run
Hadoop commands
Browser:
MCS, Hue
®
© 2014 MapR Technologies 45
Software components to install
•  Here are all the software components you need to install to
get MapR Sandbox working on Windows
1. JDK 7
2. MVMware player (or) Oracle VirtualBox (prerequisite)
3. MapR Sandbox
4. MapR Client (optional)
5. Eclipse (for Java developers only)
•  HBaseTutorialInstallGuide.pdf will cover how to download, install
and configure all these components on your laptop
®
© 2014 MapR Technologies 46
Cluster for those that have not installed MapR Sandbox
•  CLDB Nodes:
https://ec2-54-176-71-123.us-west-1.compute.amazonaws.com:8443
https://ec2-54-176-33-167.us-west-1.compute.amazonaws.com:8443
•  Login nodes:
Rows 1 & 2: 54.176.71.123 ip-10-197-54-142
Rows 3 & 4: 54.176.33.167 ip-10-199-46-212
Rows 5 & 6: 54.193.226.196 ip-10-198-76-134
Rows 7 & 8: 54.177.2.5 ip-10-197-8-213
Last Rows: 54.193.140.172 ip-10-198-79-15
Username/passwd: user02 … user50 mapr
®
© 2014 MapR Technologies 47
Lab Exercise
See Hbase_Tutorial_Lab.pdf for all labs
Start MapR Sandbox and log into the cluster
[user: mapr, passwd: mapr]
Use the HBase shell
>Hbase shell
hbase> help
hbase> create ’/user/mapr/mytable’, {NAME =>’cf1’}
hbase> put ’/user/mapr/mytable’, ’row1’, ’cf1:col1’, ‘datacf1c1v1’
hbase> get ’/user/mapr/mytable’, ’row1’
hbase> scan ’/user/mapr/mytable’
hbase> describe ’/user/mapr/mytable’
®
© 2014 MapR Technologies 48
®
© 2014 MapR Technologies
Schema Design Guidelines
•  HBase tables ≠ Relational tables!
•  HBase Design for Access Paterns
®
© 2014 MapR Technologies 49
Use Case Example: Record Stock Trade Information
in a Table
•  Trade data:
Trade
•  timestamp
•  stock symbol
•  price per share
•  volume of trade
•  Example
–  1381396363000
(epoch timestamp with
millisecond granularity)
–  AMZN
–  $304.66
–  1333 shares
®
© 2014 MapR Technologies 50
Intelligent keys
•  Only the row keys are indexed (elastic search & Solr can help)
•  Compose the key with attributes used for searching
–  Composite key : 2 or more identifying attributes
–  Like multi-column index design in RDB
Cell Coordinates (Key)
Granularity
Row key Column Family Column Name Timestamp Value
Restrict disk I/O Restrict network traffic
®
© 2014 MapR Technologies 51
Composite Keys
Use composite rowkey: attributes used for searching
•  Include multiple elements in the rowkey
–  Use a separator or fixed length
•  Example rowkey format:
–  Ex: GOOG_20131012
•  Get operations require complete row key.
•  Scans can use partial keys.
–  Ex: “GOOG” or "GOOG_2014"
SYMBOL + DATE (YYYYMMDD)
®
© 2014 MapR Technologies 52
Consider Access Patterns for Application
•  By date? By hour? By companyId?
–  Rowkey design
•  What if the Date/Timestamp is leftmost ?
How will data be retrieved?
Key
1391813876369_AMZN
1391813876370_AMZN
1391813876371_GOOG
®
© 2014 MapR Technologies 53
Hot-Spotting and Region Splits
•  If rowkeys are written in sequential order then
writes go to only one server
–  Split when full
1900
1950 …
1999
Region Server 1
Key colB col
C
1900 val
val
1999
Region 1Key
Range
1900
1999
Sequential key, like a timestamp
File
Server 1
®
© 2014 MapR Technologies 54
Hot-Spotting and Region Splits
•  Regions split as the table grows.
–  RegionServer Creates two new regions, each with half of the
original regions keys.
•  Sequential writes will go to new region
2040
2050
2000
Region Server 1
File
Server 1Key colB col
C
1900 val
val
1999
Region 1Key
Range
1900
1950
Key colB col
C
1900 val
val
1999
Region 2Key
Range
1950
2050
®
© 2014 MapR Technologies 55
Hot-Spotting and Region Splits
3040
3000
Sequential writes will go
to new region
Region Server 1
File
Server 1Key colB col
C
1900 val
val
1999
Region 1Key
Range
1900
1950
Key colB col
C
1900 val
val
1999
Region 2Key
Range
1950
3050
®
© 2014 MapR Technologies 56
Hot-Spotting and Region Splits
3045
Regions split as the table
grows.
Sequential writes will go
to new region
Region Server 1
File
Server 1Key colB col
C
1900 val
val
1999
Region 1Key
Range
1900
1950
Key colB col
C
1900 val
val
1999
Region 2Key
Range
1950
2050
Key colB col
C
1900 val
val
1999
Region 3Key
Range
2051
3050
3041
3050
®
© 2014 MapR Technologies 57
Random keys
Key
Range
a23148
3d1a5f
e0e9b4
Key
Range
Key
Range
Key
Range
Key
Range
Key
Range
MD5
Hash
rowkey
Random writes will go to different regions
If table was pre-split or big enough to have split
d = MessageDigest.getInstance("MD5");
byte[] prefix = d.digest(Bytes.toBytes(s));
®
© 2014 MapR Technologies 58
Sequential vs. Random keys
Random is better for writing , but sequential is better for scanning
row keys
Writes
Sequential Reads
Sequential
Keys
Performance
Salted
Keys
Promoted
Field Keys
Random
Keys
®
© 2014 MapR Technologies 59
Prefix, Promote a field key
Key
Range
Key
Range
Key
Range
Key
Range
Key
Range
Key
Range
amzn_1999
amzn_2003
amzn_2005
cisc_1998
cisc_2002
cisc_2010
goog_1990
goog_2020
goog_2030
®
© 2014 MapR Technologies 60
Prefix with a Hashed field key
Key
Range
a23148_2003
1d1a5f_1999
e0e9b4_2000
Key
Range
Key
Range
Key
Range
Key
Range
Key
Range
MD5
Hash
prefix
rowkey
prefix the rowkey with a (shortened) hash:
byte[] hash = d.digest(Bytes.toBytes(fieldkey));
Bytes.putBytes(rowkey, 0, hash, 0, length);
g0e8b4_2004
b33148_2006
3d1a5f_2007
®
© 2014 MapR Technologies 61
Consider Access Patterns for Application
•  Which trade data needs fastest access (or most frequent)?
–  Rowkey ordering
•  What if you want to retrieve the stocks by symbol and date?
•  Scan by row key prefix Increasing time: PREFIX_TIMESTAMP
•  What if you usually want to retrieve the most recent?
How will data be retrieved?
Key
AMZN_1391813876369
AMZN_1391813876370
GOOG_1391813876371
SYMBOL + timestamp
®
© 2014 MapR Technologies 62
Last In First Out Access: Use Reverse-Timestamp
•  Row keys are sorted in increasing order
•  For fast access to most-recent writes:
–  design composite rowkey with reverse-timestamp that decreases
over time.
–  Scan by row key prefix Decreasing: [MAXTIME–TIMESTAMP]
•  Ex: Long.MAX_VALUE-date.getTime()
SYMBOL + Reverse timestamp
Key
AMZN_98618600666
AMZN_98618600777
GOOG_98618608888
®
© 2014 MapR Technologies 63
Consider Access Patterns for Application
•  What are the needs for atomicity of transactions?
–  Column design
–  More Values in a single row
•  Works well to get or update multiple values
How will data be retrieved?
®
© 2014 MapR Technologies 64
Rowkey design influences shape of Tables: Tall or Flat
Tall Narrow Flat Wide
®
© 2014 MapR Technologies 65
Tall Table for Stock Trades
Rowkey format:
Ex: AMZN_98618600888
SYMBOL + Reverse timestamp
rowkey
CF: CF1
CF1:price CF1:vol
… … …
AMZN_98618600666 12.34 2000
AMZN_98618600777 12.41 50
AMZN_98618600888 12.37 10000
… … …
CSCO_98618600777 23.01 1000
…
®
© 2014 MapR Technologies 66
Consider Access Patterns for Application
•  Are Price and Volume data typically accessed together, or are
they unrelated?
–  Column family structure
•  Column Families
–  group data that will be read and stored together
–  Can set attributes:
•  # Min/Max versions, compression, in-memory, Time-To-Live
•  Columns
–  Column names are dynamic, not pre-defined
–  every row does not need to have same columns
How will data be retrieved?
®
© 2014 MapR Technologies 67
Wide Table for Stock Trades
rowkey
CF price CF vol
p:10 p:1000 … p:2000 v:10 v:1000 … v:2000
AMZN_986186006 12.37 13 12.34 10000 2000
…
CSCO_986186070 23.01 1000
Rowkey format:
Ex: AMZN_20131020
SYMBOL + Reverse timestamp rounded to the hour
•  Separate price & volume data into column families
•  Segregate time into buckets:
–  Time rounded to the hour in the rowkey
–  Time in column name represents seconds since the timestamp in the key
–  One row stores a bucket of measurements for the hour
®
© 2014 MapR Technologies 68
Wide Table for Stock Trades
rowkey
CF1 CF stats
CF1:10 CF1:1000 … Day Hi … Day LO
AMZN_20131020 {p:12.37,v:
1000}
{p:
12.37,v:
1000}
…
CSCO_20131020
Rowkey format:
Ex: AMZN_20131020
SYMBOL + Reverse timestamp rounded to the day
•  Segregate time into buckets:
–  Time rounded to the day in the rowkey
–  Time in column name represents seconds since the timestamp in the key
–  One row stores a bucket of measurements for the day
®
© 2014 MapR Technologies 69
Lesson:	
  Schemas	
  can	
  be	
  very	
  
flexible	
  and	
  can	
  even	
  
change	
  on	
  the	
  fly	
  
Column names can be dynamic, every row does not
need to have same columns
®
© 2014 MapR Technologies 70
Consider Access Patterns for Application
•  Do all trades need to be saved forever?
–  TTL Time to Live , CF can be set to expire cells
•  How many Versions?
–  Max Versions
–  You can have many versions of data in a cell, default is 3
How will data be retrieved?
®
© 2014 MapR Technologies 71
Wide Table for Stock Trades
rowkey
CF price CF vol CF stats
price:00 … price:23 vol:00 … vol:23 Day Hi Day Lo
AMZN_20131020 12.37 12.34 10000 2000
…
CSCO_20130817 23.01 1000
Rowkey format:
Ex: AMZN_20131020
SYMBOL + date YYYYMMDD
•  Separate price & volume data into column families
•  Segregate time into buckets:
–  Date in the rowkey
–  Hour in the column name
–  Set Column Family to store Max Versions, timestamp in the version
®
© 2014 MapR Technologies 72
Flat-Wide Vs. Tall-Narrow Tables
•  Tall-Narrow provides better query granularity
–  Finer grained Row Key
–  Works well with scan
•  Flat-Wide supports built-in row atomicity
–  More Values in a single row
•  Works well to update multiple values (row atomicity)
•  Works well to get multiple associated values
®
© 2014 MapR Technologies 73
Lesson:	
  Have	
  to	
  know	
  the	
  
queries	
  to	
  design	
  in	
  
performance	
  
®
© 2014 MapR Technologies 74
Comparing Relational schema to HBase
•  HBase is lower-level than relational tables
–  Design is different
•  Relational design
–  Data centric, focus on entities and relations
–  Query, joins
•  New views of data from different tables easily created
–  Does not scale across cluster
•  HBase is designed for clustering:
–  Distributed data is stored and accessed together
–  Query centric, focus on how the data is read
–  Design for the questions
Key
Range
axxx
kxxx
Key
Range
xxx
xxx
Key
Range
xxx
zxxx
®
© 2014 MapR Technologies 75
Design of HBase Table
•  De-normalize data
–  Databases intended for online transaction processing (OLTP) are
typically normalized
–  Databases intended for online analytical processing (OLAP) are
primarily "read mostly" databases and denormalized so that you do not
require to have joins – permits fast lookups
®
© 2014 MapR Technologies 76
l  Relational Databases are typically normalized
l  Goal Normalization:
–  eliminate redundant data
–  Put repeating information in its own table
Normalized database :
–  Causes joins
•  data has to be retrieved from more tables.
•  queries can take more time
Normalization
®
© 2014 MapR Technologies 77
HBase Nested Entity
•  A one-to-many relationship can be modeled as a single row
–  Embedded, Nested Entity
•  Order one-to-many with Line Items
–  Row key: parent id
•  OrderId
–  column name : child id stored
•  line Item id
OrderId Data:date item:id1 item:id2 item:id3
123 20131010 $10 $20 $9.45
®
© 2014 MapR Technologies 78
De-Normalization
Order and Items in same table
De-normalization:
–  store data about an entity and related entities in the same
table.
•  Reads are faster across a cluster
–  retrieve data about entity and related entities in one read
OrderId Data:date item:id1 item:id2 item:id3
123 20131010 $10 $20 $9.45
®
© 2014 MapR Technologies 79
Many to Many Relationship RDBMS
user
id (primary key)
name
alias
email
book
id (primary key)
title
description
user_book_rating
id (primary key)
userId (foreign key)
bookId (foreign key)
rating
1 ∞ 1∞
Online book store
•  Querys
•  Get name for user x
•  Get title for book x
•  Get books and corresponding ratings for userId x
•  Get all userids and corresponding ratings for book y
®
© 2014 MapR Technologies 80
Many to Many Relationship HBase
User table Column family rating, bookid is column name
Key data:fname … rating:bookid1 rating:bookid2
userid1 5 4
Key data:title … rating:userid1 rating:userid2
bookid1 5 4
Book table Column family rating, userid is column name
•  Queries
•  Get books and corresponding ratings for userId x
•  Get all userids and corresponding ratings for book y
®
© 2014 MapR Technologies 81
Generic data: Event, Attributes, Values
•  Event Id, Event name-value pairs , schema-less
patientXYZ-ts1, Temperature , "102"
patientXYZ-ts1, Coughing, "True"
patientXYY-ts2, Heart Rate, "98"
•  This is the advantage of HBase
–  Define columns on the fly,
•  put attribute name in column qualifier
Key event:heartrate event:coughing event:temperature
Patientxyz-ts1 98 true 102
Event type name=qualifierEvent id=row key Event measurement=value
®
© 2014 MapR Technologies 82
Self Join Relationship HBase
•  Example Twitter
•  User_x follows User_y
•  User_y followed by User_z
•  Querys
–  Get all users who Carol follows
–  Get all users following Carol
Key data:timestamp
Carol:follows:SteveJobs
Carol:followedby:BillyBob
®
© 2014 MapR Technologies 83
Hierarchical Data
•  tree like structure
•  Use a flat-wide
–  Parents, children in columns
usa
FLTN
Nashville Miami
Key P:USA P:TN p:FL c:TN C:FL C:Nashvl C:Miami
USA state state
TN country city
FL country city
Nashville state
miami state
®
© 2014 MapR Technologies 84
Inheritance mapping
•  Online Store Example Product table
–  type is in row key for searching
–  Columns are not the same for different types
Key price title details model
Bok+id1 10 Hbase blah
Dvd+Id2 15 stones blah
Kin+Id3 100 blah fire
Product
book dvd kindle
®
© 2014 MapR Technologies 85
Agenda
•  Why do we need NoSQL / HBase?
•  Overview of HBase & HBase data model
•  HBase Architecture and data flow
•  Demo/Lab using HBase Shell
–  Create tables and CRUD operations using MapR Sandbox
•  Design considerations when migrating from RDBMS to HBase
•  HBase Java API to perform CRUD operations
–  Demo / Lab using Eclipse, HBase Java API & MapR Sandbox
•  How to work around transactions
®
© 2014 MapR Technologies 86
© MapR Technologies, confidential ®
HBase
Java API fundamentals to
perform CRUD operations
®
© 2014 MapR Technologies 87
Shoppingcart Application Requirements
•  Need to create Tables: Shoppingcart & Inventory
•  Perform CRUD operations on these tables
–  Create, Read, Update, and Delete items from these tables
®
© 2014 MapR Technologies 88
Inventory & Shoppingcart Tables
Perform checkout operation for Mike
Inventory Table Shoppingcart Table
quantity
Pens 10
Notepads 21
Erasers 10
Pencils 40
pens notepads erasers
Mike 1 2 3
John 3 4 5
Mary 1 2 5
Adam 5 4 0
CF “items"CF “stock "
®
© 2014 MapR Technologies 89
Java API Fundamentals
•  CRUD operations
–  Get, Put, Delete, Scan, checkAndPut, checkAndDelete, Increment
–  KeyValue, Result, Scan – ResultScanner,
–  Batch Operations
®
© 2014 MapR Technologies 90
CRUD Operations Follow A Pattern (mostly)
•  common pattern
–  Instantiate object for an operation: Put put = new Put(key)
–  Add attributes to specify what to insert: put.add(…)
–  invoke operation with HTable: myTable.put(put)
// Insert value1 into rowKey in columnFamily:columnName1
Put put = new Put(rowKey);
put.add(columnFamily, columnName1, value1);
myTable.put(put);
®
© 2014 MapR Technologies 91
erasers notepads pens
Mike 3 2 1
CF “items"
Shoppingcart Table
Shopping Cart Table
Key CF:COL ts value
Mike items:erasers 1391813876369 3
Mike items:notepads 1391813876369 2
Mike items:pens 1391813876369 1
Physical Storage
®
© 2014 MapR Technologies 92
Put Operation
adding multiple column values to a row
byte [] tableName = Bytes.toBytes("/path/Shopping");
byte [] itemsCF = Bytes.toBytes(“items");
byte [] penCol = Bytes.toBytes (“pens”);
byte [] noteCol = Bytes.toBytes (“notes”);
byte [] eraserCol = Bytes.toBytes (“erasers”);
HTableInterface table = new HTable(hbaseConfig, tableName);
Put put = new Put(“Mike”);
put.add(itemsCF, penCol, Bytes.toBytes(l));
put.add(itemsCF, noteCol, Bytes.toBytes(2));
put.add(itemsCF, eraserCol, Bytes.toBytes(3));
table.put(put);
Key CF:COL ts value
Mike items:erasers 1391813876369 3
Mike items:notepads 1391813876369 2
Mike items:pens 1391813876369 1
®
© 2014 MapR Technologies 93
Get Example
byte [] tableName = Bytes.toBytes("/user/user01/shoppingcart");
byte [] itemsCF = Bytes.toBytes(“items");
byte [] penCol = Bytes.toBytes (“pens”);
HTableInterface table = new HTable(hbaseConfig, tableName);
Get get = new Get(“Mike”);
get.addColumn(itemsCF, penCol);
Result result = myTable.get(get);
byte[] val = result.getValue(itemsCF, penCol);
System.out.println("Value: " + Bytes.toLong(val)); //prints 1
Key CF:COL ts value
Mike items:erasers 1391813876369 3
Mike items:notepads 1391813876369 2
Mike items:pens 1391813876369 1
®
© 2014 MapR Technologies 94
Result Class
•  A Result instance wraps data from a row returned from a get or a
scan operation. Result wraps KeyValues
•  Result toString() looks like this :
keyvalues={Adam/items:erasers/1391813876369/Put/vlen=8/ts=0,
Adam/items:notepads/1391813876369/Put/vlen=8/ts=0,
Adam/items:pens/1391813876369/Put/vlen=8/ts=0}
•  The Result object provides methods to return values
byte[] b = result.getValue(columnFamilyName,columnName1);
Items:erasers Items:notepads Items:pens
Adam 0 4 5
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Result.html
®
© 2014 MapR Technologies 95
KeyValue – The Fundamental HBase Type
•  A KeyValue instance is a cell instance
–  Contains Key (cell coordinates) and the Value (data)
•  Cell coordinates: Row key, Column family, Column qualifier,
Timestamp
•  KeyValue toString() looks like this :
Adam/items:erasers/1391813876369/Put/vlen=8/
Key =Cell Coordinates
Row key Column Family Column Qualifier Timestamp Value
Value
Adam items erasers 1391813876369 0
®
© 2014 MapR Technologies 96
Bytes class
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/util/Bytes.html
•  org.apache.hadoop.hbase.util.Bytes
•  Provides methods to convert Java types to and from byte[] arrays
•  Support for
–  String, boolean, short, int, long, double, and float
byte[] bytesTable = Bytes.toBytes("Shopping");
String table = Bytes.toString(bytesTable);
byte[] amountBytes = Bytes.toBytes(1000l);
long amount = Bytes.toLong(amount);
®
© 2014 MapR Technologies 97
Scan Operation – Example
byte[] startRow=Bytes.toBytes(“Adam”);
byte[] stopRow=Bytes.toBytes(“N”);
Scan s = new Scan(startRow, stopRow);
scan.addFamily(columnFamily);
ResultScanner rs = myTable.getScanner(s);
®
© 2014 MapR Technologies 98
ResultScanner - Example
Resultscanner provides iterator-like functionality
Scan scan = new Scan();
scan.addFamily(columnFamily);
ResultScanner scanner = myTable.getScanner(scan);
try {
for (Result res : scanner) {
System.out.println(res);
}
} catch (Exception e) {
System.out.println(e);
} finally {
scanner.close();
}
Calls scanner.next()
Always put in finally block
®
© 2014 MapR Technologies 99
Lab Exercise Program Structure
•  ShoppingCartApp– main class
•  InventoryDAO – A DAO for the Inventory CRUD
functionality
•  ShoppingCartDAO – A DAO for the Inventory
CRUD functionality
•  Inventory – A Java object that holds data for a
single Inventory row
•  ShoppingCart – A Java object that holds data for
a single Inventory row
•  MockHtable – in memory test hbase table, allows
to run code, debug without hbase running on a
cluster or vm.
®
© 2014 MapR Technologies 100
Lab Exercise
•  See lab PDF
–  Import the project “lab-exercises-shopping” into Eclipse
–  Setup creates Inventory and Shoppingcart Tables and inserts data
–  Use Get, Put, Scan, and Delete operations
®
© 2014 MapR Technologies 101
Lab: Import, build
•  Download the code
•  Import Maven project lab-exercises-shopping
into Eclipse
•  Build : Run As -> Maven Install
®
© 2014 MapR Technologies 102
Lab: run TestInventorySetup JUnit
•  Select Test Class, Then Run As -> JUnit Test
•  Uses MockHTable https://gist.github.com/agaoglu/613217
®
© 2014 MapR Technologies 103© 2014 MapR Technologies
®
Shoppingcart Checkout functionality
®
© 2014 MapR Technologies 104
Inventory & Shoppingcart Tables
How can we implement Checkout functionality?
quantity
Pens 10
Notepads 21
Erasers 10
Pencils 40
CF “stock"
Inventory Table
pens notepads erasers
Mike 0 0 0
John 3 4 5
Mary 1 2 5
Adam 5 4 0
CF “items"
Shoppingcart Table
®
© 2014 MapR Technologies 105
Inventory & Shoppingcart Tables
after checkAndPut for Mike
quantity Mike …
Pens 10 9 1
Notepads 21 19 2
Erasers 10 7 3
Pencils 50
CF “stock"
Inventory Table
pens notepads erasers
Mike 1 2 3
John 3 4 5
Mary 1 2 5
Adam 5 4 0
CF “items"
Shoppingcart Table
®
© 2014 MapR Technologies 106
checkAndPut Method
quantity
pens 10
CF “stock"
Inventory Table before
quantity Mike
pens 9 1
CF “stock"
Inventory Table after
boolean checkAndPut(byte[] row, byte[] family,
byte[] qualifier, byte[] value, Put put)
myTable.checkAndPut(“pens”, CF, COL, 10, put);
Check a Value Put a row update
®
© 2014 MapR Technologies 107
checkAndPut Method – Atomic Put with Check
•  Atomic operation guarded by a check
–  Check is executed first then if success put operation is
executed
–  Similar to java compareAndSet(expectedValue,
updateValue)
•  Relies on checking and modifying the same row
–  Atomicity guaranteed on single row
•  Method signature:
boolean checkAndPut(byte[] row, byte[] family,
byte[] qualifier,
byte[] value, Put put) throws IOException
•  Use for concurrent transactions, multiple clients
updating
®
© 2014 MapR Technologies 108
checkAndPut method
byte [] rowPens = Bytes.toBytes (“pens”);
byte [] CF = Bytes.toBytes(“stock");
byte [] COL = Bytes.toBytes (“quantity”);
byte [] oldquantity= Bytes.toBytes(10);
byte [] amount= Bytes.toBytes(1);
byte [] newquantity= Bytes.toBytes(9);
Put put = new Put(rowPens);
put.add(CF, COL, newquantity);
put.add(CF, Bytes.toBytes(“Mike"), amount);
boolean ret = myTable.checkAndPut(rowPens, CF,COL,
oldquantity, put);
System.out.println("Put succeeded: " + ret);
Best practice
®
© 2014 MapR Technologies 109
What are the problems with this implementation?
•  Conflicts: What happens when two checkout operations read the
same inventory value and try to change it?
•  One fails
•  Application developer has to take this into consideration
and in such a case, get the latest value and try
checkAndPut again …
•  Not efficient in a large scale system as this can happen
quite a lot …
•  Is there a simpler way to do this in HBase?
•  Yes, Use Counters – Increment
®
© 2014 MapR Technologies 110
Counters – Overview
•  Counters provide an atomic increment of a number
•  Common use cases
–  Clicks, Page hits, views
•  Two types of counters:
–  single column counter with HTable method
–  multiple columns counter with an Increment object.
key stats: clicks
com.example/home 1000
A counter is a long column value
®
© 2014 MapR Technologies 111
Single Column Counter
•  HTable incrementColumnValue method
–  Atomically increments a single column value
–  Provides atomic read and modify
–  Easy to use
public long incrementColumnValue(byte[] row,
byte[] family,
byte[] qualifier,
long amount)
cf: qualifier
row amount
®
© 2014 MapR Technologies 112
HTable incrementColumnValue
long returnVal =
myTable.incrementColumnValue(
rowA,
stats,
clicks, // counter column name
42L);
key stats: clicks
rowA 1000 1042
®
© 2014 MapR Technologies 113
Counters – Possible Values
Value Description
Greater than zero Increase the counter by the given value
Zero Retrieve the current value of the counter
Less than zero Decreases the counter by the given value
None Increase the counter by 1
§  Effect	
  based	
  on	
  provided	
  values:	
  
§  You	
  must	
  use	
  a	
  long	
  
§  Don’t	
  need	
  to	
  ini6alize	
  (zero	
  by	
  default)	
  
®
© 2014 MapR Technologies 114
Multiple Column Counter- Increment object
Increment increment1 = new Increment(rowKey);
increment1.addColumn(CF, qualifier1, amount1);
increment1.addColumn(CF, qualifier1, amount1);
Result result1 = table.increment(increment1);
Create Increment object with Row Key
Add columns to
Increment
Call table increment
®
© 2014 MapR Technologies 115
Inventory Table Increment
	
   [CF] stock 	
  
	
   quantity	
   Mike	
  
pens	
   13 12	
   1	
  
Increment increment1 = new Increment(pens);
increment1.addColumn(stock, quantity, ?? );
increment1.addColumn(stock, Mike, ? );
Result result1 = table.increment(increment1);
-1
1
®
© 2014 MapR Technologies 116
Summary
•  Why NoSQL and why Hbase
•  Try these hands-on-labs on
•  MapR Sandbox or
•  MapR Developer Release (M3)
•  Design considerations when migrating from RDBMS to Hbase
•  De-normalize data
•  Column Families & Columns
•  Versions
•  RowKey design
•  Transactions using checkAndPut and Increment
®
© 2014 MapR Technologies 117
®
© 2014 MapR Technologies 118
References
•  http://hbase.apache.org/
•  http://hbase.apache.org/0.94/apidocs/
•  http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/
package-summary.html
•  http://hbase.apache.org/book/book.html
•  http://doc.mapr.com/display/MapR/MapR+Overview
•  http://doc.mapr.com/display/MapR/M7+-+Native+Storage+for+MapR+Tables
•  http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop
•  http://doc.mapr.com/display/MapR/Migrating+Between+Apache+HBase
+Tables+and+MapR+Tables
®
© 2014 MapR Technologies 119
HBase Challenges
Reliability
•  Compactions disrupt operations
•  Very slow crash recovery
Business continuity
•  Common hardware/software issues cause downtime
•  Administration requires downtime
•  No point-in-time recovery
•  Complex backup process
Performance
•  Many bottlenecks result in low throughput
•  Limited data locality
•  Limited # of tables
Manageability
•  Compactions, splits and merges must be done manually (in reality)
•  Basic operations like backup or table rename are complex
®
© 2014 MapR Technologies 120
Fast NoSQL with Zero Administration
Benefit Features
High Performance Over 1 million ops/sec with 10 node cluster
Continuous Low Latency No I/O storms, no compactions
24x7 Applications
Instant recovery, online schema modification, snapshots,
mirroring
Zero Administration No processes to manage, automated splits, self-tuning
High Scalability 1 trillion tables, billions of rows, millions of columns
Low TCO Files and tables on one platform, more work with fewer nodes
Performance
Reliability
Easy
Administration
MapR-DB
®
© 2014 MapR Technologies 121
MapR-DB: Fast, Integrated NoSQL
Operational
Analytics
Resilience/Business
Continuity
Performance/Scale
Hadoop-enabled
architecture
•  Operational/architectural
simplicity, analytics on live
data
•  Hadoop-scale (petabytes)
Proven production
readiness
•  Enterprise-grade HA/DR
•  Consistent snapshots
•  Integrated security
Consistent performance at
any scale
•  High throughput (100M
data points per second
ingestion)
•  Consistent low latency, no
compaction delays, even
at 95th and 99th percentiles
(< 10ms in YCSB)
®
© 2014 MapR Technologies 122
Mapr-DB Performance: Consistent, Low Latency
--- M7 Read Latency --- Others Read Latency
®
© 2014 MapR Technologies 123
MapR Editions
MapR Community Edition MapR Enterprise Edition MapR Enterprise
Database Edition
MapR-DB: Fastest in-Hadoop
database (No high availability(except
Yarn), snapshot, mirroring)
Everything in Community Edition,
except MapR-DB plus…
Everything in Enterprise Edition
plus….
Easy system-monitoring &
management with MapR Control
System
99.999% high availability & self
healing
MapR-DB Fastest in-Hadoop
database with enterprise HA features
[e.g. Mirroring, snapshot]
Real-time data flows with direct
access NFS
Data protection with snapshots &
mirroring
Consistent low latency with no
compactions
World record performance Multi-tenancy & job placement control Zero database administration
Instant recovery, snapshots and
mirroring for tables
MapR Makes Its NoSQL Database Freely Available in Community Edition
®
© 2014 MapR Technologies 124
Licensing
Community
Edition
Enterprise
Edition
Enterprise
Database Edition
M3 M5 M7
Support ✖
Exceptions (competitive
price pressure)
✔

 ✔

MapR-DB ✔

(4.0.1+)

Fastest and most reliable in-
Hadoop database, and an
“M7 development option”…
✖ ✔

MapR-DB with all the
enterprise capabilities
HBase support ✖

Exceptions only
(competitive price pressure)
$$$ $$$
®
© 2014 MapR Technologies 125
Q&A
@mapr maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

More Related Content

What's hot

Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Cmu-2011-09.pptx
Cmu-2011-09.pptxCmu-2011-09.pptx
Cmu-2011-09.pptxTed Dunning
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
Design Patterns for working with Fast Data
Design Patterns for working with Fast DataDesign Patterns for working with Fast Data
Design Patterns for working with Fast DataMapR Technologies
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 

What's hot (19)

Apache drill
Apache drillApache drill
Apache drill
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Cmu-2011-09.pptx
Cmu-2011-09.pptxCmu-2011-09.pptx
Cmu-2011-09.pptx
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Design Patterns for working with Fast Data
Design Patterns for working with Fast DataDesign Patterns for working with Fast Data
Design Patterns for working with Fast Data
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 

Viewers also liked

Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Ferran Galí Reniu
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Apache spark when things go wrong
Apache spark   when things go wrongApache spark   when things go wrong
Apache spark when things go wrongPawel Szulc
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profitPawel Szulc
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesRussell Spitzer
 
Functional Programming & Event Sourcing - a pair made in heaven
Functional Programming & Event Sourcing - a pair made in heavenFunctional Programming & Event Sourcing - a pair made in heaven
Functional Programming & Event Sourcing - a pair made in heavenPawel Szulc
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]Pawel Szulc
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
MongoDB & Spark
MongoDB & SparkMongoDB & Spark
MongoDB & SparkMongoDB
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Apache spark workshop
Apache spark workshopApache spark workshop
Apache spark workshopPawel Szulc
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDataWorks Summit
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 

Viewers also liked (20)

Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache spark when things go wrong
Apache spark   when things go wrongApache spark   when things go wrong
Apache spark when things go wrong
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 
Functional Programming & Event Sourcing - a pair made in heaven
Functional Programming & Event Sourcing - a pair made in heavenFunctional Programming & Event Sourcing - a pair made in heaven
Functional Programming & Event Sourcing - a pair made in heaven
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
MongoDB & Spark
MongoDB & SparkMongoDB & Spark
MongoDB & Spark
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Apache spark workshop
Apache spark workshopApache spark workshop
Apache spark workshop
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 

Similar to Getting started with HBase

NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafkaconfluent
 
Dchug m7-30 apr2013
Dchug m7-30 apr2013Dchug m7-30 apr2013
Dchug m7-30 apr2013jdfiori
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentationMurat Çakal
 
Cics TS 5.1 user experience
Cics TS 5.1 user experienceCics TS 5.1 user experience
Cics TS 5.1 user experienceLarry Lawler
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerYongseok Oh
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanWebinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanVerverica
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistencyScyllaDB
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBCarol McDonald
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
 

Similar to Getting started with HBase (20)

NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
 
Dchug m7-30 apr2013
Dchug m7-30 apr2013Dchug m7-30 apr2013
Dchug m7-30 apr2013
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentation
 
Cics TS 5.1 user experience
Cics TS 5.1 user experienceCics TS 5.1 user experience
Cics TS 5.1 user experience
 
2020 icldla-updated
2020 icldla-updated2020 icldla-updated
2020 icldla-updated
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth WiesmanWebinar: Deep Dive on Apache Flink State - Seth Wiesman
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
 
NoSQL & HBase overview
NoSQL & HBase overviewNoSQL & HBase overview
NoSQL & HBase overview
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistency
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 

More from Carol McDonald

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...Carol McDonald
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningCarol McDonald
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Carol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churnCarol McDonald
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with SparkCarol McDonald
 

More from Carol McDonald (20)

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine Learning
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churn
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with Spark
 

Recently uploaded

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Getting started with HBase

  • 1. ® © 2014 MapR Technologies 1 ® © 2014 MapR Technologies Getting Started with HBase Application development Keys Botzum kbotzum@mapr.com Sridhar Reddy sreddy@mapr.com Carol McDonald cmcdonald@mapr.com August 2015 http://answers.mapr.com/
  • 2. ® © 2014 MapR Technologies 2 Download slides to follow along •  http://goo.gl/cNZ8RH
  • 3. ® © 2014 MapR Technologies 3 Objectives of this session •  What is HBase? –  Why do we need NoSQL / HBase? –  Overview of HBase & HBase data model –  HBase Architecture and data flow •  How to get started –  Demo/Lab using HBase Shell using MapR Sandbox •  Design considerations when migrating from RDBMS to HBase •  Developing Applications using HBase Java API –  Demo/Lab to perform CRUD operations using put, get, scan, delete –  How to work around transactions
  • 4. ® © 2014 MapR Technologies 4 Why do we need NoSQL / HBase? Relational database model Relational Data is typed and structured before stored: –  Entities map to tables, normalized –  Structured Query Language •  Joins tables to bring back data
  • 5. ® © 2014 MapR Technologies 5 Why do we need NoSQL / HBase? Relational Model •  Pros –  Standard persistence model •  standard language for data manipulation –  Transactions handle concurrency , consistency –  efficient and robust structure for storing data
  • 6. ® © 2014 MapR Technologies 6 What changed to bring on NoSQL? Big data •  Cons of the Relational Model: –  Does not scale horizontally: •  Sharding is difficult to manage •  Distributed join, transactions do not scale across shards Horizonal scale : partition or shard tables across cluster Distributed Joins, Transactions are Expensive bottleneck
  • 7. ® © 2014 MapR Technologies 7 Key Value Store l  Couchbase l  Riak l  Citrusleaf l  Redis l  BerkeleyDB l  Membrain l  ... Document l  MongoDB l  CouchDB l  RavenDB l  Couchbase l  ... Graph l  OrientDB l  DEX l  Neo4j l  GraphBase l  ...Wide Column l  HBase l  MapR-DB l  Hypertable l  Cassandra l  ... NoSQL Landscape
  • 8. ® © 2014 MapR Technologies 8© 2014 MapR Technologies ® Hbase designed for Distribution, Scale, Speed For Tutorial only: Send the PDF earlier to all attendees to help setup the laptop
  • 9. ® © 2014 MapR Technologies 9 HBase is a Distributed Database Data is automatically distributed across the cluster. •  Table is indexed by row key •  Key range is used for horizontal partitioning •  Table splits happen automatically as the data grows Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Put, Get by Key
  • 10. ® © 2014 MapR Technologies 10 HBase is a ColumnFamily oriented Database Data is accessed and stored together by RowKey Similar Data is grouped & stored in Column Families –  share common properties: •  Number of versions •  Time to Live (TTL) •  Compression [lz4, lzf, Zlib] •  In memory option … CF1 colA colB colC Val val val CF2 colA colB colC val val val RowKey axxx gxxx Customer id Customer Address data Customer order data
  • 11. ® © 2014 MapR Technologies 11 HBase designed for Distribution •  Distributed data stored and accessed together: –  Key range is used for horizontal partitioning •  Pros –  scalable handles data volume and velocity –  Fast Writes and Reads by Key •  Cons –  No joins natively (tools like Drill help) –  Need to know how data will be queried in advance to do good schema design to achieve best performance Key Range axxx kxxx Key Range axxx kxxx Key Range axxx kxxx
  • 12. ® © 2014 MapR Technologies 12 HBase Data Model Row Keys: identify the rows in an HBase table Columns are grouped into column families Row Key CF1 CF2 … colA colB colC colA colB colC colD R1 axxx val val val val … gxxx val val val val R2 hxxx val val val val val val val … jxxx val R3 kxxx val val val val … rxxx val val val val val val … sxxx val val
  • 13. ® © 2014 MapR Technologies 13 HBase Data Storage - Cells •  Data is stored in Key value format •  Value for each cell is specified by complete coordinates: •  (Row key, ColumnFamily, Column Qualifier, timestamp ) => Value –  RowKey:CF:Col:Version:Value –  smithj:data:city:1391813876369:nashville Cell Coordinates= Key Row key Column Family Column Qualifier Timestamp Value Smithj data city 1391813876369 nashville Column Key Value
  • 14. ® © 2014 MapR Technologies 14 Logical Data Model vs Physical Data Storage •  Data is stored in Key Value format •  Key Value is stored for each Cell •  Column families data are stored in separate files RowK ey CF1 CF2 colA colB colA colC ra 1 2 rxxxx rxxx Logical Model Row Key CF1:Col version value ra cf1:cola 1 1 row Key CF2:Col version value ra cf2:cola 1 2 Physical Storage Key Value Key Value Physical Storage
  • 15. ® © 2014 MapR Technologies 15 Sparse Data with Cell Versions CF1:colA CF1:colB CF1:colC Row1 Row10 Row11 Row2 @time1: value1 @time5: value2 @time7: value3 @time2: value1 @time3: value1 @time4: value1 @time2: value1 @time4: value1 @time6: value2
  • 16. ® © 2014 MapR Technologies 16 Versioned Data •  Version –  each put, delete adds new cell, new version –  A long •  by default the current time in milliseconds if no version specified –  Last 3 versions are stored by default •  Configurable by column family –  You can delete specific cell versions –  When a cell exceeds the maximum number of versions, the extra records are removed Key CF1:Col version value ra cf1:cola 3 3 ra cf1:cola 2 2 ra cf1:cola 1 1
  • 17. ® © 2014 MapR Technologies 17 Table Physical View Physically data is stored per Column family as a sorted map •  Ordered by row key, column qualifier in ascending order •  Ordered by timestamp in descending order Row key Column qualifier Cell value Timestamp (long) Row1 CF1:colA value3 time7 Row1 CF1:colA value2 time5 Row1 CF1:colA value1 time1 Row10 CF1:colA value1 time4 Row 10 CF1:colB value1 time4 Sorted by Row key and Column Sorted in descending order
  • 18. ® © 2014 MapR Technologies 18 Logical Data Model vs Physical Data Storage Key CF1:Col version value ra cf1:ca 1 1 rb cf1:cb 2 4 rb cf1:cb 1 3 rc cf1:ca 1 5 Row Key CF1 CF2 ca cb ca cd ra 1 2 rb 3,4 rc 5 6,7 8 Key CF2:Col version value ra cf2:ca 1 2 rc cf2:ca 2 7 rc cf2:ca 1 6 rc cf2:cd 1 8 Physical Storage Logical Model Column families are stored separately Row keys, Qualifiers are sorted lexicographically Key Value Key Value
  • 19. ® © 2014 MapR Technologies 19 HBase Table is a Sorted map of maps SortedMap<Key, Value> Table Map of Rows Map of CF Map of columns Map of cells Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 SortedMap<RowKey, SortedMap< ColumnFamily, SortedMap< ColumnName, SortedMap < version, Value> >>>
  • 20. ® © 2014 MapR Technologies 20 HBase Table SortedMap<Key, Value> <ra,<cf1, <ca, <v1, 1>> <cf2, <ca, <v1, 2>>> <rb,<cf1, <cb, <v2, 4> <v1, 3>>> <rc,<cf1, <ca, <v1, 5>> <cf2, <ca, <v2, 7>> <ca, <v1, 6>> <cd, <v1, 8>>> Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8
  • 21. ® © 2014 MapR Technologies 21 Basic Table Operations •  Create Table, define Column Families before data is imported –  but not the rows keys or number/names of columns •  Low level API, technically more demanding •  Basic data access operations (CRUD): put Inserts data into rows (both create and update) get Accesses data from one row scan Accesses data from a range of rows delete Delete a row or a range of rows or columns
  • 22. ® © 2014 MapR Technologies 22© 2014 MapR Technologies ® HBase Architecture Data flow for Writes, Reads Designed to Scale
  • 23. ® © 2014 MapR Technologies 23 What is a Region? •  Tables are partitioned into key ranges (regions) •  Region servers serve data for reads and writes –  For the range of keys it is responsible for Region Server Client Region Region HMaster zookeepe rzookeeperzookeeper Region Server Region Region Get, Put Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 24. ® © 2014 MapR Technologies 24 Region Server Components •  WAL: write ahead log on disk (commit log), Used for recovery •  BlockCache: Read Cache, Least Recently Used evicted •  MemStore: Write Cache, sorted keyValue updates. •  Hfile=sorted KeyValues on disk Region Server HDFS Data Node BlockCache memstore HFile memstore HFile Region Region HFileHFile memstore memstore WAL
  • 25. ® © 2014 MapR Technologies 25 HBase Write Steps Put each incoming record written to WAL for durability: •  log on disk •  updates appended sequentially HDFS Data Node memstore memstore Region Server Region WAL
  • 26. ® © 2014 MapR Technologies 26 HBase Write Steps Put Next updates are written to the Memstore: •  write cache •  in-memory •  sorted list of KeyValue updates HDFS Data Node memstore memstore Region Server Region WAL Ack Updates quickly sorted in memory are available to queries after put returns
  • 27. ® © 2014 MapR Technologies 27 HBase Memstore •  in-memory •  sorted list of Key → Value •  One per column family •  Updates quickly sorted in memory memstore memstore Region Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 Key Value Key Value
  • 28. ® © 2014 MapR Technologies 28 HBase Region Flush When 1 Memstore is full: •  all memstores in region flushed to new Hfiles on disk •  Hfile: sorted list of key → values On disk HDFS Data Node memstore memstore Region Server Region WAL HFile HFile FLUSH HFileHFile
  • 29. ® © 2014 MapR Technologies 29 HBase HFile •  On disk sorted list of key → values •  One per column family •  Flushed quickly to file •  Sequential write HDFS Data Node HFile HFile Sequential write Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 Key Value Key Value
  • 30. ® © 2014 MapR Technologies 30 HBase HFile Structure •  Memstore flushes to an Hfile Key-value pairs are stored in increasing order Index points to row keys location B+-tree: leaf index , root index Key A Value … … … Key P Value ….. …. … Key T Value ….. …. … Key z Value ….. …. … Leaf Index Leaf Index Leaf Index Root Index Interm Index Bloom Trailer Bloom Bloom Bloom Leaf Index d Root Index Intermediate Index d l z Data block Key a Value . . . Key d Value Leaf Index l Leaf Index z Data block Key e Value . . . Key l Value Data block Key m Value . . . Key z Value
  • 31. ® © 2014 MapR Technologies 31 HBase Read Merge from Memory and Files •  MemStore creates multiple small store files over time when flushing. •  When a get/scan comes in, multiple files have to be examined HDFS Data Node BlockCache memstore Region Server Region WAL HFile HFile HFile scanner read Get or Scan searches for Row Cell KeyValues: 1.  Block Cache ((Memory) 2.  Memstore (Memory) 3.  Load HFiles from Disk into Block Cache based on indexes and bloomfilters 1 2 3
  • 32. ® © 2014 MapR Technologies 32 HDFS Data Node HBase Compaction •  minor compaction: •  merges files into fewer larger ones. •  Major compaction: •  merge all Hfiles into one per column family. •  remove cells marked for deletion Region Server Region Region memstore memstore WAL minor compaction HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile updates HFile major compaction Flush to disk
  • 33. ® © 2014 MapR Technologies 33 HBase Background: Log-Structured Merge Trees •  Traditional Databases use B+ trees: –  expensive to update •  HBase: Log Structured Merge Trees –  Sequential writes •  Writes go to memory And WAL •  Sorted memstore flushes to disk –  Sequential Reads •  From memory, index, sorted disk Index Log Index Memory Disk Write Read predictable disk seeks
  • 34. ® © 2014 MapR Technologies 34 Data Model for Fast Writes, Reads •  Predictable disk lay out •  Minimize disk seek •  Get, Put by row key: fast access •  Scan by row key range: stored sorted, efficient sequential access for key range Region1 Key Range ra rx Region Region Region Server Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Get key Scan start key, Stop key Minimize disk seek
  • 35. ® © 2014 MapR Technologies 35 Region = contiguous keys •  Regions fundamental partitioning/sharding object. •  By default, on table creation 1 region is created that holds the entire key range. •  When region becomes too large, splits into two child regions. •  Typical region size is a few GB, sometimes even 10G or 20G Region CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range axxx gxxx Region
  • 36. ® © 2014 MapR Technologies 36 Region Split •  The RegionServer splits a region •  daughter regions –  each with ½ of the regions keys. –  opened in parallel on same server •  reports the split to the Master Region 1 Region 2 Region Server 1 Key colB colC val val val Key colB colC val val val Region Server 1 Region 1 Key Range axxx kxxx Key Range Lxxx zxxx Key colB colC val val val
  • 37. ® © 2014 MapR Technologies 37© 2014 MapR Technologies ® HBase Use Cases
  • 38. ® © 2014 MapR Technologies 38 3 Main Use Case Categories •  Capturing Incremental data --Time Series Data –  Hi Volume, Velocity Writes •  Information Exchange, Messaging –  Hi Volume, Velocity Write/Read •  Content Serving, Web Application Backend –  Hi Volume, Velocity Reads
  • 39. ® © 2014 MapR Technologies 39 3 Main Use Case Categories •  Time Series Data, Stuff with a Time Stamp –  Sensor, System Metrics, Events, log files –  Stock Ticker, User Activity –  Hi Volume, Velocity Writes HBase Put App Server App Serverread Put Put Put Event time stamped data sensor OpenTSDB Data for real-time monitoring.
  • 40. ® © 2014 MapR Technologies 40 HBase Messages read Put App Server read App Server read App Server Put Put Put App Server 3 Main Use Case Categories •  Information Exchange –  email, Chat, Inbox: Facebook –  Hi Volume, Velocity Write/Read https://www.facebook.com/UsingHbase
  • 41. ® © 2014 MapR Technologies 41 3 Main Use Case Categories •  Content Serving, Web Application Backend –  Online Catalog: Gap, World Library Catalog. –  Search Index: ebay –  Online Pre-Computed View: Groupon, Pinterest –  Hi Volume, Velocity Reads Hbase Processed data read App Server read App Server read App Server Bulk Import Pre-Computed Materialized View
  • 42. ® © 2014 MapR Technologies 42 Agenda •  Why do we need NoSQL / HBase? •  Overview of HBase & HBase data model •  HBase Architecture and data flow •  Demo/Lab using HBase Shell –  Create tables and CRUD operations using MapR Sandbox •  Design considerations when migrating from RDBMS to HBase •  HBase Java API to perform CRUD operations –  Demo / Lab using Eclipse, HBase Java API & MapR Sandbox •  How to work around transactions
  • 43. ® © 2014 MapR Technologies 43 Hands-On-Labs Info •  Install a one-node MapR Sandbox on your laptop •  Install and configure Eclipse to develop HBase applications using Java API •  MapR Client is optional http://goo.gl/cNZ8RH
  • 44. ® © 2014 MapR Technologies 44 What is MapR Sandbox and how to use it •  MapR Sandbox is a fully functional single-node Hadoop cluster running on a virtual machine Host computer (your laptop) Vmware or VirtualBox MapR Sandbox MapR Client cmd window Eclipse cmd You can directly login and use terminal window to run Hadoop commands Browser: MCS, Hue
  • 45. ® © 2014 MapR Technologies 45 Software components to install •  Here are all the software components you need to install to get MapR Sandbox working on Windows 1. JDK 7 2. MVMware player (or) Oracle VirtualBox (prerequisite) 3. MapR Sandbox 4. MapR Client (optional) 5. Eclipse (for Java developers only) •  HBaseTutorialInstallGuide.pdf will cover how to download, install and configure all these components on your laptop
  • 46. ® © 2014 MapR Technologies 46 Cluster for those that have not installed MapR Sandbox •  CLDB Nodes: https://ec2-54-176-71-123.us-west-1.compute.amazonaws.com:8443 https://ec2-54-176-33-167.us-west-1.compute.amazonaws.com:8443 •  Login nodes: Rows 1 & 2: 54.176.71.123 ip-10-197-54-142 Rows 3 & 4: 54.176.33.167 ip-10-199-46-212 Rows 5 & 6: 54.193.226.196 ip-10-198-76-134 Rows 7 & 8: 54.177.2.5 ip-10-197-8-213 Last Rows: 54.193.140.172 ip-10-198-79-15 Username/passwd: user02 … user50 mapr
  • 47. ® © 2014 MapR Technologies 47 Lab Exercise See Hbase_Tutorial_Lab.pdf for all labs Start MapR Sandbox and log into the cluster [user: mapr, passwd: mapr] Use the HBase shell >Hbase shell hbase> help hbase> create ’/user/mapr/mytable’, {NAME =>’cf1’} hbase> put ’/user/mapr/mytable’, ’row1’, ’cf1:col1’, ‘datacf1c1v1’ hbase> get ’/user/mapr/mytable’, ’row1’ hbase> scan ’/user/mapr/mytable’ hbase> describe ’/user/mapr/mytable’
  • 48. ® © 2014 MapR Technologies 48 ® © 2014 MapR Technologies Schema Design Guidelines •  HBase tables ≠ Relational tables! •  HBase Design for Access Paterns
  • 49. ® © 2014 MapR Technologies 49 Use Case Example: Record Stock Trade Information in a Table •  Trade data: Trade •  timestamp •  stock symbol •  price per share •  volume of trade •  Example –  1381396363000 (epoch timestamp with millisecond granularity) –  AMZN –  $304.66 –  1333 shares
  • 50. ® © 2014 MapR Technologies 50 Intelligent keys •  Only the row keys are indexed (elastic search & Solr can help) •  Compose the key with attributes used for searching –  Composite key : 2 or more identifying attributes –  Like multi-column index design in RDB Cell Coordinates (Key) Granularity Row key Column Family Column Name Timestamp Value Restrict disk I/O Restrict network traffic
  • 51. ® © 2014 MapR Technologies 51 Composite Keys Use composite rowkey: attributes used for searching •  Include multiple elements in the rowkey –  Use a separator or fixed length •  Example rowkey format: –  Ex: GOOG_20131012 •  Get operations require complete row key. •  Scans can use partial keys. –  Ex: “GOOG” or "GOOG_2014" SYMBOL + DATE (YYYYMMDD)
  • 52. ® © 2014 MapR Technologies 52 Consider Access Patterns for Application •  By date? By hour? By companyId? –  Rowkey design •  What if the Date/Timestamp is leftmost ? How will data be retrieved? Key 1391813876369_AMZN 1391813876370_AMZN 1391813876371_GOOG
  • 53. ® © 2014 MapR Technologies 53 Hot-Spotting and Region Splits •  If rowkeys are written in sequential order then writes go to only one server –  Split when full 1900 1950 … 1999 Region Server 1 Key colB col C 1900 val val 1999 Region 1Key Range 1900 1999 Sequential key, like a timestamp File Server 1
  • 54. ® © 2014 MapR Technologies 54 Hot-Spotting and Region Splits •  Regions split as the table grows. –  RegionServer Creates two new regions, each with half of the original regions keys. •  Sequential writes will go to new region 2040 2050 2000 Region Server 1 File Server 1Key colB col C 1900 val val 1999 Region 1Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2Key Range 1950 2050
  • 55. ® © 2014 MapR Technologies 55 Hot-Spotting and Region Splits 3040 3000 Sequential writes will go to new region Region Server 1 File Server 1Key colB col C 1900 val val 1999 Region 1Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2Key Range 1950 3050
  • 56. ® © 2014 MapR Technologies 56 Hot-Spotting and Region Splits 3045 Regions split as the table grows. Sequential writes will go to new region Region Server 1 File Server 1Key colB col C 1900 val val 1999 Region 1Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2Key Range 1950 2050 Key colB col C 1900 val val 1999 Region 3Key Range 2051 3050 3041 3050
  • 57. ® © 2014 MapR Technologies 57 Random keys Key Range a23148 3d1a5f e0e9b4 Key Range Key Range Key Range Key Range Key Range MD5 Hash rowkey Random writes will go to different regions If table was pre-split or big enough to have split d = MessageDigest.getInstance("MD5"); byte[] prefix = d.digest(Bytes.toBytes(s));
  • 58. ® © 2014 MapR Technologies 58 Sequential vs. Random keys Random is better for writing , but sequential is better for scanning row keys Writes Sequential Reads Sequential Keys Performance Salted Keys Promoted Field Keys Random Keys
  • 59. ® © 2014 MapR Technologies 59 Prefix, Promote a field key Key Range Key Range Key Range Key Range Key Range Key Range amzn_1999 amzn_2003 amzn_2005 cisc_1998 cisc_2002 cisc_2010 goog_1990 goog_2020 goog_2030
  • 60. ® © 2014 MapR Technologies 60 Prefix with a Hashed field key Key Range a23148_2003 1d1a5f_1999 e0e9b4_2000 Key Range Key Range Key Range Key Range Key Range MD5 Hash prefix rowkey prefix the rowkey with a (shortened) hash: byte[] hash = d.digest(Bytes.toBytes(fieldkey)); Bytes.putBytes(rowkey, 0, hash, 0, length); g0e8b4_2004 b33148_2006 3d1a5f_2007
  • 61. ® © 2014 MapR Technologies 61 Consider Access Patterns for Application •  Which trade data needs fastest access (or most frequent)? –  Rowkey ordering •  What if you want to retrieve the stocks by symbol and date? •  Scan by row key prefix Increasing time: PREFIX_TIMESTAMP •  What if you usually want to retrieve the most recent? How will data be retrieved? Key AMZN_1391813876369 AMZN_1391813876370 GOOG_1391813876371 SYMBOL + timestamp
  • 62. ® © 2014 MapR Technologies 62 Last In First Out Access: Use Reverse-Timestamp •  Row keys are sorted in increasing order •  For fast access to most-recent writes: –  design composite rowkey with reverse-timestamp that decreases over time. –  Scan by row key prefix Decreasing: [MAXTIME–TIMESTAMP] •  Ex: Long.MAX_VALUE-date.getTime() SYMBOL + Reverse timestamp Key AMZN_98618600666 AMZN_98618600777 GOOG_98618608888
  • 63. ® © 2014 MapR Technologies 63 Consider Access Patterns for Application •  What are the needs for atomicity of transactions? –  Column design –  More Values in a single row •  Works well to get or update multiple values How will data be retrieved?
  • 64. ® © 2014 MapR Technologies 64 Rowkey design influences shape of Tables: Tall or Flat Tall Narrow Flat Wide
  • 65. ® © 2014 MapR Technologies 65 Tall Table for Stock Trades Rowkey format: Ex: AMZN_98618600888 SYMBOL + Reverse timestamp rowkey CF: CF1 CF1:price CF1:vol … … … AMZN_98618600666 12.34 2000 AMZN_98618600777 12.41 50 AMZN_98618600888 12.37 10000 … … … CSCO_98618600777 23.01 1000 …
  • 66. ® © 2014 MapR Technologies 66 Consider Access Patterns for Application •  Are Price and Volume data typically accessed together, or are they unrelated? –  Column family structure •  Column Families –  group data that will be read and stored together –  Can set attributes: •  # Min/Max versions, compression, in-memory, Time-To-Live •  Columns –  Column names are dynamic, not pre-defined –  every row does not need to have same columns How will data be retrieved?
  • 67. ® © 2014 MapR Technologies 67 Wide Table for Stock Trades rowkey CF price CF vol p:10 p:1000 … p:2000 v:10 v:1000 … v:2000 AMZN_986186006 12.37 13 12.34 10000 2000 … CSCO_986186070 23.01 1000 Rowkey format: Ex: AMZN_20131020 SYMBOL + Reverse timestamp rounded to the hour •  Separate price & volume data into column families •  Segregate time into buckets: –  Time rounded to the hour in the rowkey –  Time in column name represents seconds since the timestamp in the key –  One row stores a bucket of measurements for the hour
  • 68. ® © 2014 MapR Technologies 68 Wide Table for Stock Trades rowkey CF1 CF stats CF1:10 CF1:1000 … Day Hi … Day LO AMZN_20131020 {p:12.37,v: 1000} {p: 12.37,v: 1000} … CSCO_20131020 Rowkey format: Ex: AMZN_20131020 SYMBOL + Reverse timestamp rounded to the day •  Segregate time into buckets: –  Time rounded to the day in the rowkey –  Time in column name represents seconds since the timestamp in the key –  One row stores a bucket of measurements for the day
  • 69. ® © 2014 MapR Technologies 69 Lesson:  Schemas  can  be  very   flexible  and  can  even   change  on  the  fly   Column names can be dynamic, every row does not need to have same columns
  • 70. ® © 2014 MapR Technologies 70 Consider Access Patterns for Application •  Do all trades need to be saved forever? –  TTL Time to Live , CF can be set to expire cells •  How many Versions? –  Max Versions –  You can have many versions of data in a cell, default is 3 How will data be retrieved?
  • 71. ® © 2014 MapR Technologies 71 Wide Table for Stock Trades rowkey CF price CF vol CF stats price:00 … price:23 vol:00 … vol:23 Day Hi Day Lo AMZN_20131020 12.37 12.34 10000 2000 … CSCO_20130817 23.01 1000 Rowkey format: Ex: AMZN_20131020 SYMBOL + date YYYYMMDD •  Separate price & volume data into column families •  Segregate time into buckets: –  Date in the rowkey –  Hour in the column name –  Set Column Family to store Max Versions, timestamp in the version
  • 72. ® © 2014 MapR Technologies 72 Flat-Wide Vs. Tall-Narrow Tables •  Tall-Narrow provides better query granularity –  Finer grained Row Key –  Works well with scan •  Flat-Wide supports built-in row atomicity –  More Values in a single row •  Works well to update multiple values (row atomicity) •  Works well to get multiple associated values
  • 73. ® © 2014 MapR Technologies 73 Lesson:  Have  to  know  the   queries  to  design  in   performance  
  • 74. ® © 2014 MapR Technologies 74 Comparing Relational schema to HBase •  HBase is lower-level than relational tables –  Design is different •  Relational design –  Data centric, focus on entities and relations –  Query, joins •  New views of data from different tables easily created –  Does not scale across cluster •  HBase is designed for clustering: –  Distributed data is stored and accessed together –  Query centric, focus on how the data is read –  Design for the questions Key Range axxx kxxx Key Range xxx xxx Key Range xxx zxxx
  • 75. ® © 2014 MapR Technologies 75 Design of HBase Table •  De-normalize data –  Databases intended for online transaction processing (OLTP) are typically normalized –  Databases intended for online analytical processing (OLAP) are primarily "read mostly" databases and denormalized so that you do not require to have joins – permits fast lookups
  • 76. ® © 2014 MapR Technologies 76 l  Relational Databases are typically normalized l  Goal Normalization: –  eliminate redundant data –  Put repeating information in its own table Normalized database : –  Causes joins •  data has to be retrieved from more tables. •  queries can take more time Normalization
  • 77. ® © 2014 MapR Technologies 77 HBase Nested Entity •  A one-to-many relationship can be modeled as a single row –  Embedded, Nested Entity •  Order one-to-many with Line Items –  Row key: parent id •  OrderId –  column name : child id stored •  line Item id OrderId Data:date item:id1 item:id2 item:id3 123 20131010 $10 $20 $9.45
  • 78. ® © 2014 MapR Technologies 78 De-Normalization Order and Items in same table De-normalization: –  store data about an entity and related entities in the same table. •  Reads are faster across a cluster –  retrieve data about entity and related entities in one read OrderId Data:date item:id1 item:id2 item:id3 123 20131010 $10 $20 $9.45
  • 79. ® © 2014 MapR Technologies 79 Many to Many Relationship RDBMS user id (primary key) name alias email book id (primary key) title description user_book_rating id (primary key) userId (foreign key) bookId (foreign key) rating 1 ∞ 1∞ Online book store •  Querys •  Get name for user x •  Get title for book x •  Get books and corresponding ratings for userId x •  Get all userids and corresponding ratings for book y
  • 80. ® © 2014 MapR Technologies 80 Many to Many Relationship HBase User table Column family rating, bookid is column name Key data:fname … rating:bookid1 rating:bookid2 userid1 5 4 Key data:title … rating:userid1 rating:userid2 bookid1 5 4 Book table Column family rating, userid is column name •  Queries •  Get books and corresponding ratings for userId x •  Get all userids and corresponding ratings for book y
  • 81. ® © 2014 MapR Technologies 81 Generic data: Event, Attributes, Values •  Event Id, Event name-value pairs , schema-less patientXYZ-ts1, Temperature , "102" patientXYZ-ts1, Coughing, "True" patientXYY-ts2, Heart Rate, "98" •  This is the advantage of HBase –  Define columns on the fly, •  put attribute name in column qualifier Key event:heartrate event:coughing event:temperature Patientxyz-ts1 98 true 102 Event type name=qualifierEvent id=row key Event measurement=value
  • 82. ® © 2014 MapR Technologies 82 Self Join Relationship HBase •  Example Twitter •  User_x follows User_y •  User_y followed by User_z •  Querys –  Get all users who Carol follows –  Get all users following Carol Key data:timestamp Carol:follows:SteveJobs Carol:followedby:BillyBob
  • 83. ® © 2014 MapR Technologies 83 Hierarchical Data •  tree like structure •  Use a flat-wide –  Parents, children in columns usa FLTN Nashville Miami Key P:USA P:TN p:FL c:TN C:FL C:Nashvl C:Miami USA state state TN country city FL country city Nashville state miami state
  • 84. ® © 2014 MapR Technologies 84 Inheritance mapping •  Online Store Example Product table –  type is in row key for searching –  Columns are not the same for different types Key price title details model Bok+id1 10 Hbase blah Dvd+Id2 15 stones blah Kin+Id3 100 blah fire Product book dvd kindle
  • 85. ® © 2014 MapR Technologies 85 Agenda •  Why do we need NoSQL / HBase? •  Overview of HBase & HBase data model •  HBase Architecture and data flow •  Demo/Lab using HBase Shell –  Create tables and CRUD operations using MapR Sandbox •  Design considerations when migrating from RDBMS to HBase •  HBase Java API to perform CRUD operations –  Demo / Lab using Eclipse, HBase Java API & MapR Sandbox •  How to work around transactions
  • 86. ® © 2014 MapR Technologies 86 © MapR Technologies, confidential ® HBase Java API fundamentals to perform CRUD operations
  • 87. ® © 2014 MapR Technologies 87 Shoppingcart Application Requirements •  Need to create Tables: Shoppingcart & Inventory •  Perform CRUD operations on these tables –  Create, Read, Update, and Delete items from these tables
  • 88. ® © 2014 MapR Technologies 88 Inventory & Shoppingcart Tables Perform checkout operation for Mike Inventory Table Shoppingcart Table quantity Pens 10 Notepads 21 Erasers 10 Pencils 40 pens notepads erasers Mike 1 2 3 John 3 4 5 Mary 1 2 5 Adam 5 4 0 CF “items"CF “stock "
  • 89. ® © 2014 MapR Technologies 89 Java API Fundamentals •  CRUD operations –  Get, Put, Delete, Scan, checkAndPut, checkAndDelete, Increment –  KeyValue, Result, Scan – ResultScanner, –  Batch Operations
  • 90. ® © 2014 MapR Technologies 90 CRUD Operations Follow A Pattern (mostly) •  common pattern –  Instantiate object for an operation: Put put = new Put(key) –  Add attributes to specify what to insert: put.add(…) –  invoke operation with HTable: myTable.put(put) // Insert value1 into rowKey in columnFamily:columnName1 Put put = new Put(rowKey); put.add(columnFamily, columnName1, value1); myTable.put(put);
  • 91. ® © 2014 MapR Technologies 91 erasers notepads pens Mike 3 2 1 CF “items" Shoppingcart Table Shopping Cart Table Key CF:COL ts value Mike items:erasers 1391813876369 3 Mike items:notepads 1391813876369 2 Mike items:pens 1391813876369 1 Physical Storage
  • 92. ® © 2014 MapR Technologies 92 Put Operation adding multiple column values to a row byte [] tableName = Bytes.toBytes("/path/Shopping"); byte [] itemsCF = Bytes.toBytes(“items"); byte [] penCol = Bytes.toBytes (“pens”); byte [] noteCol = Bytes.toBytes (“notes”); byte [] eraserCol = Bytes.toBytes (“erasers”); HTableInterface table = new HTable(hbaseConfig, tableName); Put put = new Put(“Mike”); put.add(itemsCF, penCol, Bytes.toBytes(l)); put.add(itemsCF, noteCol, Bytes.toBytes(2)); put.add(itemsCF, eraserCol, Bytes.toBytes(3)); table.put(put); Key CF:COL ts value Mike items:erasers 1391813876369 3 Mike items:notepads 1391813876369 2 Mike items:pens 1391813876369 1
  • 93. ® © 2014 MapR Technologies 93 Get Example byte [] tableName = Bytes.toBytes("/user/user01/shoppingcart"); byte [] itemsCF = Bytes.toBytes(“items"); byte [] penCol = Bytes.toBytes (“pens”); HTableInterface table = new HTable(hbaseConfig, tableName); Get get = new Get(“Mike”); get.addColumn(itemsCF, penCol); Result result = myTable.get(get); byte[] val = result.getValue(itemsCF, penCol); System.out.println("Value: " + Bytes.toLong(val)); //prints 1 Key CF:COL ts value Mike items:erasers 1391813876369 3 Mike items:notepads 1391813876369 2 Mike items:pens 1391813876369 1
  • 94. ® © 2014 MapR Technologies 94 Result Class •  A Result instance wraps data from a row returned from a get or a scan operation. Result wraps KeyValues •  Result toString() looks like this : keyvalues={Adam/items:erasers/1391813876369/Put/vlen=8/ts=0, Adam/items:notepads/1391813876369/Put/vlen=8/ts=0, Adam/items:pens/1391813876369/Put/vlen=8/ts=0} •  The Result object provides methods to return values byte[] b = result.getValue(columnFamilyName,columnName1); Items:erasers Items:notepads Items:pens Adam 0 4 5 http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Result.html
  • 95. ® © 2014 MapR Technologies 95 KeyValue – The Fundamental HBase Type •  A KeyValue instance is a cell instance –  Contains Key (cell coordinates) and the Value (data) •  Cell coordinates: Row key, Column family, Column qualifier, Timestamp •  KeyValue toString() looks like this : Adam/items:erasers/1391813876369/Put/vlen=8/ Key =Cell Coordinates Row key Column Family Column Qualifier Timestamp Value Value Adam items erasers 1391813876369 0
  • 96. ® © 2014 MapR Technologies 96 Bytes class http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/util/Bytes.html •  org.apache.hadoop.hbase.util.Bytes •  Provides methods to convert Java types to and from byte[] arrays •  Support for –  String, boolean, short, int, long, double, and float byte[] bytesTable = Bytes.toBytes("Shopping"); String table = Bytes.toString(bytesTable); byte[] amountBytes = Bytes.toBytes(1000l); long amount = Bytes.toLong(amount);
  • 97. ® © 2014 MapR Technologies 97 Scan Operation – Example byte[] startRow=Bytes.toBytes(“Adam”); byte[] stopRow=Bytes.toBytes(“N”); Scan s = new Scan(startRow, stopRow); scan.addFamily(columnFamily); ResultScanner rs = myTable.getScanner(s);
  • 98. ® © 2014 MapR Technologies 98 ResultScanner - Example Resultscanner provides iterator-like functionality Scan scan = new Scan(); scan.addFamily(columnFamily); ResultScanner scanner = myTable.getScanner(scan); try { for (Result res : scanner) { System.out.println(res); } } catch (Exception e) { System.out.println(e); } finally { scanner.close(); } Calls scanner.next() Always put in finally block
  • 99. ® © 2014 MapR Technologies 99 Lab Exercise Program Structure •  ShoppingCartApp– main class •  InventoryDAO – A DAO for the Inventory CRUD functionality •  ShoppingCartDAO – A DAO for the Inventory CRUD functionality •  Inventory – A Java object that holds data for a single Inventory row •  ShoppingCart – A Java object that holds data for a single Inventory row •  MockHtable – in memory test hbase table, allows to run code, debug without hbase running on a cluster or vm.
  • 100. ® © 2014 MapR Technologies 100 Lab Exercise •  See lab PDF –  Import the project “lab-exercises-shopping” into Eclipse –  Setup creates Inventory and Shoppingcart Tables and inserts data –  Use Get, Put, Scan, and Delete operations
  • 101. ® © 2014 MapR Technologies 101 Lab: Import, build •  Download the code •  Import Maven project lab-exercises-shopping into Eclipse •  Build : Run As -> Maven Install
  • 102. ® © 2014 MapR Technologies 102 Lab: run TestInventorySetup JUnit •  Select Test Class, Then Run As -> JUnit Test •  Uses MockHTable https://gist.github.com/agaoglu/613217
  • 103. ® © 2014 MapR Technologies 103© 2014 MapR Technologies ® Shoppingcart Checkout functionality
  • 104. ® © 2014 MapR Technologies 104 Inventory & Shoppingcart Tables How can we implement Checkout functionality? quantity Pens 10 Notepads 21 Erasers 10 Pencils 40 CF “stock" Inventory Table pens notepads erasers Mike 0 0 0 John 3 4 5 Mary 1 2 5 Adam 5 4 0 CF “items" Shoppingcart Table
  • 105. ® © 2014 MapR Technologies 105 Inventory & Shoppingcart Tables after checkAndPut for Mike quantity Mike … Pens 10 9 1 Notepads 21 19 2 Erasers 10 7 3 Pencils 50 CF “stock" Inventory Table pens notepads erasers Mike 1 2 3 John 3 4 5 Mary 1 2 5 Adam 5 4 0 CF “items" Shoppingcart Table
  • 106. ® © 2014 MapR Technologies 106 checkAndPut Method quantity pens 10 CF “stock" Inventory Table before quantity Mike pens 9 1 CF “stock" Inventory Table after boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) myTable.checkAndPut(“pens”, CF, COL, 10, put); Check a Value Put a row update
  • 107. ® © 2014 MapR Technologies 107 checkAndPut Method – Atomic Put with Check •  Atomic operation guarded by a check –  Check is executed first then if success put operation is executed –  Similar to java compareAndSet(expectedValue, updateValue) •  Relies on checking and modifying the same row –  Atomicity guaranteed on single row •  Method signature: boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException •  Use for concurrent transactions, multiple clients updating
  • 108. ® © 2014 MapR Technologies 108 checkAndPut method byte [] rowPens = Bytes.toBytes (“pens”); byte [] CF = Bytes.toBytes(“stock"); byte [] COL = Bytes.toBytes (“quantity”); byte [] oldquantity= Bytes.toBytes(10); byte [] amount= Bytes.toBytes(1); byte [] newquantity= Bytes.toBytes(9); Put put = new Put(rowPens); put.add(CF, COL, newquantity); put.add(CF, Bytes.toBytes(“Mike"), amount); boolean ret = myTable.checkAndPut(rowPens, CF,COL, oldquantity, put); System.out.println("Put succeeded: " + ret); Best practice
  • 109. ® © 2014 MapR Technologies 109 What are the problems with this implementation? •  Conflicts: What happens when two checkout operations read the same inventory value and try to change it? •  One fails •  Application developer has to take this into consideration and in such a case, get the latest value and try checkAndPut again … •  Not efficient in a large scale system as this can happen quite a lot … •  Is there a simpler way to do this in HBase? •  Yes, Use Counters – Increment
  • 110. ® © 2014 MapR Technologies 110 Counters – Overview •  Counters provide an atomic increment of a number •  Common use cases –  Clicks, Page hits, views •  Two types of counters: –  single column counter with HTable method –  multiple columns counter with an Increment object. key stats: clicks com.example/home 1000 A counter is a long column value
  • 111. ® © 2014 MapR Technologies 111 Single Column Counter •  HTable incrementColumnValue method –  Atomically increments a single column value –  Provides atomic read and modify –  Easy to use public long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount) cf: qualifier row amount
  • 112. ® © 2014 MapR Technologies 112 HTable incrementColumnValue long returnVal = myTable.incrementColumnValue( rowA, stats, clicks, // counter column name 42L); key stats: clicks rowA 1000 1042
  • 113. ® © 2014 MapR Technologies 113 Counters – Possible Values Value Description Greater than zero Increase the counter by the given value Zero Retrieve the current value of the counter Less than zero Decreases the counter by the given value None Increase the counter by 1 §  Effect  based  on  provided  values:   §  You  must  use  a  long   §  Don’t  need  to  ini6alize  (zero  by  default)  
  • 114. ® © 2014 MapR Technologies 114 Multiple Column Counter- Increment object Increment increment1 = new Increment(rowKey); increment1.addColumn(CF, qualifier1, amount1); increment1.addColumn(CF, qualifier1, amount1); Result result1 = table.increment(increment1); Create Increment object with Row Key Add columns to Increment Call table increment
  • 115. ® © 2014 MapR Technologies 115 Inventory Table Increment   [CF] stock     quantity   Mike   pens   13 12   1   Increment increment1 = new Increment(pens); increment1.addColumn(stock, quantity, ?? ); increment1.addColumn(stock, Mike, ? ); Result result1 = table.increment(increment1); -1 1
  • 116. ® © 2014 MapR Technologies 116 Summary •  Why NoSQL and why Hbase •  Try these hands-on-labs on •  MapR Sandbox or •  MapR Developer Release (M3) •  Design considerations when migrating from RDBMS to Hbase •  De-normalize data •  Column Families & Columns •  Versions •  RowKey design •  Transactions using checkAndPut and Increment
  • 117. ® © 2014 MapR Technologies 117
  • 118. ® © 2014 MapR Technologies 118 References •  http://hbase.apache.org/ •  http://hbase.apache.org/0.94/apidocs/ •  http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/ package-summary.html •  http://hbase.apache.org/book/book.html •  http://doc.mapr.com/display/MapR/MapR+Overview •  http://doc.mapr.com/display/MapR/M7+-+Native+Storage+for+MapR+Tables •  http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop •  http://doc.mapr.com/display/MapR/Migrating+Between+Apache+HBase +Tables+and+MapR+Tables
  • 119. ® © 2014 MapR Technologies 119 HBase Challenges Reliability •  Compactions disrupt operations •  Very slow crash recovery Business continuity •  Common hardware/software issues cause downtime •  Administration requires downtime •  No point-in-time recovery •  Complex backup process Performance •  Many bottlenecks result in low throughput •  Limited data locality •  Limited # of tables Manageability •  Compactions, splits and merges must be done manually (in reality) •  Basic operations like backup or table rename are complex
  • 120. ® © 2014 MapR Technologies 120 Fast NoSQL with Zero Administration Benefit Features High Performance Over 1 million ops/sec with 10 node cluster Continuous Low Latency No I/O storms, no compactions 24x7 Applications Instant recovery, online schema modification, snapshots, mirroring Zero Administration No processes to manage, automated splits, self-tuning High Scalability 1 trillion tables, billions of rows, millions of columns Low TCO Files and tables on one platform, more work with fewer nodes Performance Reliability Easy Administration MapR-DB
  • 121. ® © 2014 MapR Technologies 121 MapR-DB: Fast, Integrated NoSQL Operational Analytics Resilience/Business Continuity Performance/Scale Hadoop-enabled architecture •  Operational/architectural simplicity, analytics on live data •  Hadoop-scale (petabytes) Proven production readiness •  Enterprise-grade HA/DR •  Consistent snapshots •  Integrated security Consistent performance at any scale •  High throughput (100M data points per second ingestion) •  Consistent low latency, no compaction delays, even at 95th and 99th percentiles (< 10ms in YCSB)
  • 122. ® © 2014 MapR Technologies 122 Mapr-DB Performance: Consistent, Low Latency --- M7 Read Latency --- Others Read Latency
  • 123. ® © 2014 MapR Technologies 123 MapR Editions MapR Community Edition MapR Enterprise Edition MapR Enterprise Database Edition MapR-DB: Fastest in-Hadoop database (No high availability(except Yarn), snapshot, mirroring) Everything in Community Edition, except MapR-DB plus… Everything in Enterprise Edition plus…. Easy system-monitoring & management with MapR Control System 99.999% high availability & self healing MapR-DB Fastest in-Hadoop database with enterprise HA features [e.g. Mirroring, snapshot] Real-time data flows with direct access NFS Data protection with snapshots & mirroring Consistent low latency with no compactions World record performance Multi-tenancy & job placement control Zero database administration Instant recovery, snapshots and mirroring for tables MapR Makes Its NoSQL Database Freely Available in Community Edition
  • 124. ® © 2014 MapR Technologies 124 Licensing Community Edition Enterprise Edition Enterprise Database Edition M3 M5 M7 Support ✖ Exceptions (competitive price pressure) ✔ ✔ MapR-DB ✔ (4.0.1+) Fastest and most reliable in- Hadoop database, and an “M7 development option”… ✖ ✔ MapR-DB with all the enterprise capabilities HBase support ✖ Exceptions only (competitive price pressure) $$$ $$$
  • 125. ® © 2014 MapR Technologies 125 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies