Amandeep Khurana, a solutions architect at Cloudera, gave a presentation titled "Building Applications with HBase" at the Big Data TechCon conference in Boston in April 2013. The presentation introduced HBase, an open-source, distributed, big data store modeled after Google's Bigtable. It covered how to set up HBase locally and interact with it using the shell and Java API, and provided examples of building a Twitter clone application called TwitBase using the HBase data model.
Strategies for Landing an Oracle DBA Job as a Fresher
Building apps with HBase - Big Data TechCon Boston
1. Building
applica2ons
with
HBase Goes
Here
Headline
Amandeep
Khurana
|
Solu7ons
AHere
Speaker
Name
or
Subhead
Goes
rchitect
Big
Data
TechCon,
Boston,
April
2013
1
Friday, April 12, 13
2. About
me
• Solu2ons
Architect,
Cloudera
Inc
• Amazon
Web
Services
• Interested
in
large
scale
distributed
systems
• Co-‐author,
HBase
In
Ac2on
• TwiHer:
amansk
Nick Dimiduk
Amandeep Khurana
MANNING
2
Friday, April 12, 13
3. About
the
talk
• Setup
local
HBase
environment
• Simple
client
• Deeper
dive
on
API
• Data
model
3
Friday, April 12, 13
4. What
is
HBase?
• Scalable,
distributed
data
store
• Open
source
avatar
of
Google’s
Bigtable
• Sorted
map
of
maps
• Sparse
• Mul2
dimensional
• Tightly
integrated
with
Hadoop
• Not
a
RDBMS
4
Friday, April 12, 13
5. So,
what’s
the
big
deal?
• Give
up
system
complexity
and
sophis2cated
features
for
ease
of
scale
and
manageability
• Simple
system
!=
easy
to
build
applica2ons
on
top
• Goal:
predictable
read
and
write
performance
at
scale
• Effec2ve
API
usage
• Op2mal
schema
design
• Understand
internals
and
leverage
at
scale
5
Friday, April 12, 13
6. Use
cases
• Canonical
use
case:
storing
crawl
data
and
indices
for
search
Web Search
powered by Bigtable
Crawlers
Internets Crawlers Bigtable
Crawlers
Crawlers
1
2
MapReduce
Indexing the Internet
1 Crawlers constantly scour the Internet for new pages.
Those pages are stored as individual records in Bigtable. 4 3
2 A MapReduce job runs over the entire table, generating Search
search indexes for the Web Search application. Search
Index
Search Web Search
Index
Index
5 You
Searching the Internet
3 The user initiates a Web Search request.
4 The Web Search application queries the Search Indexes
and retries matching documents directly from Bigtable.
5 Search results are presented to the user.
6
Friday, April 12, 13
7. Use
cases
(con2nued)
• Capturing
incremental
trickle
data
• Content
serving
/
OLTP
• Blob
store
7
Friday, April 12, 13
8. Installing
HBase
• Get
it
here
-‐>
hHp://hbase.apache.org/
• Downloads
-‐>
Choose
a
mirror
-‐>
Choose
a
release
• Download
the
tarball
• Untar
it
and
run
it
8
Friday, April 12, 13
9. Basics
• Modes
of
opera2on
• Standalone
• Pseudo-‐distributed
• Fully
distributed
• Start
up
in
standalone
mode
• cd
hbase-‐0.94.1
• bin/start-‐hbase.sh
• Logs
are
in
./logs/
• Check
out
hHp://localhost:60010
in
your
browser
9
Friday, April 12, 13
11. Shell
• Basic
interac2on
tool
• WriHen
in
JRuby
• Uses
the
Java
client
library
• Good
place
to
start
poking
around
11
Friday, April 12, 13
12. Shell
(con2nued)
• bin/hbase
shell
• Create
table
• create
‘users’
,
‘info’
• List
tables
• list
• Describe
table
• describe
‘users’
12
Friday, April 12, 13
13. Shell
(con2nued)
• Put
a
row
• put
‘users’
,
‘u1’,
‘info:name’
,
‘Mr.
Rogers’
• Get
a
row
• get
‘users’
,
‘u1’
• Put
more
• put
‘users’
,
‘u1’
,
‘info:email’
,
‘mail@rogers.com’
• put
‘users’
,
‘u1’
,
‘info:password’
,
‘foobar’
• Get
a
row
• get
‘users’
,
‘u1’
• Scan
table
• scan
‘users’
13
Friday, April 12, 13
15. Java
API
...
can’t
really
build
an
applica7on
just
with
the
shell
15
Friday, April 12, 13
16. Important
terms
• Table
• Consists
of
rows
and
columns
• Row
• Has
a
bunch
of
columns.
• Iden2fied
by
a
rowkey
(primary
key)
• Column
Qualifier
• Dynamic
column
name
• Column
Family
• Column
groups
-‐
logical
and
physical
(Similar
access
paHern)
• Cell
• The
actual
element
that
contains
the
data
for
a
row-‐column
intersec2on
• Version
• Every
cell
has
mul2ple
versions.
16
Friday, April 12, 13
17. Introducing
TwitBase
• TwiHer
clone
built
on
HBase
• Designed
to
scale
• It’s
an
example,
not
a
produc2on
app
• Start
with
users
and
twits
for
now
• Users
• Have
profiles
• Post
twits
• Twits
• Have
text
posted
by
users
• Get
sample
code
at:
hHps://github.com/hbaseinac2on/twitbase
17
Friday, April 12, 13
18. Java
API
• Configura)on
-‐
holds
config
informa2on
about
the
cluster.
Roughly
equivalent
to
JDBC
connec2on
string
• HConnec)on
-‐
represents
connec2on
to
a
cluster
• HBaseAdmin
-‐
admin
tool
to
handle
DDL
opera2ons
• HTablePool
-‐
pool
of
connec2ons
to
cluster
• HTable
(HTableInterface)
-‐
handle
on
a
single
HBase
table.
Sends
commands
to
table
18
Friday, April 12, 13
19. Connec2ng
to
HBase
• Using default confs:
HTableInterface usersTable = new HTable("users");
• Passing custom conf:
Configuration myConf = HBaseConfiguration.create();
myConf.set("parameter_name", "parameter_value");
HTableInterface usersTable = new HTable(myConf, "users");
• Connection pooling
HTablePool pool = new HTablePool();
HTableInterface usersTable = pool.getTable("users");
// work with the table
usersTable.close();
19
Friday, April 12, 13
20. Inser2ng
data
• Put
call
on
HTable
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(Bytes.toBytes("info"),
Bytes.toBytes("name"),
Bytes.toBytes("Mark Twain"));
p.add(Bytes.toBytes("info"),
Bytes.toBytes("email"),
Bytes.toBytes("samuel@clemens.org"));
p.add(Bytes.toBytes("info"),
Bytes.toBytes("password"),
Bytes.toBytes("Langhorne"));
usersTable.put(p);
20
Friday, April 12, 13
21. Data
coordinates
• Row
is
addressed
using
rowkey
• Cell
is
addressed
using
[rowkey
+
family
+
qualifier]
21
Friday, April 12, 13
22. Upda2ng
data
•Update
==
Write
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(Bytes.toBytes("info"),
Bytes.toBytes("password"),
Bytes.toBytes("abc123"));
usersTable.put(p);
22
Friday, April 12, 13
23. Write
path
WAL
Write goes to the write-ahead log (WAL)
Memstore as well as the in memory write buffer called
the Memstore.
Client HFile Client's don't interact directly with the
underlying HFiles during writes.
HFile
Memstore
New HFile Memstore gets flushed to disk to
form a new HFile
when filled up with writes.
HFile
HFile
23
Friday, April 12, 13
24. Reading
data
• Get
the
en2re
row
Get g = new Get(Bytes.toBytes("TheRealMT"));
Result r = usersTable.get(g);
• Get
selected
columns
Get g = new Get(Bytes.toBytes("TheRealMT"));
g.addColumn(Bytes.toBytes("info"),
Bytes.toBytes("password"));
Result r = usersTable.get(g);
• Extract
the
value
out
of
the
Result
object
byte[] b = r.getValue(Bytes.toBytes("info"),
Bytes.toBytes("email"));
String email = Bytes.toString(b);
24
Friday, April 12, 13
25. Reading
data
(con2nued)
• Scan
Scan s = new Scan(startRow, stopRow);
ResultsScanner rs = twits.getScanner(s);
for(Result r : rs) {
// iterate over scanner results
}
• Single threaded iteration over defined set of rows. Could
be the entire table too.
25
Friday, April 12, 13
26. What
happens
when
you
read
data?
BlockCache
Memstore
Client HFile
HFile
26
Friday, April 12, 13
27. Dele2ng
data
• Another
simple
API
call
• Delete
en2re
row
Delete d = new Delete(Bytes.toBytes("TheRealMT"));
usersTable.delete(d);
• Delete
selected
columns
Delete d = new Delete(Bytes.toBytes("TheRealMT"));
d.deleteColumns(Bytes.toBytes("info"),
Bytes.toBytes("email"));
usersTable.delete(d);
27
Friday, April 12, 13
28. Delete
internals
• Tombstone
markers
• Dele2on
only
happens
during
major
compac2ons
• Compac2ons
• Minor
• Major
28
Friday, April 12, 13
29. Compac2ons
Memstore Memstore
HFile Two or more HFiles get combined
Compacted to form a single HFile during
HFile a compaction.
HFile
HFile HFile
29
Friday, April 12, 13
30. DAO
• Like
any
other
applica2on
• Makes
data
access
management
easy
• Cleaner
code
• Table
info
at
a
single
place
• Op2miza2ons
like
connec2on
pooling
etc
• Easier
management
of
database
configs
30
Friday, April 12, 13
31. Admin
HBaseAdmin Table
Opera2ons
• Administra2ve
API
to
the
• createTable()
cluster.
• deleteTable()
• Provides
an
interface
to
• addColumn()
manage
HBase
database
tables
metadata
and
• modifyColumn()
general
administra2ve
• deleteColumn()
func2ons.
• listTables()
Friday, April 12, 13
33. More
API
features
• Filters
• Compare
and
Swap
• Coprocessors
33
Friday, April 12, 13
34. Data
Model
It’s
not
a
rela7onal
database
system
34
Friday, April 12, 13
35. HBase
is
...
• Column
family
oriented
database
• Column
family
oriented
• Tables
consis2ng
of
rows
and
columns
• Persisted
Map
• Sparse
• Mul2
dimensional
• Sorted
• Indexed
by
rowkey,
column
and
2mestamp
• Key
Value
store
• [rowkey,
col
family,
col
qualifier,
2mestamp]
-‐>
cell
value
35
Friday, April 12, 13
36. HBase
is
not
...
• A
rela2onal
database
• No
SQL
query
language
• No
joins
• No
secondary
indexing
• No
transac2ons
36
Friday, April 12, 13
37. Tabular
representa2on
2 Column Family - Info
1 3
Rowkey name email password
GrandpaD Mark Twain samuel@clemens.org abc123
HMS_Surprise Patrick O'Brien aubrey@sea.com abc123
The table is lexicographically
sorted on the rowkeys
SirDoyle Fyodor Dostoyevsky fyodor@brothers.net abc123
TheRealMT Sir Arthur Conan Doyle art@TheQueensMen.co.uk Langhorne
abc123
4
Cells ts1=1329088321289 ts2=1329088818321
Each cell has multiple
The coordinates used to identify data in an HBase table are: versions,
(1) rowkey, (2) column family, (3) column qualifier, (4) version typically represented by
the timestamp
of when they were
inserted into the table
(ts2>ts1)
37
Friday, April 12, 13
38. Key-‐Value
store
Keys Values
[TheRealMT, info, password, 1329088818321] abc123
[TheRealMT, info, password, 1329088321289] Langhorne
A single KeyValue instance
38
Friday, April 12, 13
39. Key-‐Value
store
[TheRealMT, info, password, 1329088818321] abc123
1 Start with coordinates of full precision
{
1329088818321 : "abc123",
[TheRealMT, info, password]
1329088321289 : "Langhorne"
}
2 Drop version and you're left with a map of version to values
Keys {
"email" : {
1329088321289 : "samuel@clemens.org"
},
"name" : {
[TheRealMT, info] 1329088321289 : "Mark Twain"
},
Values
"password" : {
1329088818321 : "abc123",
1329088321289 : "Langhorne"
}
}
3 Omit qualifier and you have a map of qualifiers to the previous maps
{
"info" : {
"email" : {
1329088321289 : "samuel@clemens.org"
},
"name" : {
1329088321289 : "Mark Twain"
[TheRealMT]
},
"password" : {
1329088818321 : "abc123",
1329088321289 : "Langhorne"
}
}
}
4 Finally, drop the column family and you have a row, a map of maps
39
Friday, April 12, 13
41. HFiles
and
physical
data
model
• HFiles
are
• Immutable
• Sorted
on
rowkey
+
qualifier
+
2mestamp
• In
the
context
of
a
column
family
per
region
"TheRealMT" , "info" , "email" , 1329088321289, "samuel@clemens.org"
"TheRealMT" , "info" , "name" , 1329088321289 , "Mark Twain"
"TheRealMT" , "info" , "password" , 1329088818321 , "abc123",
"TheRealMT" , "info" , "password" , 1329088321289 , "Langhorne"
HFile for the info column family in the users table
41
Friday, April 12, 13
42. HBase
in
distributed
mode
...
what
really
goes
on
under
the
hood
42
Friday, April 12, 13
43. Distributed
mode
• Table
is
split
into
Regions
T1R1
00001 John 415-111-1234
00002 Paul 408-432-9922
00001 John 415-111-1234 00003 Ron 415-993-2124
00001 John 415-111-1234 00002 Paul 408-432-9922 00004 Bob 818-243-9988
00002 Paul 408-432-9922 00003 Ron 415-993-2124
00003 Ron 415-993-2124 00004 Bob 818-243-9988
00004 Bob 818-243-9988 T1R2
00005 Carly 206-221-9123 00005 Carly 206-221-9123 00005 Carly 206-221-9123
00006 Scott 818-231-2566 00006 Scott 818-231-2566 00006 Scott 818-231-2566
00007 Simon 425-112-9877 00007 Simon 425-112-9877 00007 Simon 425-112-9877
00008 Lucas 415-992-4432 00008 Lucas 415-992-4432 00008 Lucas 415-992-4432
00009 Steve 530-288-9832
00010 Kelly 916-992-1234 00009 Steve 530-288-9832
T1R3
00011 Betty 650-241-1192 00010 Kelly 916-992-1234
00012 Anne 206-294-1298 00011 Betty 650-241-1192 00009 Steve 530-288-9832
00012 Anne 206-294-1298 00010 Kelly 916-992-1234
00011 Betty 650-241-1192
00012 Anne 206-294-1298
Full table T1 Table T1 split into 3 regions - R1, R2 and R3
43
Friday, April 12, 13
44. Distributed
mode
(con2nued)
• Regions
are
served
by
RegionServers DataNode RegionServer
T1R1
00001
00002
00003
John
Paul
Ron
415-111-1234
408-432-9922
415-993-2124
00004 Bob 818-243-9988
• RegionServers
are
Host1
Java
processes,
T1R2
00005 Carly 206-221-9123
00006 Scott 818-231-2566
DataNode RegionServer 00007 Simon 425-112-9877
00008 Lucas 415-992-4432
typically
collocated
T1R3
00009
00010
00011
Steve
Kelly
Betty
530-288-9832
916-992-1234
650-241-1192
with
the
DataNodes
Host2 00012 Anne 206-294-1298
DataNode RegionServer
Host3
44
Friday, April 12, 13
45. Finding
your
region
• Two
special
tables:
-‐ROOT-‐
and
.META.
-ROOT- table, consisting of only 1 region
-ROOT-
hosted on region server RS1
RS1
.META. table, consisting of 3 regions,
M1 M2 M3
hosted on RS1, RS2 and RS3
RS2 RS3 RS1
User tables T1 and T2, consisting of
T1R1 T1R2 T2R1 T1R3 T2R2 T2R3 T2R4 4 and 3 regions respectively,
distributed amongst RS1, RS2 and RS3
RS3 RS1 RS2 RS3 RS2 RS1 RS1
45
Friday, April 12, 13
46. Regions
distributed
across
the
cluster
T1-R1
00001 John 415-111-1234
DataNode RegionServer 00002 Paul 408-432-9922
00003 Ron 415-993-2124
RegionServer 1 (RS1), hosting regions R1 of
00004 Bob 818-243-9988
the user table T1 and M2 of .META.
.META. - M2
T1:00009-T1:00012 R3 RS2
Host1
T1-R2
00005 Carly 206-221-9123
00006 Scott 818-231-2566
00007 Simon 425-112-9877
DataNode RegionServer 00008 Lucas 415-992-4432
T1-R3
RegionServer 2 (RS2), hosting regions R2,
00009 Steve 530-288-9832 R3 of the user table and M1 of .META.
00010 Kelly 916-992-1234
00011 Betty 650-241-1192
00012 Anne 206-294-1298
Host2
.META. - M1
T1:00001-T1:00004 R1 RS1
T1:00005-T1:00008 R2 RS2
DataNode RegionServer -ROOT-
M:T100001-M:T100009 M1 RS2
RegionServer 3 (RS3), hosting just -ROOT-
M:T100010-M:T100012 M2 RS1
Host3
46
Friday, April 12, 13
47. What
the
client
does
MyTable
.META.
TheRealMT
-ROOT-
ZooKeeper
47
Friday, April 12, 13
48. What
the
client
does
MyTable
.META.
TheRealMT
-ROOT-
ZooKeeper
47
Friday, April 12, 13
49. What
the
client
does
MyTable
.META.
TheRealMT
-ROOT-
ZooKeeper
Row per META region
47
Friday, April 12, 13
50. What
the
client
does
MyTable
.META.
TheRealMT
-ROOT-
ZooKeeper
Row per META region
Row per table region
47
Friday, April 12, 13
51. What
the
client
does
MyTable
.META.
TheRealMT
-ROOT-
ZooKeeper
Row per META region
Row per table region
47
Friday, April 12, 13