The document discusses HBase, an open-source, non-relational, distributed database modeled after Google's Bigtable. It outlines some limitations of relational databases that HBase addresses like scaling to large datasets and high write throughput. Key aspects of HBase covered include its column-oriented design, data model organized by row keys and column families, and architecture involving a master node, Zookeeper, and region servers. Common uses of HBase and how its schema differs from relational databases are also summarized.
3. Why are we here?
Something about RDBMS
Limitations of RDBMS
Why HBase or any NoSQL solution
Overview of HBase
Specific use cases
Paradigm shift in schema design
Architecture of HBase
HBase interface – Java API, Thrift
Conclusion
6. Data sets growing into petabytes
RDBMSs don't scale inherently
Scale up / scale out (load balancing + replication)
Hard to shard / partition
High read throughput and high write throughput are not achievable together
Transactional / analytical databases
Specialized hardware ... is very expensive
Oracle clustering
8. Master
[Diagram labels: master, writes, reads, slave nodes]
MySQL master becomes a problem
All slaves must have the same write capacity as the master
Single point of failure, no easy failover
14. HBase is
Not a SQL database
No joins, no query engine, no datatypes, no SQL
No schema
Denormalized data
Wide and sparsely populated data structure (key-value)
No DBA needed
15. Bigness
Big data, big number of users, big number of computers
Massive write performance
Facebook handles 135 billion messages a month
Twitter stores 7 TB of data per day
Fast key-value access
Write availability
No Single point of failure
16. Specific use cases
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc.
Real-time inserts, updates, and queries
Fraud detection by comparing transactions to known patterns in real time
Analytics: use MapReduce, Hive, or Pig to perform analytical queries
17. Column-oriented database
Tables are sorted by row key
Table schema defines only column families
A column family can have any number of columns
Each cell value has a timestamp
21. A BIG SORTED MAP
Row key + column key + timestamp => value
Student table (sorted by row key and column key)
Row Key   Column Key   Timestamp       Value
1         info:name    1273516197868   Gaurav
1         info:age     1273871824184   28
1         info:age     1273871823022   34
1         info:sex     1273746281432   Male
2         info:name    1273863723227   Harsh
3         info:name    1273822456433   Raman
In a column key such as info:age, "info" is the column family and "age" is the column qualifier (name)
The two info:age entries for row 1 are 2 versions of the same cell
Timestamp is a long value
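The "big sorted map" idea can be sketched in a few lines of plain Python. This is a toy model, not the HBase API: the names `cells`, `sort_key`, `scan`, and `get` are hypothetical, and it assumes versions of a cell are ordered newest first. Note that under this ordering info:age sorts ahead of info:name.

```python
# Toy model of the logical data model: one sorted map from
# (row key, column key, timestamp) to value. Not the HBase API.

cells = {
    ("1", "info:name", 1273516197868): "Gaurav",
    ("1", "info:age", 1273871824184): "28",
    ("1", "info:age", 1273871823022): "34",
    ("1", "info:sex", 1273746281432): "Male",
    ("2", "info:name", 1273863723227): "Harsh",
    ("3", "info:name", 1273822456433): "Raman",
}


def sort_key(cell_key):
    row, column, timestamp = cell_key
    # Row and column ascending; timestamp descending so the newest
    # version of a cell comes first (an assumption of this sketch).
    return (row, column, -timestamp)


def scan(cells):
    """Return all (key, value) pairs in sorted-map order."""
    return [(key, cells[key]) for key in sorted(cells, key=sort_key)]


def get(cells, row, column, max_versions=1):
    """Return up to max_versions values of one cell, newest first."""
    versions = [v for (r, c, _), v in scan(cells) if r == row and c == column]
    return versions[:max_versions]
```

Calling `get(cells, "1", "info:age")` returns only the newest version; passing `max_versions=2` returns both stored versions.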
22. Example of a Student and Subject
Student table: id (PK), name, age, sex
Subject table: id (PK), title, introduction, teacher_id
Student-Subject table: student_id, subject_id, type
Student and Subject are in a many-to-many (m:n) relationship
23. RDBMS
Example of a Student and Subject
Student table
key   name     age   sex
1     Gaurav   28    Male
Subject table
id   title   introduction    teacher_id
1    HBase   HBase is cool   10
Student-Subject table
student_id   subject_id   type
1            1            elective
24. HBase
Student-Subject schema - HBase
Student table
Row Key      Column Family   Column Keys
student_id   info            name, age, sex
student_id   subjects        subject IDs as qualifiers (keys)
Subject table
Row Key      Column Family   Column Keys
subject_id   info            title, introduction, teacher_id
subject_id   students        student IDs as qualifiers (keys)
25. HBase
Student-Subject schema - HBase
Student table
key   info               subjects
1     info:name=Gaurav   subjects:1="elective"
      info:age=28        subjects:2="main"
      info:sex=Male
Subject table
key   info                              students
1     info:title=HBase                  students:1
      info:introduction=HBase is cool   students:2
      info:teacher_id=10
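The denormalized layout above can be mimicked with nested dictionaries (row key, then column family, then qualifier). This is only a sketch in plain Python, not the HBase client API; the helper names `subjects_of`, `students_of`, and `enroll` are hypothetical, and the empty string stored under each students: qualifier is an assumption, since the slide shows no value there.

```python
# Row key -> column family -> qualifier -> value (toy model only).
student_table = {
    "1": {
        "info": {"name": "Gaurav", "age": "28", "sex": "Male"},
        "subjects": {"1": "elective", "2": "main"},
    },
}

subject_table = {
    "1": {
        "info": {"title": "HBase", "introduction": "HBase is cool",
                 "teacher_id": "10"},
        # Assumption: the qualifier alone carries the information.
        "students": {"1": "", "2": ""},
    },
}


def subjects_of(student_id):
    """Subject IDs and enrollment types for one student, with no join."""
    return dict(student_table[student_id]["subjects"])


def students_of(subject_id):
    """Student IDs enrolled in one subject, read from a single row."""
    return sorted(subject_table[subject_id]["students"])


def enroll(student_id, subject_id, kind):
    """Write both sides of the relationship, since there is no join table."""
    student = student_table.setdefault(student_id, {"info": {}, "subjects": {}})
    subject = subject_table.setdefault(subject_id, {"info": {}, "students": {}})
    student["subjects"][subject_id] = kind
    subject["students"][student_id] = ""
```

The trade-off this illustrates: each lookup touches one row in one table, but every enrollment has to be written twice to keep the two tables consistent.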
27. Region: contiguous set of lexicographically sorted rows
hbase.hregion.max.filesize (default: 256 MB)
Regions are hosted by region servers
Each table is partitioned into regions
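Because each region covers a contiguous, sorted key range, finding the region for a row key amounts to a binary search over region start keys. The sketch below assumes two regions with made-up server names; `regions`, `start_keys`, and `locate` are hypothetical names, and this is not how the HBase client is actually invoked.

```python
from bisect import bisect_right

# (start key, hosting region server); both values are hypothetical.
# The first region starts at the empty key, so it covers every row
# key that sorts before "row201".
regions = [
    ("", "regionserver-a"),
    ("row201", "regionserver-b"),
]
start_keys = [start for start, _ in regions]


def locate(row_key):
    """Return the (start key, server) pair of the region holding row_key."""
    index = bisect_right(start_keys, row_key) - 1
    return regions[index]
```

Keys compare lexicographically, not numerically, so `locate("row1000")` lands in the first region: the character "1" sorts before "2".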
28. Regions and
[Figure: one region spans row1 to row200 and another spans row201 to row500; a new row arrives]
29. Regions and
[Figure: regions now span row1 to row200, row201 to row350, and row351 to row501]
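The two figures appear to show a region being divided as rows are added. A minimal sketch of that behavior follows, under loud assumptions: the split is triggered by row count (`MAX_ROWS_PER_REGION`, a hypothetical stand-in for the size limit set by hbase.hregion.max.filesize) and happens at the middle key. The `Table` class is illustrative only and is not HBase code.

```python
from bisect import bisect_right, insort

# Hypothetical threshold; HBase itself splits on store file size.
MAX_ROWS_PER_REGION = 4


class Table:
    def __init__(self):
        # Each region is a sorted list of row keys; regions are kept
        # in key order, and the table starts with one empty region.
        self.regions = [[]]

    def _region_index(self, row_key):
        # The first region starts at the empty key; every later region
        # starts at its smallest row key.
        starts = [""] + [region[0] for region in self.regions[1:]]
        return bisect_right(starts, row_key) - 1

    def put(self, row_key):
        index = self._region_index(row_key)
        region = self.regions[index]
        if row_key not in region:
            insort(region, row_key)
        if len(region) > MAX_ROWS_PER_REGION:
            # Split the oversized region into two halves at the middle key.
            middle = len(region) // 2
            self.regions[index:index + 1] = [region[:middle], region[middle:]]
```

Inserting five keys into an empty table leaves two regions; later writes are routed to whichever region's key range contains them.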