2. Agenda
• Course Credit
• One common web site story
• Why is RDB not affordable?
• Big Data
• Why use noSQL?
• HBase Introduction
• Hands-on
• noSQL architecture common practices
• Case study
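3. The story of a web site (1/3)
• An RDBMS is the natural choice for the persistence tier
• It handles transactions (ACID) for us, enforces integrity constraints, offers the standard SQL language, and even provides stored procedures
• The first time your user base keeps growing…
• Use an AP server cluster whose nodes share one DB server
• The second time your user base keeps growing…
• Split the DB server into a Master-Slave architecture
• Read data from the Slave servers
• Write data to the Master server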
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
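The Master-Slave split on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `ReplicatedDatabase`, the in-memory dicts, and the explicit `replicate()` step (a stand-in for asynchronous replication) are all made up for illustration and are not part of any real database product.

```python
import itertools


class ReplicatedDatabase:
    """Toy model (hypothetical): one master takes writes, slaves serve reads."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        # All writes go to the master only.
        self.master[key] = value

    def replicate(self):
        # Stand-in for asynchronous replication: copy master state to every slave.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key):
        # Reads are spread over the slaves in round-robin order.
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)


db = ReplicatedDatabase(slave_count=2)
db.write("user:1", "alice")
stale = db.read("user:1")   # slaves have not caught up yet
db.replicate()
fresh = db.read("user:1")   # visible after replication
print(stale, fresh)
```

The sketch also hints at the limits of this design: adding slaves spreads the read load, but every write still lands on the single master, and reads can be stale until replication catches up.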
6. Why is RDB not affordable? (1/6)
• Bottleneck of the relational DB
• The 1990s vs. recent years (Web 2.0)
• Memcached + MySQL
• Mitigates read stress effectively, but not write stress
• MySQL Cluster solution
• Master/Slave
• Not affordable for high-concurrency scenarios
• Vertical Partitioning
• Vertical/Horizontal Partitioning (database sharding)
• Complex
• Hard to scale out and to change requirements
• Low availability
• Some kinds of simple but very large data sets cause this condition
http://www.infoq.com/cn/news/2011/01/nosql-why
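One way to see why sharding is listed as "hard to scale out" is to count how many keys change shard when a node is added. The sketch below assumes simple hash-modulo placement, which is only one possible sharding scheme; the function `shard_for` and the `user:N` keys are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key (simple modulo sharding, an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = [f"user:{i}" for i in range(1000)]

# Placement with 4 shards, then after adding a fifth shard.
before = {key: shard_for(key, 4) for key in keys}
after = {key: shard_for(key, 5) for key in keys}

moved = sum(1 for key in keys if before[key] != after[key])
print(f"{moved} of {len(keys)} keys change shard when going from 4 to 5 shards")
```

With modulo placement, most keys end up on a different shard after the change, so adding one machine means migrating a large part of the data set.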
7. Why is RDB not affordable? (2/6) –
A general HA system architecture design
Software project quality, part 4: overall design, an architecture design case - http://takeshi-experience.blogspot.tw/2012/04/blog-post.html
8. Why is RDB not affordable? (3/6) –
Master/Slave
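14. The new darling of the big data era: noSQL
• Not only SQL
• Started in 2009
• Has the following characteristics
• Does not use the relational data model
• Distributed storage by design
• Easy to scale out horizontally
• Open source
• Easy to extend
• Simple API operations (CRUD, usually without SQL support)
• CAP (as opposed to ACID)
• Eventual Consistency, Availability, Partition Tolerance
• Stores huge amounts of heterogeneous data
http://nosql-database.org/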
15. Why use noSQL?
• Easy to scale out
• Unlike an RDB, there are no relationships between data sets, so it is easy to scale out
• High performance even with big data
• Table-level cache (RDB) vs. record-level cache (noSQL)
• Elastic data model
• Schema vs. schema-less/dynamic schema
• High availability
• Easy to add new machines (nodes) without any performance impact
16. Comparison between RDB and noSQL
Given a really huge amount of data…
Aspect             | RDB                      | noSQL
Performance        | Degrades                 | Stays the same as with a small data set
Scalability        | Mainly scale up          | Mainly scale out
Reliability        | ACID                     | CAP
Availability       | Hard to maintain SLA     | Easy to maintain SLA
Security           | Robust                   | Depends
Economics          | High-end machines        | Commodity machines
Data Model         | Relational, fixed schema | Depends, but most likely simple and schema-less
Maturity           | Very mature              | Not mature, various products
Commercial support | Global companies         | Small start-ups
OLAP/BI            | Mature                   | Immature
Human resources    | Easy to find             | Hard to find
18. Introduction to Apache HBase
• A top-level project of the ASF
• Belongs to the key-value category of noSQL DBs
• Originates from Google's paper
• Bigtable: A Distributed Storage System for Structured Data
• a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers
• a sparse, distributed, persistent multi-dimensional sorted map
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
20. Apache HBase Concepts –
Column-Oriented (2/2)
• a sparse, distributed, persistent multi-dimensional sorted map
• which is indexed by row key, column key (column family +
qualifiers), and a timestamp
Column Families
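The "sparse, multi-dimensional sorted map" description can be made concrete with a small Python model. This is a teaching sketch only, not how HBase stores data internally; the class `SparseSortedTable`, its method names, and the sample rows are hypothetical.

```python
import time


class SparseSortedTable:
    """Toy model of a map indexed by (row key, column key, timestamp)."""

    def __init__(self):
        # {row_key: {"family:qualifier": {timestamp: value}}}
        self._rows = {}

    def put(self, row, family, qualifier, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        column = f"{family}:{qualifier}"
        self._rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, family, qualifier, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(f"{family}:{qualifier}", {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        """Yield (row, column, newest value) in sorted row and column order."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                versions = self._rows[row][column]
                yield row, column, versions[max(versions)]


table = SparseSortedTable()
table.put("row2", "info", "name", "bob", timestamp=1)
table.put("row1", "info", "name", "alice", timestamp=1)
table.put("row1", "info", "name", "alicia", timestamp=2)
table.put("row1", "stats", "logins", "7", timestamp=1)  # row2 has no such cell: sparse

print(list(table.scan()))
print(table.get("row1", "info", "name", timestamp=1))
```

Rows come back in row-key order regardless of insertion order, a missing cell simply takes no space, and older versions of a cell stay reachable through their timestamp.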
22. Hands-on (1/3) –
Use your VM (Virtual Machine) to install tm-puppet
• Please refer to the SPN Dev HBase training program again
• Install git on your PC
• Install tm-puppet on your VM
23. Hands-on (2/3) –
Use HBase shell
• Basic operations
• help, list, scan
• Create
• A table 'MY_FIRST_TABLE'
• Two column families 'FAM_1', 'FAM_2'
• Ex.
• create 't1', {NAME => 'f1'}, {NAME => 'f2'}
• create 't1', 'f1', 'f2' (shorthand for the same table)
• Put two records (columns)
• Ex. put 't1', 'r1', 'f1:c1', 'value'
• Update a record (column) (this is also a put)
• Delete a record (column)
• delete 't1', 'r1', 'f1:c1'
24. Hands-on (3/3) –
Requirements
• Put a screenshot of your successfully installed tm-puppet into git
• Use the following commands
• jps
• ifconfig
• Capture a screenshot
• Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-001.jpg
• Put a screenshot of your HBase shell records into git
• Use the following commands
• scan 'MY_TEST_TABLE'
• ifconfig
• Capture a screenshot
• Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-002.jpg
• Commit and push to git
25. noSQL architecture practices (1/8) –
Use noSQL as complement
• Use noSQL as a mirror (implemented by code)
• The RDB remains the primary store, and noSQL acts as a mirror
NoSQL Architecture Practices (1): NoSQL as a complement - http://www.infoq.com/cn/news/2011/02/nosql-architecture-practice
26. noSQL architecture practices (2/8) –
Use noSQL as complement
//PSEUDO CODE for noSQL as a mirror
//We want to store the data Object
bool status = false;
DB.startTransaction(); //start transaction
id = DB.Insert(data); //write data Object to RDB
if (id > 0) {
    status = NoSQL.Add(id, data); //write data Object to noSQL by id
}
if (id > 0 && status == true) {
    DB.commit(); //commit transaction
} else {
    DB.rollback(); //failed, rollback transaction
}
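For readers who want to run the mirror pattern, here is a minimal Python sketch using the standard `sqlite3` module as the RDB. Everything beyond the pattern itself is an assumption made for illustration: the `InMemoryNoSQL` class stands in for a real noSQL client, and the `articles` table and its columns are invented.

```python
import sqlite3


class InMemoryNoSQL:
    """Hypothetical stand-in for a noSQL client; a real one would go over the network."""

    def __init__(self, fail=False):
        self.store = {}
        self.fail = fail

    def add(self, key, value):
        if self.fail:
            return False  # simulate a failed noSQL write
        self.store[key] = value
        return True


def save_mirrored(conn, nosql, data):
    """Insert into the RDB, mirror to noSQL, and commit only if both succeed."""
    try:
        cursor = conn.execute(
            "INSERT INTO articles (title, body) VALUES (?, ?)",
            (data["title"], data["body"]),
        )
        row_id = cursor.lastrowid
        if nosql.add(row_id, data):
            conn.commit()
            return row_id
    except sqlite3.Error:
        pass
    conn.rollback()  # RDB insert or noSQL write failed
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.commit()

ok_id = save_mirrored(conn, InMemoryNoSQL(), {"title": "t", "body": "b"})
failed_id = save_mirrored(conn, InMemoryNoSQL(fail=True), {"title": "x", "body": "y"})
row_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(ok_id, failed_id, row_count)
```

One gap remains in this pattern: if the noSQL write succeeds but the final commit fails, the mirror holds a record the RDB does not, so a real system would likely need a cleanup or reconciliation step.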
27. noSQL architecture practices (3/8) –
Use noSQL as complement
• Use noSQL as a mirror (implemented by synchronization)
29. noSQL architecture practices (5/8) –
Use noSQL as complement
//PSEUDO CODE for RDB & noSQL combination
//We want to store the data Object
data.title = "title";
data.name = "name";
data.time = "2009-12-01 10:10:01";
data.from = "1";
bool status = false;
DB.startTransaction(); //start transaction
//write into RDB; only data.from is stored because search criteria need it
id = DB.Insert("INSERT INTO table (from) VALUES (?)", data.from);
if (id > 0) {
    //write data Object to noSQL by id
    status = NoSQL.Add(id, data);
}
if (id > 0 && status == true) {
    DB.commit(); //commit transaction
} else {
    DB.rollback(); //failed, rollback transaction
}
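The pseudo code above covers only the write path. A plausible read path, sketched below in Python, searches the RDB for matching ids and then loads the full objects from noSQL. This is an assumption about how the combination would be queried, not something the slides spell out; the `messages` table, the plain dict used as the noSQL store, and the function names are hypothetical. Error handling is left out to keep the focus on the lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The RDB keeps only the id and the column needed by search criteria.
conn.execute('CREATE TABLE messages (id INTEGER PRIMARY KEY, "from" TEXT)')
nosql = {}  # hypothetical noSQL store: id -> full data object


def save(data):
    """Store the search key in the RDB and the whole object in noSQL."""
    cursor = conn.execute('INSERT INTO messages ("from") VALUES (?)', (data["from"],))
    nosql[cursor.lastrowid] = data
    conn.commit()
    return cursor.lastrowid


def find_by_sender(sender):
    """Search the RDB for matching ids, then load full objects from noSQL."""
    rows = conn.execute(
        'SELECT id FROM messages WHERE "from" = ? ORDER BY id', (sender,)
    ).fetchall()
    return [nosql[row_id] for (row_id,) in rows]


save({"title": "title", "name": "name", "time": "2009-12-01 10:10:01", "from": "1"})
save({"title": "other", "name": "name2", "time": "2009-12-02 08:00:00", "from": "2"})
found = find_by_sender("1")
print(found)
```

The RDB rows stay small because they hold only keys and searchable values, while the bulky fields live in the noSQL store.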
30. noSQL architecture practices (6/8) –
Use noSQL as complement
• Benefits of the RDB & noSQL combination practice
• Reduces RDB I/O and the storage space the RDB needs
• Raises the RDB table-level cache hit rate, because only updates to key values (PK, FK, values used in search criteria) refresh the cache
• Makes synchronization in an RDB Master/Slave architecture more efficient
• Makes RDB backup/recovery more efficient
• Improves the scalability and performance of the whole system
31. noSQL architecture practices (7/8) –
Use noSQL as master
• Use noSQL only
• Mainly for systems with simple query requirements
• But some noSQL products can handle more complex queries
• MongoDB, Tokyo Cabinet, etc.
NoSQL Architecture Practices (2): NoSQL as the master - http://www.infoq.com/cn/news/2011/03/nosql-architecture-practice-2
32. noSQL architecture practices (8/8) –
Use noSQL as master
• Use noSQL as the primary data source
• APs only write data into noSQL
• The data is then synchronized from noSQL to other data stores according to each application's needs
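The "write to noSQL, then synchronize elsewhere" flow can be sketched as a change log that is replayed into secondary stores. This is only one possible way to implement the synchronization and is an assumption; `PrimaryStore`, the two target stores, and the sample documents are hypothetical.

```python
from collections import deque


class PrimaryStore:
    """Hypothetical noSQL primary that records every write in a change log."""

    def __init__(self):
        self.data = {}
        self.change_log = deque()

    def put(self, key, value):
        self.data[key] = value
        self.change_log.append((key, value))


def synchronize(primary, targets):
    """Replay pending changes into each target; each target keeps what it needs."""
    while primary.change_log:
        key, value = primary.change_log.popleft()
        for apply_change in targets:
            apply_change(key, value)


search_index = {}   # e.g. a store used for lookups by title
report_counts = {}  # e.g. a store used for reporting


def update_search_index(key, value):
    search_index[value["title"]] = key


def update_report(key, value):
    author = value["author"]
    report_counts[author] = report_counts.get(author, 0) + 1


primary = PrimaryStore()
primary.put("doc:1", {"title": "hello", "author": "amy"})
primary.put("doc:2", {"title": "world", "author": "amy"})
synchronize(primary, [update_search_index, update_report])
print(search_index, report_counts)
```

The application servers only ever call `put` on the primary; each downstream store derives its own view from the same stream of changes.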
33. Case Study (1/4) –
Facebook’s Real-time Message System
• Use HBase to store 135+ billion messages a month
• Beat out competitors such as Cassandra and sharded MySQL
• Data Patterns
• A short set of temporal data that tends to be volatile
• An ever-growing set of data that rarely gets accessed
Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
34. Case Study (2/4) –
Facebook’s Real-time Message System
• Some key aspects of their system:
• HBase
• Has a simpler consistency model than Cassandra.
• Very good scalability and performance for their data patterns.
• Most feature rich for their requirements: auto load balancing and
failover, compression support, multiple shards per server, etc.
• HDFS, the filesystem used by HBase, supports replication, end-to-end
checksums, and automatic rebalancing.
• Facebook's operational teams have a lot of experience using HDFS
because Facebook is a big user of Hadoop and Hadoop uses HDFS as
its distributed file system.
35. Case Study (3/4) –
Facebook’s Real-time Message System
• Haystack is used to store attachments.
• A custom application server was written from scratch in order
to service the massive inflows of messages from many
different sources.
• A user discovery service was written on top of ZooKeeper.
• Infrastructure services are accessed for: email account
verification, friend relationships, privacy decisions, and
delivery decisions
• In keeping with their approach of small teams doing amazing things, 20 new infrastructure services are being released by 15 engineers in one year.
• Facebook is not going to standardize on a single database platform; they will use separate platforms for separate tasks.
36. Case Study (4/4) –
Alibaba China Site architecture
http://www.infoq.com/cn/presentations/hl-alibaba-cn-architecture-design-practice
38. Data Access pattern as the key
for noSQL
• Data Structure
• Structured
• Semi-structured
• Unstructured
• Size
• How many and how often writes/reads occur (proportion)
• Data Writing
• Transaction
• Data Reading
• Random access
• Sequential access
• Relationship