001 hbase introduction

HBase Introduction
Scott Miao
2012/06/25

Agenda
• Course Credit
• One common web site story
• Why RDB not affordable ?
• Big Data
• Why use noSQL ?
• HBase Introduction
• Hands-on
• noSQL architecture common practices
• Case study


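The story of one web site (1/3)
• An RDBMS is the natural choice for the persistence tier
• It handles transactions (ACID) for us, enforces integrity constraints, speaks standard SQL, and even offers stored procedures

• The first time your user count keeps growing…
• Use a cluster of AP servers that all share one DB server

• The second time your user count keeps growing…
• Split the DB server into a Master-Slave architecture
• Read data from the slave servers
• Write data to the master server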
HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1

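• The third time your user count keeps growing…
• For the read bottleneck
• Add a cache, for example Memcached (an in-memory DB), between the server program and the DB
• But the cache and the DB can easily end up holding inconsistent data
• For the write bottleneck
• Upgrade the DB server hardware (CPU, memory, disk, etc.; vertical scaling)
• Don't forget: the slave servers have to be upgraded along with it…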
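
The cache inconsistency mentioned above can be reproduced in a few lines. This is only a minimal sketch: two Python dicts stand in for the DB and for Memcached, and the function names are hypothetical, not part of any real client library.

```python
# Hypothetical stand-ins: one dict plays the DB, another plays Memcached.
db = {"user:1": "alice"}
cache = {}

def read(key):
    """Cache-aside read: serve from the cache, fall back to the DB on a miss."""
    if key not in cache:
        cache[key] = db[key]
    return cache[key]

def write_db_only(key, value):
    """A write path that forgets the cache and leaves a stale copy behind."""
    db[key] = value

def write_and_invalidate(key, value):
    """A write path that updates the DB and then drops the cached copy."""
    db[key] = value
    cache.pop(key, None)

first = read("user:1")                   # cache miss, loads the DB value
write_db_only("user:1", "bob")
stale = read("user:1")                   # served from the cache, DB has moved on
write_and_invalidate("user:1", "carol")
fresh = read("user:1")                   # cache miss again, reloads from the DB
```

Invalidating on write narrows the window for stale reads but does not close it when several servers write concurrently, which is why the slide calls inconsistency likely.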


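• The fourth time your user count keeps growing…
• Use database sharding
• Move from vertical scaling to horizontal scaling
• The management nightmare begins
• An RDBMS is inherently ill-suited to distributed storage (ACID, fixed schema)
• The DBA has to define a set of sharding rules
• When one of the DB servers goes down or runs out of storage, resharding has to be done by hand
• Resharding means readjusting the sharding rules and then copying and migrating data with heavy I/O, while either keeping the web site in normal service or limiting the outage to a fixed time window

• This is usually a last resort taken after the fact, and one of only a few options available
• Who knew my web site would become this popular?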
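
To make the resharding cost concrete, here is a minimal sketch of one possible sharding rule, a stable hash of the key modulo the number of DB servers. The rule, the key format, and the function names are assumptions chosen for illustration; real deployments pick their own rules.

```python
import zlib

def shard_for(key: str, num_shards: int) -> int:
    """Sharding rule: stable hash of the key modulo the number of DB servers."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def moved_keys(keys, old_shards: int, new_shards: int):
    """Keys whose home server changes when the number of servers changes."""
    return [k for k in keys
            if shard_for(k, old_shards) != shard_for(k, new_shards)]

keys = [f"user:{i}" for i in range(10_000)]
moved = moved_keys(keys, old_shards=4, new_shards=5)
fraction_moved = len(moved) / len(keys)
print(f"{fraction_moved:.0%} of keys must be copied to another server")
```

With a plain modulo rule, adding a single server relocates most keys, which is the bulk copy and migration work described above.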

Why RDB not affordable ? (1/6)
• Bottleneck of the relational DB
• The 90s vs. recent years (Web 2.0)

• Memcached + MySQL
• Relieves read load effectively, but not write load

• MySQL cluster solutions
• Master/Slave
• Cannot cope with high-concurrency scenarios
• Vertical Partitioning
• Vertical/Horizontal Partitioning (Database sharding)
• Complex
• Hard to scale out and to adapt to changing requirements
• Low availability
• Some types of simple but very large data cause this condition
http://www.infoq.com/cn/news/2011/01/nosql-why

Why RDB not affordable ? (2/6) –
A general HA system architecture design


Software Project Quality, Part 4: Architecture Design Cases in Overall Design ─ http://takeshi-experience.blogspot.tw/2012/04/blog-post.html

Master/Slave

Vertical Partitioning

Master/Slave + Vertical Partitioning

Vertical/Horizontal Partitioning

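• More data was produced in the past 3 years than in the previous 40,000 years!
• Walmart's data volume is 167 times that of the US Library of Congress!
• eBay's analytics platform processes up to 100 PB of data per day! (about 100,000,000 GB)
• As of 2010, the world's stored electronic data totaled 1.2 ZB! (1,200,000 PB)
• IDC forecasts that by 2020 the world's stored electronic data will have grown 44-fold from the 2009 level, reaching 35 trillion GB!
• 35,000,000,000,000 gigabytes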

Architect, October issue ─ http://www.infoq.com/cn/minibooks/architect-oct-10-2011

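Trend Micro’s problem
• Each person visits about 20 ~ 60 HTML pages per day
• Each HTML page contains about 15 ~ 30 URIs
• Each URI object is about 10 ~ 150 KB in size
• For one million users
• 1 million X 20 = 20 million HTML pages
• 20 million HTML pages X 15 = 300 million URIs
• 300 million URI objects X 10 KB = 3,000,000,000 KB (about 3 TB)
• All of the above is the data volume for the Taiwan region alone

• Trend Micro is a global company
• So its daily data volume is around several tens of TB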
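
The estimate above can be checked with a few lines of arithmetic. This sketch takes the lower bound of each range on the slide and assumes decimal units (1 TB = 10^9 KB); the variable names are illustrative.

```python
users = 1_000_000
pages_per_user = 20   # lower bound of the 20 ~ 60 range
uris_per_page = 15    # lower bound of the 15 ~ 30 range
kb_per_uri = 10       # lower bound of the 10 ~ 150 KB range

pages = users * pages_per_user    # HTML pages per day
uris = pages * uris_per_page      # URI objects per day
total_kb = uris * kb_per_uri      # KB per day
total_tb = total_kb / 1000**3     # decimal units: 1 TB = 10^9 KB

print(pages, uris, total_kb, total_tb)
```

Using the upper bounds of each range instead (60, 30, 150 KB) gives a figure 270 times larger, so the lower-bound result is a floor rather than a typical value.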

Trend Micro's Journey of Cloud Discovery ─ http://findbook.tw/book/9789866126185/basic

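The new darling of the big data era ─ noSQL
• Not only SQL
• Started in 2009
• Has the following characteristics
• Does not use the relational data model
• Distributed storage by design
• Easy to scale horizontally
• Open source
• Easy to extend
• Simple API operations (CRUD, usually without SQL support)
• CAP (as opposed to ACID)
• Eventual Consistency, Availability, Partition Tolerance
• Stores huge volumes of heterogeneous data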

http://nosql-database.org/

Why use noSQL ?
• Easy to scale out
• Unlike an RDB, there are no relationships between records, so scaling out is easy

• High performance even with big data
• Table-level cache (RDB) vs. record-level cache (noSQL)

• Elastic data model
• Schema vs. schema-less/dynamic schema
• High availability
• Easy to add new machines (nodes) without any performance impact

Comparison between RDB and noSQL
Given a really huge amount of data…

Aspect | RDB | noSQL
Performance | Degrades as data grows | Stays close to small-data performance
Scalability | Mainly scale up | Mainly scale out
Reliability | ACID | CAP
Availability | Hard to maintain SLA | Easy to maintain SLA
Security | Robust | Depends
Economics | High-end machines | Commodity machines
Data model | Relational, fixed schema | Depends, but more likely simple and schema-less
Maturity | Very mature | Not mature, various products
Commercial support | Global companies | Small start-ups
OLAP/BI | Mature | Immature
Human resources | Easy to find | Hard to find

noSQL basic categories


iTcloud New Cloud Era ─ http://www.ithome.com.tw/002/cloud/cloud.html

Introduction to Apache HBase
• A top-level project of the ASF
• Belongs to the Key-Value category of noSQL DBs
• Derived from Google's
• Bigtable: A Distributed Storage System for Structured Data
• a distributed storage system for managing structured data that is
designed to scale to a very large size: petabytes of data across
thousands of commodity servers
• a sparse, distributed, persistent multi-dimensional sorted map

HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1

Apache HBase Concepts – Column-Oriented (1/2)

http://ofps.oreilly.com/titles/9781449396107/intro.html

Apache HBase Concepts – Column-Oriented (2/2)
• a sparse, distributed, persistent multi-dimensional sorted map
• which is indexed by row key, column key (column family +
qualifiers), and a timestamp
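
One way to picture the "multi-dimensional sorted map" is as nested dictionaries keyed by row key, then column key, then timestamp. The sketch below is a toy in-memory model under that assumption; the class `TinyTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict

class TinyTable:
    """Toy model of the map: row key -> column key -> timestamp -> value."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # A put never overwrites in place; it adds a new timestamped version.
        self.rows[row][column][ts] = value

    def get(self, row, column, ts=None):
        """Newest value at or before ts (newest overall when ts is None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row="", stop_row=None):
        """Yield (row, column, newest value) in sorted row-key order."""
        for row in sorted(self.rows):
            if row < start_row or (stop_row is not None and row >= stop_row):
                continue
            for column in sorted(self.rows[row]):
                yield row, column, self.get(row, column)

t = TinyTable()
t.put("r1", "f1:c1", "value", ts=1)
t.put("r1", "f1:c1", "value2", ts=2)  # an update is just a newer version
t.put("r2", "f2:c9", "other", ts=1)   # sparse: r2 stores no f1:c1 cell at all
cells = list(t.scan())
```

Column keys are written as `family:qualifier` to mirror the slide; only cells that were actually written take up space, which is what "sparse" refers to.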

Column Families


Apache HBase Concepts - Architecture

http://ofps.oreilly.com/titles/9781449396107/architecture.html

Hands-on (1/3) –
Use your VM (Virtual Machine) to install tm-puppet

• Please refer to the SPN Dev HBase training program again
• Install git on your PC
• Install tm-puppet on your VM


Hands-on (2/3) –
Use HBase shell
• Basic operations
• help, list, scan
• Create
• A table 'MY_FIRST_TABLE'
• Two column families 'FAM_1', 'FAM_2'
• Ex.
• create 't1', {NAME => 'f1'}, {NAME => 'f2'}
• create 't1', 'f1', 'f2'
• Put two records (columns)
• Ex. put 't1', 'r1', 'f1:c1', 'value'
• Update a record (column) (this is also a put)
• Delete a record (column)
• delete 't1', 'r1', 'f1:c1'

Hands-on (3/3) –
Requirements
• Put a screenshot of your successful tm-puppet installation into git
• Run the following commands
• jps
• ifconfig
• Capture the screen as an image
• Path : ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-001.jpg
• Put a screenshot of your HBase shell records into git
• Run the following commands
• scan 'MY_FIRST_TABLE'
• ifconfig
• Capture the screen as an image
• Path : ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-002.jpg
• Commit and push to git

noSQL architecture practices (1/8) –
Use noSQL as complement
• Use noSQL as a mirror (implemented by code)
• The RDB remains the primary store, and noSQL serves as a mirror

NoSQL Architecture Practice (1): NoSQL as a Complement ─ http://www.infoq.com/cn/news/2011/02/nosql-architecture-practice

//PSEUDO CODE for noSQL as a mirror
//We want to store the data Object
bool status = false;
DB.startTransaction(); //start transaction
id = DB.Insert(data); //write data Object to RDB
if (id > 0) {
    status = NoSQL.Add(id, data); //write data Object to noSQL by id
}
if (id > 0 && status == true) {
    DB.commit(); //commit transaction
} else {
    DB.rollback(); //failed, rollback transaction
}
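
The pseudo code above can be turned into a runnable sketch. This version assumes SQLite (Python's standard `sqlite3` module) as the RDB and a small dict-backed class as the noSQL store; `DictStore`, `save_mirrored`, and the `items` table are hypothetical names introduced only for illustration.

```python
import json
import sqlite3

class DictStore:
    """Hypothetical noSQL stand-in: a key-value store backed by a dict."""

    def __init__(self, fail=False):
        self.data = {}
        self.fail = fail  # simulate a noSQL write failure

    def add(self, key, value):
        if self.fail:
            return False
        self.data[key] = value
        return True

def save_mirrored(db, nosql, data):
    """Insert into the RDB, mirror to noSQL, commit only if both succeed."""
    cur = db.execute("INSERT INTO items (body) VALUES (?)", (json.dumps(data),))
    row_id = cur.lastrowid
    if row_id and nosql.add(row_id, data):
        db.commit()
        return row_id
    db.rollback()  # the mirror write failed, so undo the RDB insert
    return None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT)")
db.commit()

ok_store = DictStore()
ok_id = save_mirrored(db, ok_store, {"title": "hello"})
bad_id = save_mirrored(db, DictStore(fail=True), {"title": "lost"})
row_count = db.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

As in the pseudo code, the RDB transaction is the only thing that can be rolled back. If the commit itself failed after a successful noSQL write, the mirror would hold an orphan record, so a real system would need some cleanup or reconciliation step.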


• Use noSQL as a mirror (implemented by synchronization)


• Combine RDB & noSQL


//PSEUDO CODE for RDB & noSQL combination
//we want to store the data Object
data.title = "title";
data.name = "name";
data.time = "2009-12-01 10:10:01";
data.from = "1";
bool status = false;
DB.startTransaction(); //start transaction
//write into RDB, data.from is a value needed by search criteria
id = DB.Insert("INSERT INTO table (from) VALUES(data.from)");
if (id > 0) {
    //write data Object to noSQL by id
    status = NoSQL.Add(id, data);
}
if (id > 0 && status == true) {
    DB.commit(); //commit transaction
} else {
    DB.rollback(); //failed, rollback transaction
}
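
A runnable sketch of the same combination, again assuming SQLite as the RDB and a plain dict as the noSQL store. Only the searchable `from` value goes into the RDB; the full object lives in noSQL under the RDB id. The table name and the helper functions `save` and `find_by_from` are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# "from" is an SQL keyword, so the column name is quoted.
db.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, "from" TEXT)')
nosql = {}  # hypothetical key-value store keyed by the RDB id

def save(data):
    """Store only the searchable column in the RDB, the full object in noSQL."""
    cur = db.execute('INSERT INTO items ("from") VALUES (?)', (data["from"],))
    try:
        nosql[cur.lastrowid] = dict(data)
        db.commit()
    except Exception:
        db.rollback()
        raise
    return cur.lastrowid

def find_by_from(value):
    """Search in the RDB, then load the full objects from noSQL by id."""
    ids = db.execute('SELECT id FROM items WHERE "from" = ? ORDER BY id', (value,))
    return [nosql[row[0]] for row in ids]

first_id = save({"title": "title", "name": "name",
                 "time": "2009-12-01 10:10:01", "from": "1"})
second_id = save({"title": "other", "name": "name2",
                  "time": "2009-12-02 08:00:00", "from": "2"})
found = find_by_from("1")
```

The RDB row stays narrow (an id plus the search column), so queries only touch small rows while the larger payload is fetched by key.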

• What benefits do we get from combining RDB & noSQL?

• Less RDB I/O and less RDB storage space used
• Higher RDB table-level cache hit rate: the cache is refreshed only when key values (PK, FK, values used in search criteria) are updated
• More efficient synchronization in an RDB Master/Slave architecture
• More efficient RDB backup/recovery
• Better scalability/performance for the whole system

Use noSQL as master
• Use noSQL only
• Mainly for systems with simple query requirements
• But some noSQL products can also handle more complex queries
• MongoDB, Tokyo Cabinet, etc.

NoSQL Architecture Practice (2): NoSQL as the Master ─ http://www.infoq.com/cn/news/2011/03/nosql-architecture-practice-2

Use noSQL as master
• Use noSQL as major data source
• APs only write data into noSQL
• Then synchronize the data from noSQL to other data stores
based on their application
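
The write-then-synchronize flow above might look like the following sketch. Everything here is an assumption made for illustration: dicts stand in for the noSQL store and the downstream stores, and a simple in-memory change log drives the synchronization.

```python
from collections import deque

primary = {}          # hypothetical noSQL store: the only place APs write to
change_log = deque()  # keys changed since the last synchronization
search_index = {}     # downstream store 1: word -> set of keys
row_cache = {}        # downstream store 2: plain copy for fast reads

def write(key, doc):
    """APs write to noSQL only and record that the key changed."""
    primary[key] = doc
    change_log.append(key)

def synchronize():
    """Drain the change log and push each changed record downstream."""
    synced = 0
    while change_log:
        key = change_log.popleft()
        doc = primary[key]
        row_cache[key] = doc
        for word in doc["title"].lower().split():
            search_index.setdefault(word, set()).add(key)
        synced += 1
    return synced

write("m1", {"title": "HBase intro"})
write("m2", {"title": "noSQL intro"})
pending_before = len(change_log)
synced = synchronize()
```

Until `synchronize` runs, the downstream stores lag behind the primary, so each application has to tolerate that delay for the data it reads from them.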


Case Study (1/4) –
Facebook’s Real-time Message System
• Uses HBase to store 135+ billion messages a month
• HBase was chosen over a few competitors such as Cassandra and sharded MySQL

• Data Patterns
• A short set of temporal data that tends to be volatile
• An ever-growing set of data that rarely gets accessed

Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html

• Some key aspects of their system:
• HBase
• Has a simpler consistency model than Cassandra.
• Very good scalability and performance for their data patterns.
• Most feature rich for their requirements: auto load balancing and
failover, compression support, multiple shards per server, etc.
• HDFS, the filesystem used by HBase, supports replication, end-to-end
checksums, and automatic rebalancing.
• Facebook's operational teams have a lot of experience using HDFS
because Facebook is a big user of Hadoop and Hadoop uses HDFS as
its distributed file system.


• Haystack is used to store attachments.
• A custom application server was written from scratch in order
to service the massive inflows of messages from many
different sources.
• A user discovery service was written on top of ZooKeeper.
• Infrastructure services are accessed for: email account
verification, friend relationships, privacy decisions, and
delivery decisions
• In keeping with their approach of small teams doing amazing things, 20 new infrastructure services are being released by 15 engineers in one year.
• Facebook is not going to standardize on a single database platform; they will use separate platforms for separate tasks.

Alibaba China Site architecture


http://www.infoq.com/cn/presentations/hl-alibaba-cn-architecture-design-practice

Data Access pattern as the key
for noSQL
• Data Structure
• Structured
• Semi-structured
• Unstructured
• Size
• How many writes/reads and how often (proportion)
• Data Writing
• Transaction
• Data Reading
• Random access
• Sequential access
• Relationship
