7. How to get diffs
• Full-scan and compare
[Figure: data at t0 is a, b, c, d, e; data at t1 is a, b’, c, d, e’. Comparing the two full scans yields the diffs from t0 to t1: (1, b’) and (4, e’).]
• Partial-scan with bitmaps (or indexes)
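[Figure: the data changes from a, b, c, d, e to a, b’, c, d, e’ while the bitmap changes from 00000 to 01001. Scanning only the blocks whose bits are set yields the diffs (1, b’) and (4, e’).]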
• Scan logs directly from WAL storages
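[Figure: the data changes from a, b, c, d, e to a, b’, c, d, e’. The writes (1, b’) and (4, e’) are recorded in WAL storage, and the diffs (1, b’) and (4, e’) are read directly from those logs.]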
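The three approaches above can be compared with a minimal Python sketch. The function names, the list-of-blocks data model, and the `(index, data)` log record shape are illustrative assumptions, not part of WalB.

```python
from typing import Dict, List, Tuple

Diff = Dict[int, str]  # block index -> new block contents


def full_scan_diff(old: List[str], new: List[str]) -> Diff:
    """Compare every block of two full images."""
    return {i: b for i, (a, b) in enumerate(zip(old, new)) if a != b}


def bitmap_diff(new: List[str], dirty: List[int]) -> Diff:
    """Read only the blocks whose dirty bit is set."""
    return {i: new[i] for i, bit in enumerate(dirty) if bit}


def wal_diff(log: List[Tuple[int, str]]) -> Diff:
    """Replay (index, data) log records in order; later writes win."""
    diff: Diff = {}
    for index, data in log:
        diff[index] = data
    return diff


t0 = ["a", "b", "c", "d", "e"]
t1 = ["a", "b'", "c", "d", "e'"]
dirty = [0, 1, 0, 0, 1]        # the bitmap 01001 from the slide
log = [(1, "b'"), (4, "e'")]   # writes recorded between t0 and t1
```

All three functions produce the same diff for this example. They differ in cost: the full scan reads both images, the bitmap scan reads only the dirty blocks, and the log scan reads no data blocks at all.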
8. WalB architecture
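[Figure: architecture diagram with the following components]
• Any application (file system, DBMS, etc.)
– Issues reads and writes to a walb device
• Walb dev controller
– Controls the walb device
• A walb device, as a wrapper
– Data device: a block device for data, no special format
– Walb log device (log device): a block device for logs, an original format
• Walb log extractor
– Extracts logs from the log device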
10. Ring buffer inside
[Figure: a ring buffer of logpacks, from the oldest logpack to the latest logpack]
• Logpack
– Logpack header block
– 1st written data, 2nd written data, …
• Logpack header block
– Checksum
– Logpack lsid
– Num of records
– Total IO size
– 1st log record, 2nd log record, ...
• Log record
– IO address
– IO size
– ...
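The logpack layout above can be modeled roughly as follows. This is a sketch of the fields named on the slide only. The Python types, the `make_logpack` helper, the use of byte units, and CRC32 over the payload as the checksum are all assumptions, and the real on-disk format is not reproduced here.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LogRecord:
    io_address: int   # where the write lands on the data device
    io_size: int      # length of the write (bytes, by assumption)


@dataclass
class LogpackHeader:
    logpack_lsid: int
    records: List[LogRecord] = field(default_factory=list)
    checksum: int = 0

    @property
    def num_records(self) -> int:
        return len(self.records)

    @property
    def total_io_size(self) -> int:
        return sum(r.io_size for r in self.records)


def make_logpack(lsid: int, writes: List[Tuple[int, bytes]]):
    """Build (header, payloads) from a list of (address, data) writes."""
    header = LogpackHeader(logpack_lsid=lsid)
    payloads = []
    for address, data in writes:
        header.records.append(LogRecord(address, len(data)))
        payloads.append(data)
    # CRC32 is a stand-in; the slide does not say which checksum is used.
    header.checksum = zlib.crc32(b"".join(payloads))
    return header, payloads
```

Because each record carries only an address and a size, a reader can walk the ring buffer from one header to the next without any separate index.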
11. Redo/undo logs
• Redo logs
– Move time forward
• Undo logs
– Move time backward
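[Figure: redo logs (entries 0 and 2) lead from the data at t0 to the data at t1; undo logs (entries 0 and 2) lead from the data at t1 back to the data at t0.]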
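A small Python sketch of the redo/undo relationship follows. The list-of-blocks model, the helper names, and the choice of blocks 0 and 2 as the written blocks are assumptions made for illustration.

```python
from typing import List, Tuple

Log = List[Tuple[int, str]]  # (block index, block contents)


def apply_log(data: List[str], log: Log) -> List[str]:
    """Return a copy of data with each logged block written in order."""
    out = list(data)
    for index, block in log:
        out[index] = block
    return out


def make_undo_log(data: List[str], redo: Log) -> Log:
    """Record the old contents that each redo record would overwrite."""
    undo: Log = []
    current = list(data)
    for index, block in redo:
        undo.append((index, current[index]))
        current[index] = block
    undo.reverse()  # undo records must be applied newest-first
    return undo


t0 = ["a", "b", "c", "d", "e"]
redo = [(0, "a'"), (2, "c'")]
undo = make_undo_log(t0, redo)
t1 = apply_log(t0, redo)
```

Applying `redo` to the data at t0 gives the data at t1, and applying `undo` to the data at t1 restores t0. Building the undo log requires reading the old block contents before they are overwritten, which is extra work that a redo-only design avoids.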
22. IO processing flow (Naive)
[Figure: timelines with the following events and spans]
• Write
– WalB write IO response: from Submitted to Completed
– Packed
– Log IO response: from Log submitted to Log completed
– Wait for log flushed and overlapped IOs done
– Data IO response: from Data submitted to Data completed
• Read
– From Submitted to Completed
– Data IO response: from Data submitted to Data completed
23. IO processing flow (WalB)
[Figure: timelines with the following events and spans]
• Write
– WalB write IO response: from Submitted to Completed
– Packed
– Log IO response: from Log submitted to Log completed
– Pdata inserted
– Wait for log flushed and overlapped IOs done
– Data IO response: from Data submitted to Data completed
– Pdata deleted
• Read
– From Submitted to Completed
– Pdata copied
– Data IO response: from (Data submitted) to (Data completed)
• Pdata: pending data
24. Pending data
• A red-black tree
– Provided by the kernel
– Sorted by address so that overlapped IOs are easy to find
[Figure: tree nodes Node0, Node1, ..., NodeN, each holding addr, size, and data]
• A spinlock
– Gives mutual exclusion for accesses from multiple cores
25. Pdata pseudo code
Insert_to_pdata(pdata, io):
    (delete fully overwritten io(s) by the io from the pdata)
    insert the io to pdata.

Delete_from_pdata(pdata, io):
    (if not deleted yet) delete the io from pdata.

Copy_overlapped_area(pdata, io):
    get overlapped io list from pdata.
    sort the list by log sequence id.
    for each io (as iox) in the sorted list:
        copy overlapped area of iox to the io.

Each function must be executed atomically (with spinlock).
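The pseudo code above can be turned into a runnable Python sketch. Several substitutions are assumptions made to keep it short: a plain list stands in for the kernel red-black tree, `threading.Lock` stands in for the spinlock, and the `WriteIo` and `PendingData` names are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import List


@dataclass(eq=False)
class WriteIo:
    lsid: int     # log sequence id
    addr: int     # start address
    data: bytes

    @property
    def end(self) -> int:
        return self.addr + len(self.data)


class PendingData:
    def __init__(self) -> None:
        self._lock = threading.Lock()   # stand-in for the spinlock
        self._ios: List[WriteIo] = []   # stand-in for the red-black tree

    def insert(self, io: WriteIo) -> None:
        with self._lock:
            # Drop pending IOs that the new IO fully overwrites.
            self._ios = [x for x in self._ios
                         if not (io.addr <= x.addr and x.end <= io.end)]
            self._ios.append(io)

    def delete(self, io: WriteIo) -> None:
        with self._lock:
            # The IO may already be gone if a later write overwrote it.
            self._ios = [x for x in self._ios if x is not io]

    def copy_overlapped_area(self, addr: int, buf: bytearray) -> None:
        """Patch a read buffer that starts at addr with pending writes."""
        end = addr + len(buf)
        with self._lock:
            hits = [x for x in self._ios if x.addr < end and addr < x.end]
            # Apply older writes first so the newest data wins.
            for x in sorted(hits, key=lambda x: x.lsid):
                lo, hi = max(addr, x.addr), min(end, x.end)
                buf[lo - addr:hi - addr] = x.data[lo - x.addr:hi - x.addr]
```

Sorting by log sequence id before copying matters when pending writes overlap each other: the read must see the most recently logged bytes. The linear scans here are O(n); an address-sorted tree, as the slides describe, narrows the search for overlapping entries.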
26. Overlapped IO serialization
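[Figure: timeline of a data IO with the following events and spans]
– Oldata inserted
– Wait for overlapped IOs done, ending at Got notice
– Data IO response: from Data submitted to Data completed
– Oldata deleted
– Sent notice
• Serialize only overlapped IOs to guarantee a unique (deterministic) state
• Oldata: overlapped data
– Its structure is similar to pdata
– One counter per IO is enough to track whether overlapped IOs exist
– FIFO constraint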
27. Overlapped IO serialization (cont.)
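[Figure: IOs A, B, C, and D plotted by address and time, each passing through the Queued, Submitted, and Completed states]
• Oldata contents and counters over time: A (0), AB (00), ABC (002), ABCD (0022), CD (01), D (0)
• Counters at insertion: a=0, b=0, c=2, d=2; completions of earlier overlapped IOs decrement them (c--, d--)
• Legend
– Wait for completion of previously inserted overlapped IOs
– Wait for completion of the IO
– Wait for completion of previously inserted IOs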
28. Oldata pseudo code
Insert_to_oldata(oldata, io):
    io.ol_count = number of overlapped io(s) in oldata
    insert the io to oldata.
    if io.ol_count == 0:
        submit the io.
    else:
        the io should wait for notification.

Delete_from_oldata(oldata, io):
    delete the io from oldata.
    for each io (as iox) in all the overlapped io(s) in oldata:
        iox.ol_count -= 1
        if iox.ol_count == 0:
            notify iox that it can be submitted.

Each function must be executed atomically (with spinlock).
The order of insertion and deletion of IOs must be the same (FIFO).
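The counter scheme above can be sketched in Python as follows. The `Io` and `Oldata` names, the list-based storage, and `threading.Lock` in place of the spinlock are assumptions; returning the ready IOs from `delete` stands in for the notification step.

```python
import threading
from dataclasses import dataclass
from typing import List


@dataclass(eq=False)
class Io:
    name: str
    addr: int
    size: int
    ol_count: int = 0   # number of earlier overlapped IOs still in flight

    def overlaps(self, other: "Io") -> bool:
        return (self.addr < other.addr + other.size
                and other.addr < self.addr + self.size)


class Oldata:
    def __init__(self) -> None:
        self._lock = threading.Lock()   # stand-in for the spinlock
        self._ios: List[Io] = []        # kept in insertion (FIFO) order

    def insert(self, io: Io) -> bool:
        """Return True if the IO can be submitted immediately."""
        with self._lock:
            io.ol_count = sum(1 for x in self._ios if x.overlaps(io))
            self._ios.append(io)
            return io.ol_count == 0

    def delete(self, io: Io) -> List[Io]:
        """Remove a completed IO; return the IOs that may now be submitted."""
        ready = []
        with self._lock:
            self._ios.remove(io)
            for x in self._ios:
                if x.overlaps(io):
                    x.ol_count -= 1
                    if x.ol_count == 0:
                        ready.append(x)
        return ready
```

With the hypothetical ranges A = [0, 4), B = [4, 8), C = [2, 6), and D = [5, 10), the counters after insertion are 0, 0, 2, 2, which matches the counter values in the preceding slide. The sketch relies on the FIFO rule: if an IO were deleted before an earlier overlapped IO, the earlier IO's counter would be decremented even though it never counted the later one.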
47. Evaluation summary
• WalB overhead
– Not negligible when parallelism is low
– Log flush is not negligible for sequential writes on HDDs
• Request vs bio interface
– The bio interface performs better, except when the IO size is large
48. WalB summary
• A Linux block device driver
– Incremental backup
– Asynchronous replication
• Small overhead
– No persistent indexes
– No undo-logs
– No fragmentation
49. Development status and future plans
• Version 1.0
– For Linux kernel 3.2+ and x86_64 architecture
– Userland tools are minimal
• Improve userland tools
– Faster extraction/application of logs
– Logical/physical compression
– Backup/replication managers
• Submit kernel patches
50. Future work
• Add an all-zero flag to the log record format
– to avoid storing all-zero blocks in the log device
• Add bitmap management
– to avoid full scans when the ring buffer overflows
• (Support snapshot access)
– by implementing pluggable persistent address indexes
• (Support thin provisioning)
– if a clever defragmentation algorithm were available
51. Thank you for your attention!
• GitHub repository:
– https://github.com/starpos/walb/
• Contact:
– Email: hoshino AT labs.cybozu.co.jp
– Twitter: @starpoz (hashtag: #walbdev)