The document summarizes a reading-group presentation on distributed storage and processing. It covers two papers: 1) PABIRS, an integrated data access middleware that efficiently processes mixed query workloads on distributed file systems, and 2) Cherry Garcia, a client library for scalable distributed transactions across heterogeneous stores.
It then details PABIRS, which combines a bitmap index and an LSM (log-structured merge) tree index into a hybrid index. The bitmap index serves low-selectivity keys, while an LSM index is built only for "hot" values whose selectivity exceeds a threshold. The system aims to support efficient data retrieval and insertion for a variety of query workloads on distributed file systems.
2. Papers covered
1. PABIRS: A Data Access Middleware for Distributed File Systems
– S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC)
2. Scalable Distributed Transactions across Heterogeneous Stores
– A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
R3: Distributed Storage and Processing. Presenter: Wakamori (NTT)
3. • Purpose
– Efficiently process workloads that mix highly selective queries and analytical queries on a distributed FS
• Challenges
– Preprocessing such as sorting and indexing is effective, but complex preprocessing lowers insertion throughput
– Designing an index for real data with a power-law distribution (figure below) is also difficult
• Contributions
– Proposal of an integrated data access middleware (PABIRS) that efficiently processes workloads mixing complex queries
Example: searching for the most recent few months of records (a few thousand) among billions of call-history records
PABIRS: A Data Access Middleware for
Distributed File Systems
Fig. 1. Distribution of Call Frequency (Call Frequency vs. Caller ID)
Fig. 2. Number of Blocks per Key (Number of Blocks vs. Caller ID)
Figures are quoted from the original paper.
Example: 1,000 phone numbers (IDs) sampled at random from a telecom company's call-log data
Fig. 1. Distribution of call frequency
Fig. 2. Number of data blocks containing each ID
・Calls from 1% of the IDs account for more than half of all calls
・Power-law distribution
・Analyses aggregate by attributes such as location and number of calls
・Records of frequent IDs appear in almost every DFS block
4. PABIRS = Bitmap + LSM index
• Access paths to (semi-)structured data on a DFS
– A GET interface to the DFS
– MapReduce processing: an InputFormat for map tasks
– KVS transactions: a secondary index
• DFS wrapper: a hybrid index (see the sketch below)
– Bitmap index: for low-selectivity keys/tuples
– LSM index: built only for hot values
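The following is a minimal sketch of how such a dispatch could look; the class and method names (HybridIndex, blocksPossiblyContaining, etc.) are illustrative assumptions, not PABIRS's actual API. Hot keys are answered from the LSM index, and all other keys fall back to a bitmap-guided scan of candidate DFS blocks.

import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: dispatch a key lookup to the LSM index for hot
// keys; otherwise use the per-block bitmap signatures to prune the DFS blocks
// that need to be scanned.
class HybridIndex {
    private final Set<String> hotKeys;               // keys that have an LSM index
    private final Map<String, List<Long>> lsmIndex;  // hot key -> sorted record offsets
    private final BitmapSignatures bitmaps;          // per-block signature summaries

    HybridIndex(Set<String> hotKeys, Map<String, List<Long>> lsmIndex, BitmapSignatures bitmaps) {
        this.hotKeys = hotKeys;
        this.lsmIndex = lsmIndex;
        this.bitmaps = bitmaps;
    }

    List<Long> lookup(String key) {
        if (hotKeys.contains(key)) {
            return lsmIndex.get(key);                 // served from the sorted hot-key replicas
        }
        List<Integer> candidateBlocks = bitmaps.blocksPossiblyContaining(key);
        return scanBlocks(candidateBlocks, key);      // scan only the blocks whose signature matches
    }

    private List<Long> scanBlocks(List<Integer> blocks, String key) {
        // placeholder: read each candidate block from the DFS and filter by key
        return List.of();
    }
}

// Minimal abstraction of the signature lookup used above.
interface BitmapSignatures {
    List<Integer> blocksPossiblyContaining(String key);
}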
Fig. 3. Architecture of PABIRS: a wrapper over the DFS exposing InputFormat, Insert(key, value) and Lookup(key)
Fig. 4. Example of the bitmap
・A signature is kept for each block
・DAG-based hierarchical structure (directory vertices → data vertices)
Fig. 4. Bitmap Example (per-UID block signatures for data blocks 1 and 2)
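A small sketch of how one per-block signature might be built; the hash-to-bit mapping and the signature width are assumptions for illustration (the paper maps m attribute values to each bit), not the paper's exact scheme.

import java.util.BitSet;
import java.util.List;

// Illustrative sketch: build one bitmap signature per data block by hashing each
// record's indexed attribute value onto a bit position. A set bit means "this
// value may appear in the block"; lookups can give false positives but never
// false negatives.
class BlockSignatureBuilder {
    private final int signatureBits;   // number of bits per signature

    BlockSignatureBuilder(int signatureBits) {
        this.signatureBits = signatureBits;
    }

    BitSet buildSignature(List<String> attributeValues) {
        BitSet signature = new BitSet(signatureBits);
        for (String value : attributeValues) {
            signature.set(bitFor(value));
        }
        return signature;
    }

    boolean mayContain(BitSet signature, String value) {
        return signature.get(bitFor(value));
    }

    private int bitFor(String value) {
        return Math.floorMod(value.hashCode(), signatureBits);
    }
}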
(Excerpt from the original LSM-Tree paper, describing the rolling merge: leaf entries taken from the in-memory C0 tree are merged into newly filled C1 leaf nodes; full multi-page blocks of merged results are written to new positions on disk so that the old blocks remain available for crash recovery, and successive merge steps sweep through increasing key ranges until the cycle restarts from the smallest values.)
Figure 2.2. Conceptual picture of rolling merge steps, with result written back to disk (C0 tree in memory, C1 tree on disk)
LSM Tree [O'Neil+, '96]
・Characteristic: high write throughput
・Writes go to the in-memory C0 component (an AVL tree)
・When the size of C0 exceeds a threshold, it is rolling-merged into the on-disk C1 component (a B-tree); see the toy sketch below
Cost estimation
Figures are quoted from the original papers (the diagram above is from the original LSM-Tree paper).
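Below is a toy sketch of the C0/C1 idea under simplified assumptions: a TreeMap stands in for both the in-memory AVL tree and the on-disk B-tree, and the rolling merge is collapsed into a single bulk merge rather than an incremental, multi-page-block process.

import java.util.TreeMap;

// Toy illustration of the LSM write path: inserts go to the in-memory sorted
// component C0; once C0 exceeds a threshold, its entries are merged into the
// (conceptually on-disk) sorted component C1.
class TinyLsm {
    private final TreeMap<String, String> c0 = new TreeMap<>(); // in-memory component
    private final TreeMap<String, String> c1 = new TreeMap<>(); // stand-in for the on-disk B-tree
    private final int c0Threshold;

    TinyLsm(int c0Threshold) {
        this.c0Threshold = c0Threshold;
    }

    void insert(String key, String value) {
        c0.put(key, value);                  // fast, memory-only write
        if (c0.size() >= c0Threshold) {
            rollingMerge();
        }
    }

    String get(String key) {
        String v = c0.get(key);              // the newest data is in C0
        return (v != null) ? v : c1.get(key);
    }

    private void rollingMerge() {
        c1.putAll(c0);                       // merge C0 entries into C1 in key order
        c0.clear();
    }
}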
5. Optimizing the hybrid index
1. Cost model and optimization for the bitmap signatures
– High-level vertices are generated from low-level vertices based on a fanout parameter F
– A cost model is defined and the F that minimizes the cost is estimated
– The graph is searched with the BSP model of Pregel [Malewicz+, '10]
2. Optimization with LSM
– An LSM index is created for a key when its selectivity exceeds a threshold
Fig. 7. Search cost of the bitmap and the LSM index
・The LSM cost is roughly constant regardless of selectivity
・The bitmap is faster when the selectivity is 0.1% or below
・In practice, more than 90% of queries have a selectivity of 0.1% or below
Fig. 7. Search Cost of Bitmap and LSM (processing time in msec vs. selectivity of call numbers, 0.02%-0.2%)
(Excerpt from the original paper:) where E is the fanout of the B-tree. We try to insert the key into the LSM index only when the estimated cost is no larger than the cost of the bitmap index. Based on the inequality above, we are able to calculate the minimal selectivity that makes LSM a more attractive selection than the bitmap; Figure 7 compares the two theoretical cost curves.
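A hedged sketch of that decision rule follows; the two cost estimators are placeholders shaped after Fig. 7 (LSM roughly flat, bitmap growing with selectivity) and are not the paper's actual formulas.

// Illustrative decision rule only: the paper derives the real cost formulas from
// block size, tuple size, B-tree fanout and cluster parameters. Placeholder
// estimators stand in for them here.
class IndexChooser {
    private final double lsmLookupCostMs;     // roughly flat in Fig. 7
    private final double dagTraversalCostMs;  // fixed part of a bitmap lookup
    private final double blockReadCostMs;     // cost of reading one DFS block

    IndexChooser(double lsmLookupCostMs, double dagTraversalCostMs, double blockReadCostMs) {
        this.lsmLookupCostMs = lsmLookupCostMs;
        this.dagTraversalCostMs = dagTraversalCostMs;
        this.blockReadCostMs = blockReadCostMs;
    }

    // selectivity: fraction of data blocks that a query on this key touches
    boolean buildLsmFor(double selectivity, long totalBlocks) {
        double bitmapCost = dagTraversalCostMs + selectivity * totalBlocks * blockReadCostMs;
        double lsmCost = lsmLookupCostMs;
        return lsmCost <= bitmapCost;   // keep the key in the LSM index only when it is no costlier
    }
}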
Fig. 8. Index Update (the Index Manager uses data statistics; newly appended data streams in the DFS are processed by a MapReduce algorithm that updates the bitmap signatures and the LSM index)
C. Update on the Indices
PABIRS is specifically designed for applications that require fast data insertion. In PABIRS, the bitmap index is a lightweight index that can be built in batch, while the LSM index is intentionally designed to support fast insertion.
Fig. 8. Updating the indexes
・New data is appended to the DFS
・An offline MapReduce job updates the bitmap signatures and the hot keys of the LSM index (a rough outline follows below)
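A rough outline of that batch step under assumed interfaces (SignatureStore, HotKeyLsm and the method names are mine, not PABIRS's API); the paper implements it as an offline MapReduce job.

import java.util.List;

// Rough outline only: new data is appended to the DFS, and a periodic offline job
// builds signatures for the newly appended blocks and refreshes the LSM entries
// for keys whose statistics mark them as hot.
class OfflineIndexUpdate {
    interface NewBlock { String id(); List<String> keys(); }
    interface SignatureStore { void addSignature(String blockId, List<String> keys); }
    interface HotKeyLsm { boolean isHot(String key); void add(String key, String blockId); }

    static void run(List<NewBlock> appendedBlocks, SignatureStore signatures, HotKeyLsm lsm) {
        for (NewBlock block : appendedBlocks) {
            signatures.addSignature(block.id(), block.keys());   // extend the signature DAG
            for (String key : block.keys()) {
                if (lsm.isHot(key)) {
                    lsm.add(key, block.id());                    // keep hot-key replicas up to date
                }
            }
        }
    }
}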
III. HYBRID INDEXING SCHEME
The general idea behind our hybrid indexing scheme is to build bitmap signatures for all data blocks and select certain hot keys for the LSM index. Bitmap signatures are created for multiple attributes without re-ordering the records. To facilitate efficient parallel search, we design a hierarchical model based on a virtual Directed Acyclic Graph (DAG) structure, in which each intermediate vertex is a summary of the signatures accessible on its descendants. We present an example DAG structure in Figure 5 as a virtual index structure on two different attributes, using 8 bits and 5 bits for these two attributes respectively.
Generally speaking, the DAG structure consists of three
layers. The retrieval layer contains individual signatures cor-
responding to the data blocks, while each intermediate vertex
in the index layer is associated with a summary signature by
merging signatures of its children vertices. Data layer refers
to the physical data blocks stored in the DFS. Signatures and
their corresponding graph vertices are randomly distributed to
multiple DFS nodes. In the rest of the paper, we refer to the vertices in the retrieval layer as data vertices and to the vertices in the index layer as directory vertices.
On the other hand, LSM index replicates the records with
hot keys and sorts them in its B-trees. For each indexing
attribute, we independently create an LSM index to maintain
its sorted replicas of hot data. In the rest of the section, we
first introduce the tuning approaches used on the bitmap-based
index, followed by the selection strategy between these two
indices. For better readability, the notations used in the section
are summarized in Table I.
A. Optimizations on Bitmap Signature
Suppose the signature of each data block follows the same distribution {p1, p2, . . . , pk}, in which each pj indicates the probability of having "1" on the j-th bit. Because of the exclusiveness between the values, the signature is a sparse vector, i.e. Σ_j pj ≪ k. Given two signatures s1 and s2, the expected number of bit positions carrying a "1" in both s1 and s2 is Σ_j (pj)^2. This is much smaller than the expected number of "1"s in either s1 or s2, i.e. Σ_j pj, unless there exists a pj dominating the distribution.
When records are randomly assigned to the data blocks, each probability pj is supposed to be a small positive number. This leads to the phenomenon of Weak Locality in PABIRS.
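As a rough numeric illustration (the numbers are hypothetical, not taken from the paper): with k = 1000 bit positions and a uniform pj = 0.01, Σ_j pj = 1000 × 0.01 = 10 while Σ_j (pj)^2 = 1000 × 0.0001 = 0.1, so two random block signatures are expected to share almost no "1" bits even though each carries about ten of them. Merging children into a directory vertex therefore simply unions their bits, which is why grouping by similarity brings little benefit.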
TABLE I. NOTATION
N: total number of records/tuples
Bp: size of a data block
Bt: size of a tuple
θ: query selectivity
k: number of distinct values of the attributes
m: number of values mapped to the same bit
F: fanout of the directory vertex
W: number of virtual machines (workers)
rl: computation cost of a directory vertex
rn: network delay between any pair of vertices
rd: the overhead of reading a data block
θ: selectivity of a particular queried key
fmin: minimum frequency of any value in a domain
p(θ): pdf of the distribution on selectivity θ of queries
Fig. 5. Demonstration of Signature Graph
It is thus not helpful to group similar signatures when building a high-level directory vertex in the index layer, because such merging only generates a new signature with a union of "1"s from the signatures of its children vertices. Although it is unlikely to optimize by better grouping, the fanout of the abstract tree structure, i.e. the number of children vertices for every directory vertex, remains tunable and turns out to be crucial to the searching efficiency.
1) Cost Model and Fanout Optimization: Instead of picking up similar signatures during bitmap construction, PABIRS simply groups the low-level vertices to generate a high-level vertex, based on a pre-specified fanout parameter F (a small sketch of this grouping follows below).
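A small sketch of that grouping step; the OR-merge and the fanout parameter F follow the description above, while everything else (types, method names) is illustrative. It assumes a fanout of at least 2.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch: group every F low-level signatures and OR them into one
// higher-level directory signature, repeating level by level until a single
// root summary remains.
class SignatureDag {
    static List<BitSet> buildDirectoryLevel(List<BitSet> children, int fanout) {
        List<BitSet> parents = new ArrayList<>();
        for (int i = 0; i < children.size(); i += fanout) {
            BitSet merged = new BitSet();
            for (int j = i; j < Math.min(i + fanout, children.size()); j++) {
                merged.or(children.get(j));   // union of the children's "1" bits
            }
            parents.add(merged);
        }
        return parents;
    }

    static List<List<BitSet>> buildAllLevels(List<BitSet> dataVertices, int fanout) {
        List<List<BitSet>> levels = new ArrayList<>();
        List<BitSet> current = dataVertices;
        levels.add(current);
        while (current.size() > 1) {
            current = buildDirectoryLevel(current, fanout);
            levels.add(current);              // each pass adds one index layer
        }
        return levels;
    }
}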
* For other optimizations, e.g., those for analytical workloads, see the original paper.
Figures are quoted from the original paper.
6. Experiments and evaluation
• Environment
– Implemented on top of Hadoop 1.0.4 + GPS [Salihoglu+, '12] (an open-source Pregel implementation)
– Experiments on a 32-node Hadoop cluster (4-core CPU and 8 GB RAM per node)
• Items
A. Highly selective queries: three select queries over call-history data, compared with HBase Phoenix, Impala, and BIDS [Lu+, '13]
B. Analytical queries: TPC-H Q3, Q5 and Q6 on synthetic data generated with tpcdskew [Bruno+, '05], compared with Hive
Fig. 13. Response time (seconds) of queries Q1-Q3 for HBase, PABIRS, BIDS, Impala-8G and Impala-4G
Fig. 14. Effect of data size (80-400 GB) on the average response time of Q1-Q3
Fig. 19. Throughput of concurrent queries (Q1) vs. query batch size, for a single and for four processor threads
Fig. 20. Response time of concurrent queries (Q1) vs. query batch size, for a single and for four processor threads
Fig. 21. Performance of TPC-H queries vs. skew factor (response time in seconds; q3/h, q3/p, q5/h, q5/p, q6/h, q6/p)
Fig. 22. Performance of TPC-H queries vs. query selectivity in months (response time in seconds; q3/h, q3/p, q5/h, q5/p, q6/h, q6/p)
(Excerpt from the experiments section of the original paper:) To avoid a query with an empty result, we intentionally select a number with at least one record under the base station. PABIRS can effectively handle queries with a high selectivity but still involving numerous tuples. As shown in Table III, in our 160G dataset, we have 40960 blocks in total. Although the selectivities of the queries are as low as 0.00001%, the records related to Q1, Q2 and Q3 cover 477, 28863 and 343 data blocks respectively. The involved data blocks, especially for Q1 and Q3, are nowhere close to the total number of data blocks, while the overhead of loading hundreds of data blocks from the disks remains high.
In the experiments, PABIRS, Phoenix and BIDS are allowed to use 4 GB of main memory on each node of the cluster, while Impala is tested under two settings with 4 GB and 8 GB of main memory respectively. The results in Figure 13 show that Impala-4G is unable to finish the queries in reasonable time (i.e. 1,000 seconds), as it incurs a high I/O cost on memory-disk data swaps. This reveals the limitation of Impala in memory usage efficiency. Moreover, Impala and BIDS show similar performance for all queries, because both approaches adopt scan-based techniques (memory scan and disk scan). In the rest of the experiments, we only report the results of Impala-8G, denoted as Impala for short. The results also imply that PABIRS significantly outperforms the other systems on all queries. When the selectivity of the query is high, as for Q1 and Q3, HBase Phoenix is the only alternative with performance close to PABIRS, because of its adoption of a secondary index. But for a query involving a large portion of the data, like Q2, HBase Phoenix is slow as it incurs many random I/Os to retrieve all results.
TABLE III. PROCESSING TIME OF PABIRS
QID selectivity index time disk time total time
Q1 1.2% 1.03s 1.47s 2.50s
Q2 70% 2.11s 137.63s 139.74s
Q3 0.8% 1.04s 1.28s 2.32s
The performance of PABIRS and HBase Phoenix degrades slightly when more insertions are conducted, because they need to build and query indexes for the new tuples. Finally, we implement a simple transaction module as discussed in Section 2. Our test transaction retrieves all records of a specific phone number (normally hundreds to thousands of records) and updates the values of NeID in those records to a new value. We vary the number of concurrent transactions and, as shown in Figure 18, PABIRS provides good throughput for this test transaction.
In PABIRS, queries can be grouped into a batch and share the index searching process. In Figure 19 and Figure 20, we show the throughput and response time for varied batch sizes. As each node in the cluster is equipped with a 4-core CPU, we start four concurrent I/O threads at the same time. For comparison purposes, we also show the result when a single I/O thread runs. The throughput of four I/O threads is almost three times higher than in the single-thread case. The throughput improves dramatically for a larger query batch, since we can share more signature and data scans among the queries. However, the results imply that the throughput gain shrinks as the query batch size increases, so it is important to choose an appropriate batch size in practice. The response time is also affected by the batch size: Figure 20 illustrates that the response time is generally proportional to the batch size. If a strict real-time requirement is needed, the system has to choose the batch size carefully in order to strike a balance between throughput and response time.
C. Analytic Query Performance
In this group of experiments, we evaluate the performance of PABIRS on data and queries generated by the TPC-H benchmark. Specifically, we generate 320 GB of data with different skew factors using the TPC-H Skew Generator. We deploy Hive on top of PABIRS and compare the performance of PABIRS against the original Hive on queries Q3, Q5 and Q6 of TPC-H. We also include Impala in the experiment. However, Impala requires buffering all intermediate join results in memory, which leads to the "Memory Limit Exceeded" exception. For Q3 and Q6, we build an index for the column shipdate [...]; if we increase ε to a larger value (e.g., 365), PABIRS finds that index-based access is even worse than scan-based access and automatically switches to the disk scan, which generates better performance.
(A: Fig. 13; B: Fig. 21 (skew), Fig. 22 (selectivity))
・Small skew: performance comparable to Hive
・Large skew: the index improves performance
・No benefit for Q5 (because the index exists only on orders)
Figures are quoted from the original paper.
7. Scalable Distributed Transactions across Heterogeneous Stores
• Purpose
– Support transactions over multiple items spanning different data stores
• Challenges
– Handling transactions inside the application: error-prone for programmers and risks losing availability and scalability
– Introducing coordinator middleware: every application must run under its control
• Contributions
– Proposal of Cherry Garcia (CG), a client library that supports multi-item transactions across different data stores
– Implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS) and Tora (a high-throughput KVS)
– Evaluated with YCSB+T [Dey+, '14] (a web-scale transaction benchmark)
BEGIN TRANSACTION
SET item1 of Store1
SET item2 of Store2
COMMIT TRANSACTION
8. Transactions across different data stores
public void UserTransaction() {
    Datastore cds = Datastore.create("credentials.xml");
    Datastore gds = Datastore.create("goog_creds.xml");
    Datastore wds = Datastore.create("msft_creds.xml");
    Transaction tx = new Transaction(cds);
    try {
        tx.start();
        Record saving = tx.read(gds, "saving");
        Record checking = tx.read(wds, "checking");
        int s = saving.get("amount");
        int c = checking.get("amount");
        saving.set("amount", s - 5);
        checking.set("amount", c + 5);
        tx.write(gds, "saving", saving);
        tx.write(wds, "checking", checking);
        tx.commit();
    } catch (Exception e) {
        tx.abort();
    }
}
Listing 1. Example code that uses the API to access two data stores
Listing 1. A transaction across two data stores written with the CG API
Datastore: an instance of a data store
Transaction: the transaction coordinator
・Reads 'saving' from the Google Cloud Storage Datastore (gds) and 'checking' from the Windows Azure Storage Datastore (wds), and updates each of them
・A further Datastore acting as the Coordinating Data Store (CDS) is also used
Figures are quoted from the original paper.
9. Cherry Garcia (CG): a client library
• Assumptions about the platform
– Strong consistency when reading a single record
– Atomic single-item update and delete (test-and-set)
– User-defined metadata can be stored within an item
II. SYSTEM DESIGN
(Excerpt from the original paper:) In this section, we describe the design of our client-coordinated transaction processing protocol that enables transactions involving multiple data items that span multiple heterogeneous data store instances. The protocol is to be implemented as a library whose API abstracts data store instances with a class called Datastore, and these are accessed via a transaction coordinator abstraction, a class called Transaction. Each record is addressable using a string key and its data is accessed using an object of a class called Record. Listing 1 is an example of an application that uses the API to access two data records, one ("saving") residing in an instance of Google Cloud Storage, abstracted by the Datastore gds, while the other is stored in Windows Azure Storage represented as the Datastore wds. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).
Fig. 1. Library architecture: Applications 1-3 each embed a Transaction coordinator from the Cherry Garcia library, which accesses Tora, Windows Azure Storage and Google Cloud Storage through per-store Datastore abstractions over store-specific REST APIs; one store serves as the coordinating storage holding the TSR.
2) Overview: In essence, the protocol calls for each data
item to maintain the last committed and perhaps also the
currently active version, for the data and relevant meta-
data. Each version is tagged with meta-data pertaining to
the transaction that created it. This includes the transaction
commit time and transaction identifier that created it, pointing
to a globally visible transaction status record (TSR) using a
Universal Resource Identifier (URI). The TSR is used by the
client to determine which version of the data item to use when
reading it, and so that transaction commit can happen just
by updating (in one step) the TSR.
Fig. 1. Library architecture
Figures are quoted from the original paper.
• Overview of transaction coordination by the client
• Each record is treated like a single-item database
• Transaction coordination uses 2PC
• There is no central coordinator
• Transaction state is attached to the data, and the clients do the coordination (a sketch of the implied record layout follows below)
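The bullet points above suggest a per-record data model roughly like the following sketch; the field and type names are mine, not Cherry Garcia's actual classes.

// Illustrative data model only: each record carries its last committed version
// and possibly a currently active (prepared) one. Each version is tagged with
// the commit timestamp and a URI pointing to the transaction status record
// (TSR), so a reader can decide which version is visible.
class VersionedRecord {
    static final class Version {
        final byte[] value;
        final long commitTimestamp;   // when the writing transaction committed
        final String tsrUri;          // globally visible transaction status record
        Version(byte[] value, long commitTimestamp, String tsrUri) {
            this.value = value;
            this.commitTimestamp = commitTimestamp;
            this.tsrUri = tsrUri;
        }
    }

    Version lastCommitted;            // always safe to read
    Version currentlyActive;          // written by a PREPARED transaction

    // A reader checks the TSR referenced by the active version: if that
    // transaction is COMMITTED, the active version wins; otherwise fall back
    // to the last committed version.
    Version visibleVersion(TsrClient tsr) {
        if (currentlyActive != null && tsr.isCommitted(currentlyActive.tsrUri)) {
            return currentlyActive;
        }
        return lastCommitted;
    }

    interface TsrClient {
        boolean isCommitted(String tsrUri);
    }
}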
10. Timeline of a CG transaction
• 2PC
– Each data item carries its current state and its previous state
– PREPARED flags are set in the order of the keys' hash values
– A Transaction Status Record (TSR) is written to the Coordinating Data Store (CDS), and then the COMMITTED flags are set (in parallel); see the sketch below
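A rough sketch of the commit sequence implied by these bullets; the interfaces and method names are illustrative assumptions, not Cherry Garcia's actual API.

import java.util.Comparator;
import java.util.List;

// Rough sketch of the client-coordinated commit: PREPARE each written record in
// the order of its key hash, write one transaction status record (TSR) to the
// coordinating data store to decide the outcome in a single step, then mark the
// records COMMITTED in parallel.
class ClientCommit {
    interface WrittenRecord { String key(); void markPrepared(); void markCommitted(); }
    interface CoordinatingStore { void writeTsr(String txId, String state); }

    static void commit(String txId, List<WrittenRecord> writes, CoordinatingStore cds) {
        // Phase 1: PREPARE in deterministic key-hash order
        writes.stream()
              .sorted(Comparator.comparingInt((WrittenRecord w) -> w.key().hashCode()))
              .forEach(WrittenRecord::markPrepared);

        // Decision point: once the TSR says COMMITTED, the transaction outcome
        // is decided even if this client crashes afterwards.
        cds.writeTsr(txId, "COMMITTED");

        // Phase 2: roll the records forward; safe to do in parallel
        writes.parallelStream().forEach(WrittenRecord::markCommitted);
    }
}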
Fig. 2. The timeline describing 3 transactions running on 2 client hosts to access records in 2 data stores (WAS and GCS) using a third data store as a CDS: each transaction reads versions into a transaction cache, writes new versions, PREPAREs them, writes the TSR, and finally marks the records COMMITTED.
Figures are quoted from the original paper.
11. Implementation and experiments
• Implementation of Cherry Garcia
– A Java library (JDK 1.6)
– The Datastore abstraction is implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS) and Tora (a KVS running on the WiredTiger storage engine)
• Experiments
Figures are quoted from the original paper.
Fig. 6. Aborts measured varying theta with 1 YCSB+T client against a 1-node Tora cluster (aborts per million transactions: 1885.4 at theta 0.1, 1888.6 at 0.3, 1862.2 at 0.5, 1911.6 at 0.7, 5898.4 at 0.9, 33810 at 0.99)
Fig. 7. Throughput of 4 YCSB+T client hosts, each with 1 through 64 threads, against a 4-node Tora cluster (transactions/second vs. total client threads)
Fig. 8. Throughput of YCSB+T with 16 through 128 threads on 8 client hosts against a 4-node Tora cluster (transactions/second vs. number of hosts, each running 16 client threads)
Fig. 8. Throughput when transactions are run with 16 to 128 threads from 8 clients against a 4-node Tora cluster
(Excerpt from the original paper:) [The throughput] increased linearly until 16 threads, and the average latency for each request stayed within the 500 µs mark. As the number of threads was increased beyond 16, the latency increased until it reached 4.5 ms at 64 threads. This increased latency suggests that there is a performance bottleneck somewhere in the system.
We ran a further test with 4 client hosts and a cluster of
4 Tora servers and repeated the previous test and varied the
number of threads from 1 through to 64 threads across all 4
client hosts and measured the throughput. The graph in Figure
7 shows that the performance on each host scales linearly until
16 threads (an aggregate of 64 threads across 4 client hosts)
and then flattens out. We observed that the socket send buffers
on the servers were full suggesting a network bottleneck at the
client.
G. Experiment 4: abort rates vary with contention
We set up one EC2 m3.2xlarge server each as a YCSB+T client and Tora server in AWS and ran the client with 16 threads, with a read to read-write ratio of 50:50, over 1 million transactions. We used the Zipfian access key pattern and varied the theta value over 0.1, 0.3, 0.5, 0.7, 0.9 and 0.99. Figure 6 shows that the aborts increase as the contention increases, though aborts are infrequent even under extreme contention.
H. Experiment 5: Scale-out test
We ran YCSB+T with a mix of 90:10 read to read-modify-write operations in a Zipfian data access pattern with theta set to 0.99, across 1 to 8 client hosts each with 16 threads, running against a 4-node Tora cluster, and collected the throughput.
Fig. 9. Overhead of transactions and the effect of the 1-phase optimization (throughput in transactions/second vs. number of client threads, for 1-record and 2-record transactional vs. non-transactional workloads, and for 3-record transactions with serial vs. parallel phase 2)
Fig. 9. Transaction overhead and the effect of the 1-phase optimization (*)
(*) The PREPARE phase is skipped for transactions limited to a single item
・Scales linearly (up to 23,288 transactions/sec)
・With the 1-phase optimization, the transaction overhead is small
・Parallelizing phase 2 improves throughput