[ICDE2015 Study Group]
Research 3:
Distributed Storage and Processing
Presenter: Takuma Wakamori (NTT)
2015.5.16
Papers covered
1. PABIRS: A Data Access Middleware for Distributed File Systems
   – S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC)
2. Scalable Distributed Transactions across Heterogeneous Stores
   – A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
PABIRS: A Data Access Middleware for Distributed File Systems
• Goal
  – Efficiently process workloads that mix highly selective queries and analytic queries on a distributed file system (DFS)
• Challenges
  – Preprocessing such as sorting and indexing helps, but complex preprocessing lowers insertion throughput
  – Designing an index for real data with a power-law distribution (figures below) is also difficult
• Contribution
  – PABIRS, a unified data access middleware that efficiently processes workloads mixing complex queries
Example: retrieve the most recent few months of records (thousands) out of billions of call-history records
[Fig. 1: Distribution of Call Frequency (call frequency vs. caller ID); Fig. 2: Number of Blocks per Key (number of blocks vs. caller ID). Figures are cited from the original paper.]
Example: 1,000 phone numbers (IDs) sampled at random from a telecom operator's call log
Fig. 1: Distribution of call frequency
Fig. 2: Number of data blocks that contain each ID
• Calls from 1% of the IDs account for more than half of all calls
• Power-law distribution
• Analyses aggregate by attributes such as location and call count
• Records of frequent IDs appear in almost every DFS block
PABIRS  =  Bitmap  +  LSM  index
• Access paths to (semi-)structured data on the DFS
  – GET interface to the DFS
  – MapReduce processing: InputFormat for map tasks
  – KVS transactions: secondary index
• DFS wrapper: a hybrid index (a lookup sketch follows below)
  – Bitmap index: for low-selectivity keys/tuples
  – LSM index: built only for hot values
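To make the division of labor concrete, the following is a minimal, hypothetical sketch of how a PABIRS-style lookup could dispatch between the two index types: hot keys are answered from an LSM index, everything else goes through a bitmap-signature scan over DFS blocks. All class and method names (HybridIndexLookup, LsmIndex, BitmapSignatureIndex, DfsReader) are invented for illustration; this is not the authors' code.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hybrid (bitmap + LSM) lookup path.
public class HybridIndexLookup {

    public interface LsmIndex {                  // sorted replicas of hot keys only
        boolean contains(String key);
        List<String> get(String key);
    }

    public interface BitmapSignatureIndex {      // one signature per DFS block
        List<Long> candidateBlocks(String key);  // blocks whose signature may contain the key
    }

    public interface DfsReader {                 // reads the raw records of one block
        List<String> readBlock(long blockId);
    }

    private final LsmIndex lsm;
    private final BitmapSignatureIndex bitmap;
    private final DfsReader dfs;

    public HybridIndexLookup(LsmIndex lsm, BitmapSignatureIndex bitmap, DfsReader dfs) {
        this.lsm = lsm;
        this.bitmap = bitmap;
        this.dfs = dfs;
    }

    public List<String> lookup(String key) {
        // Hot keys: answer directly from the LSM index.
        if (lsm.contains(key)) {
            return lsm.get(key);
        }
        // Cold keys: prune blocks with the bitmap signatures, then scan the candidates.
        List<String> result = new ArrayList<>();
        for (long blockId : bitmap.candidateBlocks(key)) {
            for (String rec : dfs.readBlock(blockId)) {
                if (rec.contains(key)) {         // stand-in for real predicate evaluation
                    result.add(rec);
                }
            }
        }
        return result;
    }
}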
R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)4
[Fig. 3: Architecture of PABIRS. The DFS wrapper sits on top of the DFS and exposes InputFormat, Insert(key, value), and Lookup(key) interfaces.]
Fig. 4: Bitmap example
• Each data block keeps a signature
• DAG-based hierarchical structure (directory vertices → data vertices); a signature sketch follows the figure placeholder below
[Fig. 4: Bitmap Example. Each data block carries a block signature, a bit vector over the UID values that appear in that block.]
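As a rough illustration of how such block signatures prune reads, the sketch below (invented names, not the paper's implementation) maps each key value to one bit, ORs the bits of a block's keys into the block signature, merges child signatures into a directory-vertex signature by a further OR, and tests a query key with a single bit lookup. It assumes several values may share a bit, as in the paper's parameter m, so a set bit only means the block may contain the key.

import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of per-block bitmap signatures and OR-merged directory signatures.
public class BitmapSignatures {
    private final int bits;  // signature width; several values may map to the same bit

    public BitmapSignatures(int bits) {
        this.bits = bits;
    }

    private int bitFor(String value) {
        // Map a value to one bit position (stand-in for the paper's value-to-bit mapping).
        return Math.floorMod(value.hashCode(), bits);
    }

    // Build the signature of one data block from the key values of its records.
    public BitSet blockSignature(List<String> keysInBlock) {
        BitSet sig = new BitSet(bits);
        for (String key : keysInBlock) {
            sig.set(bitFor(key));
        }
        return sig;
    }

    // A directory vertex summarizes its children by taking the union ("OR") of their signatures.
    public BitSet directorySignature(List<BitSet> childSignatures) {
        BitSet summary = new BitSet(bits);
        for (BitSet child : childSignatures) {
            summary.or(child);
        }
        return summary;
    }

    // A block (or whole subtree) can be skipped if the query key's bit is not set.
    public boolean mayContain(BitSet signature, String queryKey) {
        return signature.get(bitFor(queryKey));
    }
}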
[Excerpt of Section III (Hybrid Indexing Scheme) and its parameter list, shown as an image from the original paper]
[Excerpt from the original LSM-tree paper, including Figure 2.2: conceptual picture of rolling merge steps between the in-memory C0 tree and the on-disk C1 tree, with results written back to disk]
LSM Tree [O'Neil+, '96]
• Feature: high write throughput
• Writes go to the in-memory C0 component (an AVL tree)
• When C0 grows beyond a size threshold, it is rolling-merged into the on-disk C1 component (a B-tree); a minimal sketch is shown below
Cost estimation
Figures are cited from the original papers (the LSM figure is from the original LSM-tree paper).
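Here is a minimal sketch of the LSM write path described above, under simplifying assumptions: a plain size threshold triggers a full merge of the in-memory component into the on-disk component (the real LSM-tree merges incrementally via the rolling merge), and Java's TreeMap stands in for both the AVL tree and the B-tree. All names are illustrative.

import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of LSM-style writes: an in-memory C0 absorbs inserts and is
// merged into the on-disk C1 when it exceeds a size threshold.
public class TinyLsm {
    private final TreeMap<String, String> c0 = new TreeMap<>(); // in-memory component
    private final TreeMap<String, String> c1 = new TreeMap<>(); // stand-in for the on-disk component
    private final int c0Threshold;

    public TinyLsm(int c0Threshold) {
        this.c0Threshold = c0Threshold;
    }

    public void insert(String key, String value) {
        c0.put(key, value);                   // fast: memory-only write
        if (c0.size() >= c0Threshold) {
            mergeIntoC1();                    // the real system merges incrementally (rolling merge)
        }
    }

    public String get(String key) {
        String v = c0.get(key);               // newest data first
        return (v != null) ? v : c1.get(key);
    }

    private void mergeIntoC1() {
        for (Map.Entry<String, String> e : c0.entrySet()) {
            c1.put(e.getKey(), e.getValue()); // sorted merge into the larger component
        }
        c0.clear();
    }
}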
Optimizing the hybrid index
1. Cost model and optimization of the bitmap signature
  – High-level vertices are generated from low-level vertices according to a fanout parameter F
  – A cost model is defined and the cost-minimizing F is estimated
  – The signature graph is searched with the BSP model of Pregel [Malewicz+, '10]
2. Optimization with LSM
  – An LSM index is created for a key whose selectivity exceeds a threshold (a decision sketch follows below)
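The selection rule behind item 2 can be sketched as a per-key cost comparison: a key is promoted into the LSM index only if its estimated LSM lookup cost is no larger than the bitmap-based cost. The cost functions below are placeholders that only capture the shape of the trade-off (bitmap cost grows with the number of candidate blocks, LSM cost is roughly a B-tree descent plus reading the matching tuples); they are not the paper's exact model, and all names are invented.

// Hypothetical per-key decision: index a key in the LSM only if its estimated
// LSM lookup cost is no larger than the estimated bitmap-based lookup cost.
public class IndexChoice {
    private final double blockReadCost;   // rd: cost of reading one data block
    private final double vertexCost;      // rl: cost of inspecting one directory/data vertex
    private final double btreeFanout;     // E: fanout of the LSM's B-tree

    public IndexChoice(double blockReadCost, double vertexCost, double btreeFanout) {
        this.blockReadCost = blockReadCost;
        this.vertexCost = vertexCost;
        this.btreeFanout = btreeFanout;
    }

    // Placeholder bitmap cost: traverse the signature graph, then read every candidate block.
    double bitmapCost(long signatureVertices, long candidateBlocks) {
        return signatureVertices * vertexCost + candidateBlocks * blockReadCost;
    }

    // Placeholder LSM cost: descend the B-tree, then read the blocks holding the matching tuples.
    double lsmCost(long totalRecords, double selectivity, long tuplesPerBlock) {
        double descent = Math.log(totalRecords) / Math.log(btreeFanout);
        double matchingBlocks = Math.ceil(selectivity * totalRecords / tuplesPerBlock);
        return descent * vertexCost + matchingBlocks * blockReadCost;
    }

    public boolean putKeyInLsm(long signatureVertices, long candidateBlocks,
                               long totalRecords, double selectivity, long tuplesPerBlock) {
        return lsmCost(totalRecords, selectivity, tuplesPerBlock)
                <= bitmapCost(signatureVertices, candidateBlocks);
    }
}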
[Fig. 7: Search Cost of Bitmap and LSM (processing time vs. selectivity of call numbers); Fig. 8: Index Update (index manager with data statistics, bitmap signature, LSM index, and a MapReduce job over the DFS). Figures from the original paper.]
Fig. 7: Search cost of bitmap vs. LSM
• The LSM cost is roughly constant regardless of selectivity
• The bitmap is faster when selectivity is at most 0.1%
• In practice, more than 90% of queries have selectivity below 0.1%
Fig. 8: Index update
• New data is appended to the DFS
• An offline MapReduce job updates the bitmap signatures and the LSM hot keys (a toy sketch of this flow follows)
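A toy illustration of this update flow, not the authors' implementation: appends go straight to block storage, and a periodic batch pass (standing in for the offline MapReduce job) rebuilds the signatures of newly written blocks and recomputes key frequencies so that hot keys can be promoted into the LSM index. All names are invented.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the update flow: appends are cheap; a batch pass refreshes the indexes.
public class IndexRefresher {
    private final List<List<String>> blocks = new ArrayList<>();   // appended data blocks
    private final List<BitSet> signatures = new ArrayList<>();     // one signature per block
    private final Map<String, Integer> keyFrequency = new HashMap<>();
    private final int signatureBits;

    public IndexRefresher(int signatureBits) {
        this.signatureBits = signatureBits;
    }

    // Fast path: just append the new block; indexes are refreshed later in batch.
    public void appendBlock(List<String> keysInBlock) {
        blocks.add(new ArrayList<>(keysInBlock));
    }

    // Batch path: build signatures for blocks that do not have one yet and
    // update key frequencies so hot keys can be promoted into the LSM index.
    public List<String> refresh(int hotKeyThreshold) {
        for (int i = signatures.size(); i < blocks.size(); i++) {
            BitSet sig = new BitSet(signatureBits);
            for (String key : blocks.get(i)) {
                sig.set(Math.floorMod(key.hashCode(), signatureBits));
                keyFrequency.merge(key, 1, Integer::sum);
            }
            signatures.add(sig);
        }
        List<String> hotKeys = new ArrayList<>();
        for (Map.Entry<String, Integer> e : keyFrequency.entrySet()) {
            if (e.getValue() >= hotKeyThreshold) {
                hotKeys.add(e.getKey());   // candidates for the LSM index
            }
        }
        return hotKeys;
    }
}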
  
[Excerpt from Section III of the original paper, including Table I (notation) and Fig. 5: Demonstration of Signature Graph]
Fig. 5: Signature graph
Note: see the original paper for other optimizations, e.g., for analytic workloads.
Figures are cited from the original paper.
Experiments and evaluation
• Environment
  – Implemented on top of Hadoop 1.0.4 + GPS [Salihoglu+, '12] (an open-source implementation of Pregel)
  – Evaluated on a 32-node Hadoop cluster (4-core CPU, 8 GB RAM per node)
• Items
  A. Highly selective queries: three select queries over call-history data, compared with HBase Phoenix, Impala, and BIDS [Lu+, '13]
  B. Analytic queries: TPC-H Q3, Q5, and Q6 over synthetic data generated with tpcdskew [Bruno+, '05], compared with Hive
[Result figures and text excerpted from the original paper: Fig. 13 (queries), Fig. 14 (effect of data size), Fig. 19 and Fig. 20 (throughput and response time of concurrent queries, Q1), Fig. 21 (performance of TPC-H queries vs. skew), Fig. 22 (performance of TPC-H queries vs. selectivity), and Table III (processing time of PABIRS for Q1 to Q3)]
A. (Fig. 13) Highly selective queries
B. Fig. 21: TPC-H (skew); Fig. 22: TPC-H (selectivity)
• Small skew: performance comparable to Hive
• Large skew: the index improves performance
• No benefit for Q5 (the index exists only on the orders table)
Figures are cited from the original paper.
Scalable Distributed Transactions across Heterogeneous Stores
• Goal
  – Support transactions over multiple items that span different data stores
• Challenges
  – Handling transactions in the application: error-prone for programmers, and availability and scalability may be lost
  – Introducing coordinator middleware: every application must be placed under its control
• Contributions
  – Cherry Garcia (CG), a client library that supports multi-item transactions across heterogeneous data stores
  – Implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS)
  – Evaluated with YCSB+T [Dey+, '14] (a web-scale transaction benchmark)
BEGIN TRANSACTION
  SET item1 of Store1
  SET item2 of Store2
COMMIT TRANSACTION
A transaction across heterogeneous data stores

public void UserTransaction() {
  Datastore cds = Datastore.create("credentials.xml");
  Datastore gds = Datastore.create("goog_creds.xml");
  Datastore wds = Datastore.create("msft_creds.xml");
  Transaction tx = new Transaction(cds);
  try {
    tx.start();
    Record saving = tx.read(gds, "saving");
    Record checking = tx.read(wds, "checking");
    int s = saving.get("amount");
    int c = checking.get("amount");
    saving.set("amount", s - 5);
    checking.set("amount", c + 5);
    tx.write(gds, "saving", saving);
    tx.write(wds, "checking", checking);
    tx.commit();
  } catch (Exception e) {
    tx.abort();
  }
}

Listing 1. Example code that uses the API to access two data stores.
Listing 1: A transaction across two data stores, written with the CG API
• Datastore: an instance of a data store
• Transaction: the transaction coordinator
• The code reads 'saving' from the Google Cloud Storage Datastore (gds) and 'checking' from the Windows Azure Storage Datastore (wds), then updates both
• It also uses a Datastore that acts as the Coordinating Data Store (CDS)
The code is cited from the original paper.
Cherry Garcia (CG): a client library
• Assumptions about the underlying platform
  – Strong consistency when reading a single record
  – Atomic single-item update and delete (test-and-set)
  – Items can carry user-defined metadata
(A sketch of such a data-store abstraction follows below.)
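These three assumptions can be captured in a small store-facing interface. The sketch below is an assumption-laden illustration rather than Cherry Garcia's actual API: it models test-and-set with a version tag that reads return and conditional writes check, and lets each item carry user-defined metadata. All names are invented.

import java.util.Map;

// Hypothetical minimal interface for a data store usable by a CG-style client library.
public interface KeyValueStore {

    // A single item: payload, user-defined metadata, and a version tag (e.g., an ETag).
    final class Item {
        public final byte[] value;
        public final Map<String, String> metadata;  // can carry transaction metadata (e.g., a TSR URI)
        public final String version;

        public Item(byte[] value, Map<String, String> metadata, String version) {
            this.value = value;
            this.metadata = metadata;
            this.version = version;
        }
    }

    // Strongly consistent read of a single record (returns null if absent).
    Item read(String key);

    // Atomic test-and-set: succeeds only if the stored version still equals expectedVersion
    // (use null to require that the key does not exist yet).
    boolean compareAndPut(String key, Item newItem, String expectedVersion);

    // Atomic conditional delete of a single item.
    boolean compareAndDelete(String key, String expectedVersion);
}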
[Excerpt from Section II of the original paper and Fig. 1 (library architecture): applications access Tora, Windows Azure Storage, and Google Cloud Storage through per-store Datastore abstractions and store-specific REST APIs inside the Cherry Garcia library, with a coordinating store holding the transaction status record (TSR). The excerpt notes that each data item keeps its last committed and possibly its currently active version, tagged with metadata such as the commit time and the transaction identifier, which points to a globally visible TSR via a URI; the client uses the TSR to decide which version to read, and commit happens by updating the TSR in one step.]
Fig. 1: Library architecture
Figures are cited from the original paper.
• Overview of client-driven transaction coordination
  • Each record is treated like a single-item database
  • Transactions are coordinated with two-phase commit (2PC)
  • There is no central coordinator
  • Transaction state is carried with the data, and coordination is performed by the client
Timeline of a CG transaction
• 2PC
  – Each data item carries its current state and its previous state
  – PREPARED flags are set in the order of the keys' hash values
  – A Transaction Status Record (TSR) is written to the Coordinating Data Store (CDS), and then the COMMITTED flags are set (in parallel); a commit-path sketch follows below
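A rough sketch of this client-side commit path is shown below, under stated assumptions: "PREPARED" and "COMMITTED" are written as per-item version state, the TSR is a record in the coordinating store whose successful write is the commit point, and each store exposes a conditional write. The structure and names are illustrative, not the library's actual implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the client-driven commit path: PREPARE each item in
// key-hash order, publish the TSR with one write, then mark items COMMITTED in parallel.
public class CommitSketch {

    public interface Store {
        // Conditionally writes a new version of an item tagged with the given state
        // ("PREPARED" or "COMMITTED") and transaction id; returns false on a conflict.
        boolean writeVersion(String key, byte[] newValue, String state, String txId);
    }

    public interface CoordinatingStore {
        // Writing the transaction status record is the commit point.
        boolean writeTsr(String txId);
        void deleteTsr(String txId);
    }

    public static final class WriteOp {
        public final Store store;
        public final String key;
        public final byte[] newValue;

        public WriteOp(Store store, String key, byte[] newValue) {
            this.store = store;
            this.key = key;
            this.newValue = newValue;
        }
    }

    public boolean commit(String txId, List<WriteOp> writes, CoordinatingStore cds) {
        // Phase 1: PREPARE the new versions in a deterministic order (here: key hash).
        List<WriteOp> ordered = new ArrayList<>(writes);
        ordered.sort(Comparator.comparingInt((WriteOp w) -> w.key.hashCode()));
        for (WriteOp w : ordered) {
            if (!w.store.writeVersion(w.key, w.newValue, "PREPARED", txId)) {
                return false;  // conflict detected: the transaction aborts
            }
        }
        // Commit point: a single write of the transaction status record to the CDS.
        if (!cds.writeTsr(txId)) {
            return false;
        }
        // Phase 2: flip the prepared versions to COMMITTED; this step can run in parallel.
        ordered.parallelStream()
               .forEach(w -> w.store.writeVersion(w.key, w.newValue, "COMMITTED", txId));
        cds.deleteTsr(txId);  // the TSR can be cleaned up once every item is committed
        return true;
    }
}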
[Fig. 2 from the original paper: the timeline describing 3 transactions running on 2 client hosts to access records in 2 data stores (WAS and GCS) using a third data store as the CDS; item versions move from PREPARED to COMMITTED, and the TSR is written to the CDS at commit. Figure cited from the original paper.]
(Fig. 2)
Implementation and experiments
• Cherry Garcia implementation
  – A Java library (JDK 1.6)
  – The Datastore abstraction is implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS running on the WiredTiger storage engine)
• Experiments
Figures are cited from the original paper.
[Figures from the original paper: Fig. 6 (aborts per million transactions while varying theta, 1 YCSB+T client against a 1-node Tora cluster), Fig. 7 (throughput of 4 YCSB+T client hosts with 1 to 64 threads each against a 4-node Tora cluster), Fig. 8 (throughput of YCSB+T with 16 to 128 threads on 8 client hosts against a 4-node Tora cluster)]
Fig. 8: Throughput when 8 client hosts run 16 to 128 transaction threads against a 4-node Tora cluster
[Experiment description excerpted from the original paper (latency growth beyond 16 threads per host, the contention experiment varying theta, and the scale-out test), together with Fig. 9: overhead of transactions and the effect of the 1-phase optimization]
Fig. 9: Transaction overhead and the effect of the 1-phase optimization (*)
(*) The PREPARE phase is skipped for transactions that touch only a single item
• Throughput scales linearly (up to 23,288 transactions/sec)
• The overhead with the 1-phase optimization is small
• Parallelizing phase 2 improves throughput
Contenu connexe

Tendances

Tutorial on Parallel Computing and Message Passing Model - C4
Tutorial on Parallel Computing and Message Passing Model - C4Tutorial on Parallel Computing and Message Passing Model - C4
Tutorial on Parallel Computing and Message Passing Model - C4
Marcirio Chaves
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
Ali Usman
 
Data structure-questions
Data structure-questionsData structure-questions
Data structure-questions
Shekhar Chander
 
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Kiruthikak14
 
Interpolation wikipedia
Interpolation   wikipediaInterpolation   wikipedia
Interpolation wikipedia
hort34
 

Tendances (17)

Tutorial on Parallel Computing and Message Passing Model - C4
Tutorial on Parallel Computing and Message Passing Model - C4Tutorial on Parallel Computing and Message Passing Model - C4
Tutorial on Parallel Computing and Message Passing Model - C4
 
Notes 8086 instruction format
Notes 8086 instruction formatNotes 8086 instruction format
Notes 8086 instruction format
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 
Data structure-question-bank
Data structure-question-bankData structure-question-bank
Data structure-question-bank
 
Data structure-questions
Data structure-questionsData structure-questions
Data structure-questions
 
Lesson11 transactions
Lesson11 transactionsLesson11 transactions
Lesson11 transactions
 
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
 
Unit 1 ca-introduction
Unit 1 ca-introductionUnit 1 ca-introduction
Unit 1 ca-introduction
 
Ieeepro techno solutions ieee java project - nc cloud applying network codi...
Ieeepro techno solutions   ieee java project - nc cloud applying network codi...Ieeepro techno solutions   ieee java project - nc cloud applying network codi...
Ieeepro techno solutions ieee java project - nc cloud applying network codi...
 
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
Bigdata analytics K.kiruthika 2nd M.Sc.,computer science Bon secoures college...
 
Implementação do Hash Coalha/Coalesced
Implementação do Hash Coalha/CoalescedImplementação do Hash Coalha/Coalesced
Implementação do Hash Coalha/Coalesced
 
Db2 faqs
Db2 faqsDb2 faqs
Db2 faqs
 
Furnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree StructuresFurnish an Index Using the Works of Tree Structures
Furnish an Index Using the Works of Tree Structures
 
Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)
Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)
Storage Management - Lecture 8 - Introduction to Databases (1007156ANR)
 
Interpolation wikipedia
Interpolation   wikipediaInterpolation   wikipedia
Interpolation wikipedia
 
A Novel Approach of Caching Direct Mapping using Cubic Approach
A Novel Approach of Caching Direct Mapping using Cubic ApproachA Novel Approach of Caching Direct Mapping using Cubic Approach
A Novel Approach of Caching Direct Mapping using Cubic Approach
 

En vedette (6)

ICDE2014 Session 14 Data Warehousing
ICDE2014 Session 14 Data WarehousingICDE2014 Session 14 Data Warehousing
ICDE2014 Session 14 Data Warehousing
 
VLDB2013 Session 1 Emerging Hardware
VLDB2013 Session 1 Emerging HardwareVLDB2013 Session 1 Emerging Hardware
VLDB2013 Session 1 Emerging Hardware
 
巨大な表を高速に扱うData.table について
巨大な表を高速に扱うData.table について巨大な表を高速に扱うData.table について
巨大な表を高速に扱うData.table について
 
data.tableパッケージで大規模データをサクッと処理する
data.tableパッケージで大規模データをサクッと処理するdata.tableパッケージで大規模データをサクッと処理する
data.tableパッケージで大規模データをサクッと処理する
 
遅延価値観数と階層ベイズを用いた男心をくすぐる女の戦略.R
遅延価値観数と階層ベイズを用いた男心をくすぐる女の戦略.R遅延価値観数と階層ベイズを用いた男心をくすぐる女の戦略.R
遅延価値観数と階層ベイズを用いた男心をくすぐる女の戦略.R
 
Rデータフレーム自由自在
Rデータフレーム自由自在Rデータフレーム自由自在
Rデータフレーム自由自在
 

Similaire à ICDE2015 Research 3: Distributed Storage and Processing

for sbi so Ds c c++ unix rdbms sql cn os
for sbi so   Ds c c++ unix rdbms sql cn osfor sbi so   Ds c c++ unix rdbms sql cn os
for sbi so Ds c c++ unix rdbms sql cn os
alisha230390
 
Network Flow Pattern Extraction by Clustering Eugine Kang
Network Flow Pattern Extraction by Clustering Eugine KangNetwork Flow Pattern Extraction by Clustering Eugine Kang
Network Flow Pattern Extraction by Clustering Eugine Kang
Eugine Kang
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Derryck Lamptey, MPhil, CISSP
 

Similaire à ICDE2015 Research 3: Distributed Storage and Processing (20)

Low complexity low-latency architecture for matching
Low complexity low-latency architecture for matchingLow complexity low-latency architecture for matching
Low complexity low-latency architecture for matching
 
Samsung DeepSort
Samsung DeepSortSamsung DeepSort
Samsung DeepSort
 
An overview of fragmentation
An overview of fragmentationAn overview of fragmentation
An overview of fragmentation
 
for sbi so Ds c c++ unix rdbms sql cn os
for sbi so   Ds c c++ unix rdbms sql cn osfor sbi so   Ds c c++ unix rdbms sql cn os
for sbi so Ds c c++ unix rdbms sql cn os
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
GEN: A Database Interface Generator for HPC Programs
GEN: A Database Interface Generator for HPC ProgramsGEN: A Database Interface Generator for HPC Programs
GEN: A Database Interface Generator for HPC Programs
 
ADBS_parallel Databases in Advanced DBMS
ADBS_parallel Databases in Advanced DBMSADBS_parallel Databases in Advanced DBMS
ADBS_parallel Databases in Advanced DBMS
 
OMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELS
OMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELSOMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELS
OMT: A DYNAMIC AUTHENTICATED DATA STRUCTURE FOR SECURITY KERNELS
 
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
P REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATIONP REFIX - BASED  L ABELING  A NNOTATION FOR  E FFECTIVE  XML F RAGMENTATION
P REFIX - BASED L ABELING A NNOTATION FOR E FFECTIVE XML F RAGMENTATION
 
Network Flow Pattern Extraction by Clustering Eugine Kang
Network Flow Pattern Extraction by Clustering Eugine KangNetwork Flow Pattern Extraction by Clustering Eugine Kang
Network Flow Pattern Extraction by Clustering Eugine Kang
 
Modifications in lsb based steganography
Modifications in lsb based steganographyModifications in lsb based steganography
Modifications in lsb based steganography
 
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-pas...
 
Spatio textual similarity join
Spatio textual similarity joinSpatio textual similarity join
Spatio textual similarity join
 
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...IRJET-  	  Clustering of Hierarchical Documents based on the Similarity Deduc...
IRJET- Clustering of Hierarchical Documents based on the Similarity Deduc...
 
A41001011
A41001011A41001011
A41001011
 
Advanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big DataAdvanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big Data
 
«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»«Дизайн продвинутых нереляционных схем для Big Data»
«Дизайн продвинутых нереляционных схем для Big Data»
 
Linked list (introduction) 1
Linked list (introduction) 1Linked list (introduction) 1
Linked list (introduction) 1
 
Ijcatr04051012
Ijcatr04051012Ijcatr04051012
Ijcatr04051012
 
FractalTreeIndex
FractalTreeIndexFractalTreeIndex
FractalTreeIndex
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 

ICDE2015 Research 3: Distributed Storage and Processing

  • 1. 【ICDE2015  勉強会】 Research  3: Distributed  Storage  and  Processing 担当:  若若森  拓拓⾺馬  (NTT) 2015.5.16
  • 2. 紹介する論論⽂文 1. PABIRS:  A  Data  Access  Middleware   for  Distributed  File  Systems   – S.  Wu  (Zhejiang  Univ.),  G.  Chen,  X.  Zhou,  Z.   Zhang,  A.  K.  H.  Tung,  and  M.  Winslett   (UIUC) 2. Scalable  Distributed  Transactions   across  Heterogeneous  Stores – A.  Dey,  A.  Fekete,  and  U.  Röhm (Univ.  of  Sydney) R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)2
  • 3. •  ⽬目的 –  選択率率率の⾼高いクエリや分析クエリの混合したワークロードを分散FS 上で効率率率的に処理理する •  課題 –  ソートやインデキシング等の前処理理が有効だが,複雑な前処理理は挿 ⼊入のスループットを低下する –  べき乗分布の実データ(下図)のインデックス設計も困難 •  貢献 –  複雑なクエリの混合するワークロードを効率率率的に処理理する 統合データアクセスミドルウェア  (PABIRS)  の提案 数10億の通話履履歴から直近数カ⽉月分(数千)のレコードを検索索 PABIRS:  A  Data  Access  Middleware  for   Distributed  File  Systems   R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)3 0 500 1000 1500 2000 2500 0 100 200 300 400 500 600 700 800 900 1000 CallFrequency Caller ID Call Frequency Fig. 1. Distribution of Call Frequency 0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 700 800 900 1000 NumberofBlocks Caller ID Number of Blocks Fig. 2. Number of Blocks per Key PABIRS 図は元論論⽂文より引⽤用 例例)  ある電話会社の通話ログデータから1,000個の電話番号(ID)をランダムに抽出した結果 Fig.  1.  通話頻度度の分布 0 500 1000 1500 2000 2500 0 100 200 300 400 500 600 700 800 900 1000 CallFrequency Caller ID Call Frequency Fig. 1. Distribution of Call Frequency 0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 700 800 900 1000 NumberofBlocks Caller ID Number of Blocks Fig. 2. Number of Blocks per Key support efficient data retrieval for various query workloads. PABIRS Fig.  2.  IDの含まれるデータブロック数 ・1%  の  ID  による通話が     半分以上を占める   ・べき乗分布  (power-­‐law) 場所や通話回数などの属性で集約して分析 ・頻出  ID  のレコードが     ほぼ全てのDFSブロック     中に存在
  • 4. PABIRS  =  Bitmap  +  LSM  index •  DFS上の(半)構造化データへのアクセス⼿手段 –  DFSへのGETインタフェース –  MapReduce処理理:mapへのinputformat –  KVSのトランザクション:secondary  index •  DFS  wrapper:  ハイブリッドインデックス –  Bitmap  index:選択率率率の低いキー/タプル向け –  LSM  index:hot  value  に対してのみ⽣生成 R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)4 PABIRS DFS g. 3. Architecture of PABIRSFig.  3.  PABIRS  のアーキテクチャ InputFormatInsert(key,  value)Lookup(key) Fig.  4.  bitmap  の例例 ・ブロック毎に  signature  を保持   ・DAGベースの階層構造     (directory  vertices  à  data  vertices)   1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 u u u u u u UID u u u u u 1 0 0 1 0 1 0 0 0 1 block signature data block 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 data block 2 block signature UID Fig. 4. Bitmap Example III. HYBRID INDEXING SCHEME The general idea behind our hybrid indexing scheme is to build bitmap signatures for all data blocks and select certain hot keys for LSM index. Bitmap signature is created for multiple attributes without re-ordering the records. To facilitate efficient parallel search, we design a hierarchical model based on a virtual Directed Acyclic Graph (DAG) structure, in which each intermediate vertex is a summary of the signatures accessible on its descendants. We present an example DAG structure in Figure 5 as a virtual index structure Param s, s1 v pj N Bp Bt ✓ k m F W r with entries taken from the leaf level of the C0 tree, thus decreasing the size of C0, and creates a newly merged leaf node of the C1 tree. The buffered multi-page block containing old C1 tree nodes prior to merge is called the emp- tying block, and new leaf nodes are written to a different buffered multi-page block called the filling block. When this filling block has been packed full with newly merged leaf nodes of C1, the block is written to a new free area on disk. The new multi-page block containing merged results is pictured in Figure 2.2 as lying on the right of the former nodes. Subsequent merge steps bring together increasing index value segments of the C0 and C1 components until the maximum values are reached and the rolling merge starts again from the smallest values. C1 tree C0 tree Disk Memory Figure 2.2. Conceptual picture of rolling merge steps, with result written back to disk Newly merged blocks are written to new disk positions, so that the old blocks will not be over- written and will be available for recovery in case of a crash. The parent directory nodes in C1, also buffered in memory, are updated to reflect this new leaf structure, but usually remain in buffer for longer periods to minimize I/O; the old leaf nodes from the C1 component are in- validated after the merge step is complete and are then deleted from the C1 directory. In gen- eral, there will be leftover leaf-level entries for the merged C1 component following each merge step, since a merge step is unlikely to result in a new node just as the old leaf node empties. The same consideration holds for multi-page blocks, since in general when the filling block has filled with newly merged nodes, there will be numerous nodes containing entries still LSM    Tree  [O'Neil+,  ’96] ・特徴:  ⾼高い書込スループット   ・インメモリのC0(AVL-­‐Tree)に     書き込み   ・C0のサイズがしきい値を     超えた時,ディスク上の     C1  (B-­‐Tree)  に  rolling  merge   コスト⾒見見積り 図は元論論⽂文より引⽤用 (図:  LSM  Tree  の元論論⽂文より)
  • 5. ハイブリッドインデックスの最適化 1.  Bitmap  Signatureのコストモデルと最適化 –  fanout  パラメータ  F  に基づき,low-‐‑‒level vertices  からhigh-‐‑‒level  vertexを⽣生成 –  コストモデルを定義,コスト最⼩小化するFを推定 –  Pregel  [Malewicz+,  ʼ’10]  の  BSP  でグラフ探索索 2.  LSMによる最適化 –  あるキーの選択率率率がしきい値を超えた場合にLSM  indexを作成 R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)5 0 200 400 600 800 1000 1200 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 ProcessingTime(msec) Selectivities of Call Numbers (%) bitmap lsm Cost of Bitmap and LSM he fanout of the B-tree. We try to insert the key ex, only when the estimated cost is no larger than Index Manager Data Statistics DFS Data Stream Append LSM Index New Data Bitmap Signature MapReduce Algorithm Fig. 8. Index Update Fig.  7.  bitmap  と  LSM  の検索索コスト ・LSM  は選択率率率によらず⼀一定   ・選択率率率  0.1  %以下の場合は     bitmap  が⾼高速   ・実際は  90%  以上のクエリが       0.1  %以下  0.1 0 200 400 600 800 1000 1200 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 ProcessingTime(msec) Selectivities of Call Numbers (%) bitmap lsm Fig. 7. Search Cost of Bitmap and LSM where E is the fanout of the B-tree. We try to insert the key into LSM index, only when the estimated cost is no larger than the cost of bitmap index. Based on the inequality above, we are able to calculate the minimal selectivity, which makes LSM a more attractive selection than the bitmap. In Figure 7, we apply the theoretical Index Manager Data Statistics DFS Data Stream Append LSM Index New Data Bitmap Signature MapReduce Algorithm Fig. 8. Index Update C. Update on the Indices PABIRS is specifically designed for the applications that require fast data insertion. In PABIRS, bitmap index is a lightweight index which can be built in a batch, while LSM index is intentionally designed to support the fast insertion. Fig.  8.  インデックスの更更新 ・新規データはDFSに追記   ・オフラインの  MR  Jobで     Bitmap  signature  と  LSM  の     Hot-­‐key  を更更新   for multiple attributes without re-ordering the records. To facilitate efficient parallel search, we design a hierarchical model based on a virtual Directed Acyclic Graph (DAG) structure, in which each intermediate vertex is a summary of the signatures accessible on its descendants. We present an example DAG structure in Figure 5 as a virtual index structure on two different attributes, using 8 bits and 5 bits for these two attributes respectively. Generally speaking, the DAG structure consists of three layers. The retrieval layer contains individual signatures cor- responding to the data blocks, while each intermediate vertex in the index layer is associated with a summary signature by merging signatures of its children vertices. Data layer refers to the physical data blocks stored in the DFS. Signatures and their corresponding graph vertices are randomly distributed to multiple DFS nodes. In the rest of the paper, we refer the vertices in the retrieval layer as data vertices and the vertices in the index layer as directory vertices. On the other hand, LSM index replicates the records with hot keys and sorts them in its B-trees. For each indexing attribute, we independently create an LSM index to maintain its sorted replicas of hot data. In the rest of the section, we first introduce the tuning approaches used on the bitmap-based index, followed by the selection strategy between these two indices. For better readability, the notations used in the section are summarized in Table I. A. Optimizations on Bitmap Signature Suppose the signature of each data block follows the same distribution {p1, p2, . . . 
, pk}, in which each pj indicates the probability of having “1” on the j-th bit. Because of the exclusiveness between the values, the signature is a sparse vector, i.e. P j pj ⌧ k. Given two signatures s1 and s2, the expected number of common “1” on j-th bits in both s1 and s2 is P j(pj)2 . It is much smaller than the expected number of “1” in either s1 or s2, i.e. P j(pj), unless there exists a pj dominating the distribution. When records are randomly assigned to the data blocks, each probability pj is supposed to be a small positive number. This leads to the phenomenon of Weak Locality in PABIRS. It N total number of records/tuples Bp size of a data block Bt size of a tuple ✓ query selectivity k number of distinct values of the attributes m number of values mapped to the same bit F fanout of the directory vertex W number of virtual machines (workers) rl computation cost of a directory vertex rn network delay between any pair of vertices rd the overhead of reading a data block ✓ selectivity of a particular queried key fmin minimum frequency of any value in a domain p(✓) pdf of distribution on selectivity ✓ of queries. ... …... Fig. 5. Demonstration of Signature Graph is thus not helpful to group similar signatures when building high-level directory vertex in the index layer, because such merging only generates a new signature with a union of “1”s from the signatures of its children vertices. Although it is unlikely to optimize by better grouping, the fanout of the abstract tree structure, i.e. the number of children vertices for every directory vertex, remains tunable and turns out to be crucial to the searching efficiency. 1) Cost Model and Fanout Optimization: Instead of picking up similar signatures during bitmap construction, PABIRS simply groups the low-level vertices to generate a high-level vertex, based on a pre-specified fanout parameter F. Specifi- 116 Fig.  5.  signature  graph ※ほか,分析ワークロード向けの最適化等については元論論⽂文を参照   図は元論論⽂文より引⽤用
  • 6. 実験,評価 •  環境 –  Hadoop  1.0.4  +  GPS  [Salihoglu+,  ʼ’12]  (Pregel  のOSS実装)をベースに 実装 –  4コアcpu,  8G  RAM,  32ノードのhadoopクラスタで実験 •  項⽬目 A.  ⾼高選択率率率クエリ:電話履履歴データに対する3種のselectクエリをHBase   Phoenix,  Impala,  BIDS  [Lu+,  13]と⽐比較 B.  分析クエリ:tpcdskew  [Bruno+,  ʻ‘05]  で⽣生成した⼈人⼯工データを⽤用いて TPC-‐‑‒H  Q3,  Q5,  Q6をHiveと⽐比較 R3:  Distributed  Storage  and  Processing  担当:若若森(NTT)6 320 400 ) 0 50 100 150 200 250 300 350 400 Q1 Q2 Q3 ResponseTime(second) HBase PABIRS BIDS Impala-8G Impala-4G Fig. 13. Queries 1 10 100 1000 80 160 240 320 400 AverageResponseTime(second) Data Size (G) Q1 Q2 Q3 Fig. 14. Effect of Data Size 40 50 onseTime PABIRS HBase 150 200 nute) PABIRS 0 20 40 60 80 00 20 40 60 0 4 8 12 16 20 24 28 32 Query Batch Size Single Processor Thread Quad Processor Thread 19. Throughput of Concurrent ries (Q1) 0 5 10 15 20 25 30 35 40 45 50 0 4 8 12 16 20 24 28 32 AverageResponseTime(second) Query Batch Size Single Processor Thread Quad Processor Thread Fig. 20. Response Time of Concurrent Queries (Q1) 10 100 1000 10000 1 2 3 4 ResponseTime(second) Skew Factor q3/h q3/p q5/h q5/p q6/h q6/p Fig. 21. Performance of TPC-H Query (Skew) 10 100 1000 10000 3.5%(3) 7.1%(6) 10.5%(9)14.2%(12) ResponseTime(second) Query Selectivity (month) q3/h q3/p q5/h q5/p q6/h q6/p Fig. 22. Performance of TPC-H Query (Selectivity) emory which leads to the “Memory Limit Exceeded” tion. or Q3 and Q6, we build an index for the column shipdate we increase ✏ to a larger value (e.g., 365), PABIRS finds th index-based access is even worse than scan-based access. will automatically switch to the disk scan, which generate better station. To avoid query with empty result, we intentionally select a number with at least one record under the base station. PABIRS can effectively handle queries with a high selectiv- ity but still involving numerous tuples. As shown in Table III, in our 160G dataset, we have 40960 blocks in total. Although the selectivities of the queries are as low as 0.00001%, the records related to Q1, Q2 and Q3 cover 477, 28863 and 343 data blocks respectively. The involved data blocks, especially for Q1 and Q3, are no close to the total number of data blocks, while the overhead of loading hundreds of data blocks from the disks remains high. In experiments, PABIRS, Phoenix and BIDS are allowed to use 4 GB main memory on each node of the cluster, while Impala are tested under two settings with 4 GB and 8 GB main memory respectively. The results in Figure 13 shows that Impala-4G is unable to finish the queries in reasonable time (i.e. 1,000 seconds), as it incurs high I/O cost on memory- disk data swap. It reveals the limitation of Impala on memory usage efficiency. Moreover, Impala and BIDS show a similar performance for all queries, because both approaches adopt the scan-based techniques (memory scan and disk scan). In the rest of the experiments, we only report the results of Impala- 8G, denoted as Impala in abbreviation. The results also imply that PABIRS significantly outperforms the other systems on all queries. When the selectivity of the query is high, such as Q1 and Q3, HBase Phoenix is the only alternative with close performance to PABIRS, because of its adoption of secondary index. But for the query involving a large portion of data like Q2, HBase Phoenix is slow as it incurs many random I/Os to retrieve all results. TABLE III. 
PROCESSING TIME OF PABIRS QID selectivity index time disk time total time Q1 1.2% 1.03s 1.47s 2.50s Q2 70% 2.11s 137.63s 139.74s Q3 0.8% 1.04s 1.28s 2.32s To gain better insight into the scalability of PABIRS, we the performances of PABIRS and HBase Phoenix degrade slightly when more insertions are conducted, because they need to build and query indexes for the new tuples. Finally, we implement a simple transaction module as discussed in Section 2. Our test transaction retrieves all records of a specific phone number (normally hundreds to thousands of records) and updates the values of NeID in those records to a new value. We vary the number of concurrent transactions and ss shown in Figure 18, for this test transaction, PABIRS can provide a good throughput. In PABIRS, queries can be grouped into batch and share the index searching process. In Figure 19 and Figure 20, we show the throughput and response time for varied batch size. As each node in the cluster is equipped with a 4-core CPU, we start four concurrent I/O threads at the same time. For comparison purpose, we also show the result when a single I/O thread run. The throughput of four I/O threads is almost three times higher than the single thread case. The throughput improves dramatically for a larger query batch, since we can share more signature and data scans among the queries. However, the results imply that the throughput gain shrinks with the increase of the query batch size. It is thus important to choose an appropriate batch size in real actions. The response time is also affected by the batch size. Figure 20 illustrates that the response time is generally proportional to the batch size. If a strict real-time requirement is needed, it is important for the system to carefully choose batch size, in order to hit a balance between the throughput and response time. C. Analytic Query Performance In this group of experiments, we evaluate the performance of PABIRS on data and queries generated by TPC-H bench- mark. Specifically, we generate 320 GB data with different skew factors using the TPC-H Skew Generator6 . We deploy Hive on top of PABIRS and compare the performances of PABIRS against the original Hive on query Q3, Q5 and Q6 in TPC-H. We also include Impala in the experiment. However, Impala requires buffering all intermediate join results A. B. Fig.  21.  TPC-­‐H  (skew) Fig.  22.  TPC-­‐H  (Selectivity) (Fig.  13) ・skew⼩小:Hiveと同等の性能   ・skew⼤大:インデックスの効果で性能向上   ・Q5には効果なし  (インデックスがordersにしかないため)   図は元論論⽂文より引⽤用
7. Scalable Distributed Transactions across Heterogeneous Stores
• Goal
  – Perform transactions that span multiple items across different data stores
• Challenges
  – If the application handles transactions itself:
    • error-prone for programmers, and availability and scalability may be lost
  – If coordinator middleware is introduced:
    • every application must be placed under its control
• Contributions
  – Cherry Garcia (CG): a client library that supports multi-item transactions across heterogeneous data stores
  – Implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS)
  – Evaluated with YCSB+T [Dey+, '14] (a web-scale transaction benchmark)

  BEGIN TRANSACTION
      SET item1 of Store1
      SET item2 of Store2
  COMMIT TRANSACTION
8. Transactions across heterogeneous data stores

Listing 1. Example code that uses the CG API to access two data stores (code reproduced from the original paper):

  public void UserTransaction() {
      Datastore cds = Datastore.create("credentials.xml");
      Datastore gds = Datastore.create("goog_creds.xml");
      Datastore wds = Datastore.create("msft_creds.xml");
      Transaction tx = new Transaction(cds);
      try {
          tx.start();
          Record saving = tx.read(gds, "saving");
          Record checking = tx.read(wds, "checking");
          int s = saving.get("amount");
          int c = checking.get("amount");
          saving.set("amount", s - 5);
          checking.set("amount", c + 5);
          tx.write(gds, "saving", saving);
          tx.write(wds, "checking", checking);
          tx.commit();
      } catch (Exception e) {
          tx.abort();
      }
  }

• Datastore: a handle to a data store instance
• Transaction: the transaction coordinator
  – Reads 'saving' from the Google Cloud Storage Datastore (gds) and 'checking' from the Windows Azure Storage Datastore (wds), and updates both
  – A third Datastore, acting as the Coordinating Data Store (CDS), is also used
Code is taken from the original paper.
9. Cherry Garcia (CG): a client library
• Assumptions about the underlying data stores
  – Strong consistency when reading a single record
  – Atomic single-item update and delete (test-and-set)
  – User-defined metadata can be stored inside each item
Fig. 1. Library architecture (from the original paper): each application links the Cherry Garcia library, which accesses WAS, GCS, and Tora through per-store Datastore abstractions and their store-specific REST APIs; one store additionally serves as the coordinating storage holding the TSR.
• Protocol overview (client-side transaction coordination)
  – Each record is treated like a single-item database
  – Transactions are coordinated with 2PC
  – There is no central coordinator
  – Transaction state is kept with the data itself, and coordination is done by the client
  – Each data item keeps its last committed version and possibly a currently active version; every version is tagged with the commit time and identifier of the creating transaction, which points (via a URI) to a globally visible Transaction Status Record (TSR). Readers consult the TSR to decide which version to use, and commit amounts to a single update of the TSR.
A sketch of the per-store interface implied by these assumptions follows below.
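To make the platform assumptions concrete, here is a minimal sketch of the interface a per-store Datastore implementation could expose; the names (DatastoreBackend, testAndSet, VersionedRecord, ...) are assumptions for illustration, not Cherry Garcia's actual API.

  // Minimal sketch (assumed, not Cherry Garcia's actual API) of the per-store
  // abstraction the library needs: strongly consistent single-record reads,
  // test-and-set single-item writes, and room for user-defined metadata.
  import java.util.Map;

  interface DatastoreBackend {
      /** Strongly consistent read of a single record, or null if absent. */
      VersionedRecord read(String key);

      /**
       * Atomic test-and-set write: succeeds only if the stored version tag
       * still equals expectedVersionTag. Returns false otherwise.
       */
      boolean testAndSet(String key, VersionedRecord newValue, String expectedVersionTag);

      /** Atomic single-item delete, conditional on the version tag. */
      boolean deleteIfMatch(String key, String expectedVersionTag);
  }

  final class VersionedRecord {
      final byte[] payload;                 // application data
      final Map<String, String> metadata;   // e.g., txn id (TSR URI), commit time, state
      final String versionTag;              // store-provided tag used for test-and-set

      VersionedRecord(byte[] payload, Map<String, String> metadata, String versionTag) {
          this.payload = payload;
          this.metadata = metadata;
          this.versionTag = versionTag;
      }
  }

Conditional writes of this kind (for example, ETag- or generation-based preconditions in cloud object stores) are the sort of primitive that testAndSet stands in for here.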
10. Timeline of a CG transaction
• 2PC
  – Each data item carries both its current state and its previous state
  – PREPARED flags are set on the items in the order of their keys' hash values
  – A Transaction Status Record (TSR) is written to the Coordinating Data Store (CDS), and the items' COMMITTED flags are then set (in parallel)
Fig. 2. Timeline of three transactions running on two client hosts, accessing records in two data stores and using a third data store as the CDS (figure taken from the original paper).
A sketch of this commit sequence follows below.
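The following is a hedged sketch of the client-side commit flow on this slide, assuming hypothetical helper types (PreparableStore, CoordinatingStore, DirtyRecord); it is meant to show the ordering of the steps, not the library's real implementation.

  // Sketch of the 2PC commit flow described above (assumed names, not the real API):
  // 1) PREPARE each dirty record in hash order of its key (the new version is kept
  //    alongside the previous committed one),
  // 2) write the Transaction Status Record (TSR) to the coordinating store —
  //    this single write is the commit point,
  // 3) flip the records to COMMITTED, which can be done in parallel.
  import java.util.*;
  import java.util.concurrent.*;

  final class CommitSketch {
      void commit(String txnId,
                  SortedMap<Long, DirtyRecord> writeSet,  // keyed by a hash of the record key, giving a deterministic prepare order
                  CoordinatingStore cds,
                  ExecutorService pool) throws Exception {
          // Phase 1: prepare in a deterministic (hash) order.
          for (DirtyRecord r : writeSet.values()) {
              if (!r.store.prepare(r.key, r.newVersion, r.expectedVersionTag, txnId)) {
                  throw new IllegalStateException("prepare failed for " + r.key); // caller aborts
              }
          }
          // Commit point: one write of the TSR to the coordinating data store.
          cds.writeTsr(txnId, "COMMITTED");

          // Phase 2: mark records COMMITTED in parallel; stragglers can be
          // rolled forward later by readers that consult the TSR.
          List<Future<?>> done = new ArrayList<>();
          for (DirtyRecord r : writeSet.values()) {
              done.add(pool.submit(() -> r.store.markCommitted(r.key, txnId)));
          }
          for (Future<?> f : done) f.get();
      }

      // Hypothetical supporting types, shown only so the sketch is self-contained.
      interface PreparableStore {
          boolean prepare(String key, byte[] newVersion, String expectedVersionTag, String txnId);
          void markCommitted(String key, String txnId);
      }
      interface CoordinatingStore { void writeTsr(String txnId, String state); }
      static final class DirtyRecord {
          final PreparableStore store; final String key;
          final byte[] newVersion; final String expectedVersionTag;
          DirtyRecord(PreparableStore s, String k, byte[] v, String tag) {
              store = s; key = k; newVersion = v; expectedVersionTag = tag;
          }
      }
  }

Writing the TSR is the commit point: a reader that finds a record still in the PREPARED state can consult the TSR to decide whether to roll the record forward or back.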
11. Implementation and experiments
• Implementation of Cherry Garcia
  – A Java library (JDK 1.6)
  – The Datastore abstraction is implemented for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS running on the WiredTiger storage engine)
• Experiments (figures taken from the original paper)
  – Fig. 6. Aborts measured while varying theta, with one YCSB+T client against a 1-node Tora cluster: aborts increase with contention but remain infrequent even under extreme skew (theta = 0.99)
  – Fig. 7. Throughput of four YCSB+T client hosts, each running 1 to 64 threads, against a 4-node Tora cluster: per-host throughput scales linearly up to 16 threads and then flattens out (a network bottleneck on the client side)
  – Fig. 8. Throughput of YCSB+T with 16 to 128 threads on 8 client hosts against a 4-node Tora cluster: throughput scales linearly, reaching a maximum of 23,288 transactions/sec
  – Fig. 9. Overhead of transactions and the effect of the 1-phase optimization (*): with the optimization the transactional overhead over non-transactional access is small, and parallelizing phase 2 improves throughput (a sketch of the optimization follows below)
    (*) For transactions limited to a single item, the PREPARE phase is skipped
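A minimal sketch of the 1-phase optimization noted for Fig. 9, under the assumption that a transaction whose write set contains exactly one item can commit with a single test-and-set, skipping both the PREPARE phase and the TSR write; all names here are illustrative, not the library's actual API.

  // Sketch (assumed names) of the 1-phase optimization: when the write set
  // contains exactly one record, skip PREPARE and the TSR write and commit
  // with a single atomic test-and-set on that record.
  final class OnePhaseSketch {
      interface SingleItemStore {
          boolean testAndSet(String key, byte[] newCommittedVersion, String expectedVersionTag);
      }

      static final class Write {
          final SingleItemStore store; final String key;
          final byte[] newVersion; final String expectedVersionTag;
          Write(SingleItemStore s, String k, byte[] v, String tag) {
              store = s; key = k; newVersion = v; expectedVersionTag = tag;
          }
      }

      /** Returns true if the single-item transaction committed. */
      boolean commitSingleItem(Write w) {
          // One atomic step: the record itself is the commit point, so no
          // PREPARED state and no Transaction Status Record are needed.
          return w.store.testAndSet(w.key, w.newVersion, w.expectedVersionTag);
      }
  }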