This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA.
The full paper and additional information are available at:
http://msrg.org/papers/Menon2013
Abstract:
With the ever-growing size and complexity of enterprise systems, there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses.
With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential and random access is becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.
CaSSanDra: An SSD Boosted Key-Value Store
1. CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi (*), Hans-Arno Jacobsen
UNIVERSITY OF TORONTO
MIDDLEWARE SYSTEMS RESEARCH GROUP
MSRG.ORG
2. Outline
• Application Performance Management
• Cassandra and SSDs
• Extending Cassandra's Row Cache
• Implementing a Dynamic Schema Catalogue
• Conclusions
3. Modern Enterprise Architecture
• Many different software systems
• Complex interactions
• Stateful systems often distributed/partitioned/replicated
• Stateless systems certainly duplicated
4. Application Performance Management
• Lightweight agent attached to each software system instance
• Monitors system health
• Traces transactions
• Determines root causes
• Raw APM metric:
[Figure: agents attached to each software system instance in the enterprise architecture]
5. Application Performance Management
• Problem: Agents have short memory and only have a local view
• What was the average response time for requests served by servlet X between December 18-31, 2011?
• What was the average time spent in each service/database to respond to client requests?
6. APM Metrics Datastore
• All agents store metric data in a high write-throughput datastore
• Metric data is at a fine granularity (per-action, millisecond, etc.)
• User now has a global view of metrics
• What is the best database to store APM metrics?
7. Cassandra Wins APM
• APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case
• In-memory workloads with 95%, 50% and 5% reads
• Workloads requiring disk access with 95%, 50% and 5% reads
[Figures from Rabl et al. [1] comparing Cassandra, HBase, Voldemort, VoltDB, Redis, and MySQL on 2-12 nodes: Figure 3 (throughput for Workload R), Figure 4 (read latency for Workload R), Figure 5 (write latency for Workload R), and Figure 6 (throughput for Workload RW). Paper excerpt: with only one node, Redis has the highest throughput (more than 50K ops/sec), followed by VoltDB; Cassandra and MySQL reach about half that of Redis (25K ops/sec), Voldemort is 2x slower than Cassandra (12K ops/sec), and HBase is slowest at 2.5K ops/sec.]
[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
8. Cassandra
• Built at Facebook by former Dynamo engineers
• Open sourced to Apache in 2009
• DHT with consistent hashing
• MD5 hash of key
• Multiple nodes handle segments of the ring for load balancing
• Dynamo distribution and replication model + BigTable storage model
[Figure: Commit Log, Memtable, SSTables]
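The ring placement sketched on this slide can be illustrated in Python (a toy sketch with assumed names like `ConsistentHashRing`; Cassandra's actual partitioner and replication logic are more involved):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy DHT ring: each node owns a token position, and a key is
    routed to the first node whose token is >= the key's MD5 token,
    wrapping around the ring."""

    def __init__(self):
        self.tokens = []    # sorted node tokens
        self.nodes = {}     # token -> node name

    @staticmethod
    def token(key):
        # Tokens are derived from an MD5 hash of the key
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, name):
        t = self.token(name)
        bisect.insort(self.tokens, t)
        self.nodes[t] = name

    def lookup(self, key):
        t = self.token(key)
        # First token at or after t; wrap to the start if past the end
        i = bisect.bisect_left(self.tokens, t) % len(self.tokens)
        return self.nodes[self.tokens[i]]

ring = ConsistentHashRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
owner = ring.lookup("HostA/AgentX/AVGResponse")
```

The appeal of this scheme for load balancing is that adding or removing a node only remaps the keys between adjacent tokens, not the whole key space.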
9. Cassandra and SSDs
• Improve performance by either adding nodes or improving per-node performance
• Node performance is directly dependent on the disk I/O performance of the system
• Cassandra stores two entities on disk:
• Commit Log
• SSTables
• Should SSDs be used to store both?
• We evaluated each possible configuration
10. Experiment Setup
• Server specification:
• 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD
• Apache Cassandra 1.10
• Used the YCSB benchmark
• 100M rows, 50GB total raw data, 'latest' distribution
• 95% read, 5% write
• Minimum of three runs per workload, fresh data on each run
• Broken into phases:
• Data load
• Fragmentation
• Cache warm-up
• Workload (> 12h process)
11. SSD vs. HDD
• Location of log is irrelevant
• Location of data is important
• Dramatic performance improvement of SSD over HDD
• SSD benefits from high parallelism

Configuration  # of clients  # of threads/client  Location of Data  Location of Commit Log
C1             1             2                    RAID (HDD)        RAID (HDD)
C2             1             2                    RAID (HDD)        SSD
C3             1             2                    SSD               RAID (HDD)
C4             1             2                    SSD               SSD
C5             4             16                   RAID (HDD)        RAID (HDD)
C6             4             16                   SSD               SSD

[Fig. 4: throughput/latency results for HDD vs SSD and disk full vs disk empty: (a) HDD vs SSD throughput and (b) HDD vs SSD latency for configurations C1-C6; (c) throughput at 99% fill. Paper excerpt: HDD remains useful for the bulk of data that is infrequently accessed, in part because SSD performance degrades with higher fill ratios, as seen in Figure 4(c).]
12. SSD vs. HDD (II)
• SSD offers more than 7x improvement in throughput on an empty disk
• SSD performance degrades by half as the storage device fills up
• Filling the SSD or running it near capacity is not advisable
[Fig. 4 (continued): (c) 99% fill HDD vs SSD throughput and (d) 99% fill HDD vs SSD latency, empty disk vs full disk. Paper excerpt: a larger portion of the hot data is cached on the SSD; this configuration stored more than twice the amount of data than an in-memory cache alone, achieving a cache-hit ratio of more than 85%. When a read reaches the server for a row not resident in the off-heap memory cache, only a single SSD seek is required to fulfill the request.]
13. SSD vs. HDD: Summary
• Cassandra benefits most when storing data on SSD (not the log)
• Location of the commit log is not important
• SSD performance is inversely proportional to fill ratio
• Storing all data on SSD is uneconomical
• Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly
• SSDs have a limited lifetime (10-50K write-erase cycles), so they need replacing more frequently
• Rabl et al. [1] show adding a node is 100% costlier, with 100% throughput improvement
• Build a hybrid system to get comparable performance for marginal cost
14. Cassandra: Read + Write Path
• Write path is fast:
1. Write update into commit log
2. Write update into Memtable
• Memtables flush to SSTables asynchronously when full
• Never blocks writes
• Read path can be slow:
1. Read key-value from Memtable
2. Read key-value from each SSTable on disk
3. Construct merged view of row from each input source
• Each read needs to do O(# of SSTables) I/O
[Figure: updates go to the commit log and Memtable in memory; reads merge the Memtable with the SSTables on disk]
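The merged-read step above can be sketched as follows (a simplified Python sketch over assumed in-memory structures; real SSTables live on disk with per-column timestamps and indexes):

```python
# Each source maps row key -> {column: (timestamp, value)}; the merged
# row keeps the newest version of every column across all sources.

def merge_row(key, memtable, sstables):
    """Merge all fragments of `key`, latest write per column wins.
    Every SSTable that may contain the key must be consulted, which is
    why reads cost O(# of SSTables) in I/O."""
    merged = {}
    for source in [memtable] + sstables:          # one lookup per source
        for col, (ts, val) in source.get(key, {}).items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

memtable = {"row1": {"Value": (3, 9)}}            # newest write
sstables = [
    {"row1": {"Value": (1, 4), "Max": (1, 6)}},   # older fragment
    {"row1": {"Min": (2, 1)}},                    # another fragment
]
row = merge_row("row1", memtable, sstables)
# row == {"Value": 9, "Max": 6, "Min": 1}
```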
15. Cassandra: SSTables
• Cassandra allows blind writes
• Row data can be fragmented over multiple SSTables over time
• Bloom filters and indexes can potentially help
• Ultimately, multiple fragments need to be read from disk
[Example row fragmented across SSTables: Employee ID 99231234, First Name Prashanth, Last Name Menon, Age 25, Department ID MSRG]
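How a Bloom filter lets a read skip SSTables can be sketched like this (a minimal, hypothetical implementation; Cassandra's filters use different hash functions and sizing):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: an SSTable's filter answers
    "definitely not present" or "maybe present", letting the read path
    skip SSTables that cannot contain the key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                        # bit array as an int

    def _positions(self, key):
        # Derive several bit positions from salted MD5 digests
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => key is definitely absent; True => key may be present
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("99231234")
hit = bf.might_contain("99231234")   # True: this SSTable must be read
```

False positives are possible, so a `True` answer still requires the disk read; this is why the slide says the filters only "potentially" help.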
16. Cassandra: Row Cache
• Row cache buffers the full merged row in memory
• Cache miss follows the regular read path, constructs the merged row, and brings it into the cache
• Makes the read path faster for frequently accessed data
• Problem: Row cache occupies memory
• Takes away precious memory from the rest of the system
• Extend the row cache efficiently onto SSD
[Figure: read/update path with the in-memory row cache in front of the Memtable and SSTables]
17. Extended Row Cache
• Extend the row cache onto SSD
• Chained with the in-memory row cache
• LRU in memory, overflow onto LRU SSD row cache
• Implemented as append-only cache files
• Efficient sequential writes
• Fast random reads
• Zero I/O for a hit in the first-level row cache
• One random I/O on SSD for the second-level row cache
[Figure: first-level row cache and second-level cache index in memory; second-level row cache on SSD; Memtable, commit log, and SSTables as before]
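The chained cache above can be sketched as follows (a hypothetical, simplified Python sketch; `TwoLevelRowCache` is our name, and the real implementation manages multiple cache files, invalidation, and an off-heap first level):

```python
import os
import tempfile
from collections import OrderedDict

class TwoLevelRowCache:
    """Chained row cache sketch: an LRU dict in RAM overflows into an
    append-only file standing in for the SSD cache. An in-memory index
    maps key -> (offset, length), so a second-level hit costs exactly
    one random read."""

    def __init__(self, ram_capacity, path):
        self.ram = OrderedDict()            # first level (LRU)
        self.ram_capacity = ram_capacity
        self.index = {}                     # key -> (offset, length)
        self.file = open(path, "a+b")       # append-only cache file

    def put(self, key, row: bytes):
        self.ram[key] = row
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_capacity:
            old_key, old_row = self.ram.popitem(last=False)
            self.file.seek(0, 2)            # sequential append on overflow
            offset = self.file.tell()
            self.file.write(old_row)
            self.index[old_key] = (offset, len(old_row))

    def get(self, key):
        if key in self.ram:                 # first level: zero I/O
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.index:               # second level: one random read
            offset, length = self.index[key]
            self.file.seek(offset)
            return self.file.read(length)
        return None                         # miss: fall back to read path

path = os.path.join(tempfile.mkdtemp(), "rowcache.bin")
cache = TwoLevelRowCache(ram_capacity=2, path=path)
cache.put("a", b"row-a")
cache.put("b", b"row-b")
cache.put("c", b"row-c")    # evicts "a" from RAM onto the cache file
```

The append-only layout matches the slide's point: evictions are cheap sequential writes, and SSDs make the subsequent single random read fast.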
18. Evaluation: SSD Row Cache
• Setup:
• 100M rows, 50GB total data, 6GB row cache
• Results:
• 75% improvement in throughput
• 75% improvement in latency
• RAM-only cache has too low a hit ratio
[Fig. 5(a, b): row cache throughput and latency at 95%, 85% and 75% reads, comparing cache disabled, RAM, and RAM+SSD. Paper excerpt: in normal operation, data sizes averaged 6.8GB compressed after the initial load of 40 million keys; with the modified Cassandra, data sizes averaged 6.01GB, a savings of roughly 10%. This saving grows as the number of columns in the table grows and as column names grow in length.]
19. Dynamic Schema
• Key-value stores covet a schema-less data model
• Very flexible, good for highly varying data
• Schemas often change; defining them up front can be detrimental
• Observation: many big data applications have relatively stable schemas
• e.g., click stream, APM, sensor data, etc.
• Redundant schemas have significant overhead in I/O and space usage

On-disk format (schema repeated with every row):
Metric Name: HostA/AgentX/AVGResponse  Timestamp: 1332988833  Value: 4  Max: 6  Min: 1
Metric Name: HostA/AgentX/AVGResponse  Timestamp: 1332988848  Value: 5  Max: 7  Min: 1
Metric Name: HostA/AgentX/Failures     Timestamp: 1332988849  All: 4    Warn: 3  Error: 1

Application format:
Metric Name                Timestamp   Value  Max  Min
HostA/AgentX/AVGResponse   1332988833  4      6    1
20. Dynamic Schema (III)
• Don't serialize the redundant schema with rows
• Extract the schema from the data, store it on SSD, and serialize only a schema ID with the data
• Allows for a large number of schemas

Schema catalogue (on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New disk format (schema ID + values only):
S1  HostA/AgentX/AVGResponse  1332988833  4  6  1
S1  HostA/AgentX/AVGResponse  1332988848  5  7  1
S2  HostA/AgentX/Failures     1332988849  4  3  1
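The catalogue idea can be sketched in Python (illustrative only; the paper's implementation serializes these structures into Cassandra's on-disk format, with the catalogue kept on SSD):

```python
class SchemaCatalogue:
    """Dynamic schema sketch: identical column layouts are interned
    once in a catalogue, and each row serializes only a small schema ID
    plus its values instead of repeating every column name."""

    def __init__(self):
        self.schemas = {}      # tuple of column names -> schema id
        self.by_id = {}        # schema id -> tuple of column names

    def intern(self, columns):
        key = tuple(columns)
        if key not in self.schemas:
            sid = f"S{len(self.schemas) + 1}"
            self.schemas[key] = sid
            self.by_id[sid] = key
        return self.schemas[key]

    def encode(self, record):
        sid = self.intern(record.keys())
        return (sid, *record.values())     # schema ID + values only

    def decode(self, row):
        sid, *values = row
        return dict(zip(self.by_id[sid], values))

cat = SchemaCatalogue()
r1 = cat.encode({"Metric Name": "HostA/AgentX/AVGResponse",
                 "Timestamp": 1332988833, "Value": 4, "Max": 6, "Min": 1})
r2 = cat.encode({"Metric Name": "HostA/AgentX/Failures",
                 "Timestamp": 1332988849, "All": 4, "Warn": 3, "Error": 1})
# r1 carries "S1" and r2 carries "S2"; each set of column names is stored once
```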
21. Evaluation: Dynamic Schema
• Setup:
• 40M rows, 5-10 variable columns (638 schemas), 6GB row cache
• Results:
• 10% reduction in disk usage (6.8GB vs 6GB)
• Slightly improved throughput, stable latency
• Effective SSD usage (only random reads) & reduced I/O and space usage
[Fig. 5(c, d): dynamic schema throughput and latency at 95%, 50% and 5% reads, regular vs dynamic schema]
22. Conclusions
• Storing Cassandra commit logs on SSD doesn't help
• Running SSDs at capacity degrades their performance
• Using SSDs as a secondary row cache dramatically improves performance
• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
23. Thanks!
• Questions?
• Contact:
• Prashanth Menon (prashanth.menon@utoronto.ca)
24. Future Work
• What types of tables benefit most from a dynamic schema?
• Impact of compaction on read-heavy workloads
• How can SSDs be used to improve the performance of compaction?
• How is performance when storing only SSTable indexes on SSD?