Even though we have abandoned Cassandra in all our products, we would like to share our work here.
Why did we abandon Cassandra? Because:
(1) There are serious flaws in Cassandra's implementation, especially in its local storage engine layer, i.e. the SSTable and indexing.
(2) Combining Bigtable and Dynamo is a design mistake. Dynamo's hash-ring architecture is an obsolete technology for scaling, and its consistency and replication policy is also unusable for big-data storage.
Cassandra Compression and Performance Evaluation
1. Cassandra Performance Evaluation with Compression
Schubert Zhang, May 2010
schubert.zhang@gmail.com
The current implementation of Cassandra's storage layer and indexing mechanism
only allows compression at the row level.
Column Family Row Serialization Structure:

1. The old structure:

  bloom filter    : Len (int), HashCount (int), BitSet
  index size      : int
  index of block 0: FirstColumnName (Len(short)+name), LastColumnName (Len(short)+name),
                    Offset (long, 0 for first block), Block Width (long)
  index of block 1: ...
  deletion meta   : localDeletionTime (int), markedForDeleteAt (long)
  column count    : int
  column block 0 (uncompressed): Column0, Column1, Column2, Column3, ...
  column block 1 (uncompressed): ...

  Each column: deleteMark (bool), timestamp (long), value (byte[])
2. The new structure (to support compression):

The new structure accommodates both the old (uncompressed) and the new (compressed)
formats.

format (int): -1 (old format), 0 (new, LZO compressed), 1 (new, GZ compressed), 2 (new, uncompressed)
  bloom filter    : Len (int), HashCount (int), BitSet
  deletion meta   : localDeletionTime (int), markedForDeleteAt (long)
  column count    : int
  column block 0 (compressed or not): Column0, Column1, Column2, Column3, ...
  column block 1 (compressed or not): ...
  index size      : int
  index of block 0: FirstColumnName (Len(short)+name), LastColumnName (Len(short)+name),
                    Offset (long, 0 for first block), Block Width (long), Size on Disk (int)
  index of block 1: ...
  index size'     : int

  Each column: deleteMark (bool), timestamp (long), value (byte[])
If the first int (format) is -1, the structure that follows is the same as "the
old structure", except that the "index of block" entries use the new layout.
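To make the layout concrete, here is a minimal, hypothetical Java sketch of writing one column block and its index entry under the new format. This is not the actual patch: the class and method names are invented for illustration, the format constants mirror the flags above, and GZ (java.util.zip) stands in for the codec; an LZO codec would plug in the same way.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class RowBlockSketch {
    // Format flags, as defined in the structure above.
    static final int OLD_FORMAT       = -1;
    static final int NEW_LZO          = 0;
    static final int NEW_GZ           = 1;
    static final int NEW_UNCOMPRESSED = 2;

    // Encode one serialized column block according to the format flag.
    static byte[] encodeBlock(byte[] rawBlock, int format) throws IOException {
        if (format == OLD_FORMAT || format == NEW_UNCOMPRESSED)
            return rawBlock;
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos); // NEW_LZO: substitute an LZO stream
        gz.write(rawBlock);
        gz.close();
        return bos.toByteArray();
    }

    // Each new-format index entry records both the uncompressed Block Width and
    // the Size on Disk, so a reader can seek to the block and size its buffers.
    static void writeIndexEntry(DataOutput out, byte[] firstName, byte[] lastName,
                                long offset, long width, int sizeOnDisk) throws IOException {
        out.writeShort(firstName.length); out.write(firstName); // FirstColumnName
        out.writeShort(lastName.length);  out.write(lastName);  // LastColumnName
        out.writeLong(offset);      // Offset (0 for the first block)
        out.writeLong(width);       // Block Width (uncompressed bytes)
        out.writeInt(sizeOnDisk);   // Size on Disk (bytes actually written)
    }
}
```

A reader first checks the leading format int: -1 falls back to the old row layout, otherwise each block is decoded with the codec the flag names.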
2. Benchmark:

1. A single node (one disk, 4 GB RAM with 3 GB for the JVM heap, 4 cores)
2. Dataset:
   ~200 bytes per column (Thrift compactly encoded; the original CSV string is ~250 bytes)
   100,000 keys
   500,000,000 columns in total
   ~5,000 columns per key on average
3. Key cache and row cache both disabled
4. The write/read client has 4 threads and executes 10,000 read operations in total.
5. Every read operation reads only the first 100 columns of the specified key (see the client sketch after this list).
6. The read performance is measured after a major compaction, i.e. with only one SSTable.
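A minimal sketch of such a read client, assuming the Thrift API of the Cassandra 0.6 line and placeholder keyspace, column-family, and key names ("Keyspace1", "Standard1", "key0001"); the real benchmark driver is multi-threaded and is not shown here:

```java
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class SliceReadClient {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Slice the first 100 columns of one row: empty start/finish means
        // "from the beginning of the row", count caps the slice size.
        SliceRange range = new SliceRange(new byte[0], new byte[0], false, 100);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(range);

        List<ColumnOrSuperColumn> columns = client.get_slice(
                "Keyspace1", "key0001", new ColumnParent("Standard1"),
                predicate, ConsistencyLevel.ONE);
        System.out.println("read " + columns.size() + " columns");

        socket.close();
    }
}
```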
Compression Performance Matrix:

  Criteria                           Uncompressed (Default)  Compressed (GZ)  Compressed (LZO)
  Size    Disk Space                 104.545 GB              45.067 GB        54.656 GB
          Compression Ratio          1/1                     1/2.3            1/1.9
  Compaction  Major Time (h:mm)      3:16                    5:30             3:08
  Row     Max Size (B)               1186948                 512475           624396
  Write   Throughput (ops/s)         12635                   11806            11034
          Avg Latency (ms)           0.320                   0.334            0.347
          Min Latency (ms)           0.079                   0.083            0.089
          Max Latency (ms)           19331                   5128             10227
          Local Latency (ms)         0.032                   0.033            0.037
  Read    Throughput (ops/s)         25                      28               25
          Avg Latency (ms)           159                     144              159
          Min Latency (ms)           1                       2                1
          Max Latency (ms)           1038                    1526             619
          Local Latency (ms)         159                     144              159
Note:
1. The bottleneck for writes is CPU and memory.
   a) In theory, we could get better performance with a more powerful CPU and more RAM.
   b) If the commitlog were stored on a dedicated disk, we could also get better results.
2. The bottleneck for reads is disk utilization (100%).
   a) Too many seeks.
   b) Every read needs 2 seeks to reach the row, so a read operation spends at
      least ~20 ms on disk seeks; the maximum throughput is therefore about
      50 ops/s (see the arithmetic after these notes).
   c) If the row is compressed, one additional seek within the row is needed.
3. The compression ratio improves as the average row size grows.
   a) Since our dataset is very random, the ratio is only about 1/2.
4. Compaction is CPU-bound, since compaction is single-threaded. Gzip compression
   is the slower codec.
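The read ceiling in note 2b follows from simple seek arithmetic; the ~10 ms seek time is an assumption for a commodity 7200 rpm disk, not a measured figure:

  seek time        ≈ 10 ms (assumed)
  seeks per read   = 2 (index, then row)
  latency per read ≥ 2 × 10 ms = 20 ms
  max throughput   ≤ 1000 ms / 20 ms = 50 ops/s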
3. Configuration:

  Parameter                 Value
  KeysCached                0
  DiskAccessMode            standard
  SlicedBufferSizeInKB      64
  FlushDataBufferSizeInMB   32
  FlushIndexBufferSizeInMB  8
  ColumnIndexSizeInKB       64
  MemtableThroughputInMB    128
  ConcurrentReads           16
  ConcurrentWrites          64
  CommitLogSync             periodic
  CommitLogSyncPeriodInMS   10000
Encoding + Compression:
1. The original text CSV column: ~250 bytes
2. With Thrift compact encoding: ~200 bytes
3. Encoding + compression together give a combined reduction ratio of ~1/3.
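The combined ratio follows from chaining the two stages (using the GZ ratio from the matrix above):

  encoding:     250 B → ~200 B   (×0.8)
  compression:  200 B → ~87 B    (×1/2.3, GZ)
  combined:     ~87 B / 250 B ≈ 1/2.9 ≈ 1/3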
Read Throughput/Latency vs. slice size (count of columns):
Tested on LZO-compressed data; 10,000 read operations executed in total.

  Slice Size           50        500       5000
  Throughput (ops/s)   25        21        15
  Avg Latency (ms)     158.865   186.571   256.837
  Min Latency (ms)     1.278     5.041     60.934
  Max Latency (ms)     288.307   395.427   1223.202
4. Charts:

[Figure: Read Throughput (ops/s) vs. Slice Size (count of columns), plotting the table above]

[Figure: Read Latency (ms) vs. Slice Size (count of columns), Avg/Min/Max curves, plotting the table above]
Read Throughput/Latency with KeyCache, mmap, etc.:
Tested on LZO-compressed data with the benchmark above; 10,000 read operations
executed in total.

  Criteria             KeyCache=100%         KeyCache=0                  KeyCache=0
                       DiskAccess=standard   DiskAccess=mmap_index_only  DiskAccess=mmap
  Throughput (ops/s)   40                    40                          84
  Avg Latency (ms)     100.522               101.762                     47.342
  Min Latency (ms)     1.566                 1.453                       1.270
  Max Latency (ms)     278.975               267.120                     239.816
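Why mmap roughly doubles throughput here: in standard mode every access pays a seek()+read() syscall pair, while a mapped file is read with plain memory loads served from the OS page cache. A minimal illustrative Java sketch (standard NIO, not Cassandra's actual reader code; class and method names are invented):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class DiskAccessSketch {
    // DiskAccessMode=standard: an explicit seek + read syscall per access.
    static byte[] readStandard(RandomAccessFile raf, long offset, int len) throws Exception {
        byte[] buf = new byte[len];
        raf.seek(offset);
        raf.readFully(buf);
        return buf;
    }

    // DiskAccessMode=mmap: map once, then each access is a memory copy out of
    // the page cache with no per-read syscall.
    static byte[] readMapped(MappedByteBuffer map, int offset, int len) {
        byte[] buf = new byte[len];
        ByteBuffer view = map.duplicate(); // independent position, same mapping
        view.position(offset);
        view.get(buf);
        return buf;
    }

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        MappedByteBuffer map = raf.getChannel().map(
                FileChannel.MapMode.READ_ONLY, 0, raf.length());
        System.out.println(readStandard(raf, 0, 16).length);
        System.out.println(readMapped(map, 0, 16).length);
        raf.close();
    }
}
```

mmap_index_only maps only the index files, which is why it behaves like standard mode on this data-file-bound workload.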
[Figure: Read Throughput (ops/s) by configuration: KeyCache_standard 40, mmap_index_only 40, mmap 84]
However, over a long evaluation run, performance with mmap is unstable; the
following evaluation executed 1,000,000 read operations. The instability may be
caused by GC.