1. DaStor Evaluation Report
for CDR Storage & Query
Making data alive! (DaStor is based on Cassandra, a flawed project.)
Schubert Zhang
Big Data Engineering Team
Oct. 28, 2010
2. Testbed
• Hardware
  – Cluster with 9 nodes
    • 5 nodes
      – DELL PowerEdge R710
      – CPU: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz, cache size = 8192 KB
      – Cores: 2x 4-core CPUs, HyperThreading => 16 logical cores
      – RAM: 16GB
      – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
    • 4 nodes
      – DELL PowerEdge 2970
      – CPU: Quad-Core AMD Opteron(tm) Processor 2378, cache size = 512 KB
      – Cores: 2x 4-core CPUs => 8 cores
      – RAM: 16GB
      – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
  – Totally: 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB)
  – Network: within a single 1Gbps switch
• Linux: RedHat EL 5.3, Kernel = 2.6.18-128.el5
• File System: Ext3
• JDK: Sun Java 1.6.0_20-b02
• The existing testbed and configuration are not ideal for performance. Preferred:
  – Commit log on a dedicated hard disk
  – File system: XFS/EXT4
  – More memory to cache more indexes and metadata
3. DaStor Configuration
• Release version: 1.6.6-001
• Memory Heap Quota: 10GB
• CommitLog and data storage share the same 2TB volume (RAID0) with the Linux OS.
• The important performance-related parameters:

  Max Heap Size                  10GB
  Memtable Size                  1GB *
  Index Interval                 128
  Key Cache Capacity             100000
  Replication Factor             2
  CommitLog Segment Size         128MB
  CommitLog Sync Period          10s
  Concurrent Writers (Threads)   32
  Concurrent Readers (Threads)   16
  Cell Block Size                64KB
  Consistency Check              false
  Concurrent Compaction *        false
4. Data Schema for CDR

  [Diagram] Key (User ID) -> Date(Day) as Bucket (20101020) ... Date(Day) as Bucket (20101024)
            Each bucket holds that user's CDR cells for the day, sorted by timestamp.

• Schema
  – Key: the User ID (phone number), string
  – Bucket: the date (day) name, string
  – Cell: a CDR, in Thrift (or Protocol Buffers) compact encoding
• Semantics
  – Each user's everyday CDRs are sorted by timestamp and stored together.
• Stored Files
  – The SSTable files are separated by buckets.
• Data Patterns
  – A short set of temporal data that tends to be volatile.
  – An ever-growing set of data that rarely gets accessed.
• Flexible and applicable to various CDR structures.
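The schema above can be sketched as a two-level map. This is an illustrative Python model, not the DaStor API; all names here (`put_cdr`, `get_cdrs`) are hypothetical:

```python
import bisect

# store[user_id][bucket] -> list of (timestamp, cdr_bytes), kept sorted by timestamp
store = {}

def put_cdr(user_id, bucket, timestamp, cdr_bytes):
    """Insert one CDR under (key=user_id, bucket=day), ordered by timestamp."""
    cells = store.setdefault(user_id, {}).setdefault(bucket, [])
    bisect.insort(cells, (timestamp, cdr_bytes))  # keep cells sorted on insert

def get_cdrs(user_id, bucket):
    """Return all CDRs of one user for one day, in timestamp order."""
    return store.get(user_id, {}).get(bucket, [])

# CDRs may arrive out of order within a day; buckets keep days separate.
put_cdr("13900000001", "20101020", 1287550000, b"call-b")
put_cdr("13900000001", "20101020", 1287540000, b"call-a")
put_cdr("13900000001", "20101024", 1287890000, b"call-c")
```

Because buckets are the outer partition of the files on disk, a whole day can be flushed, compacted, or reclaimed independently of other days.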
5. Storage Architecture

  [Diagram] A write for a key (bkt1, bkt2, bkt3) is first appended, binary-serialized,
  to the Commit Log (on a dedicated disk), then applied to the Memtable of its bucket:
  Memtable (bkt1), Memtable (bkt2), Memtable (bkt3). A Flush is triggered by data size
  or lifetime, writing each Memtable out as an Index file and a Data file on disk. The
  Data file stores <Size> <Index> <Serialized Cells> per row; the Index file keeps
  sparse <Key, Offset> entries (K128, K256, K384, ...). A Bloom filter of keys and the
  sparse indexes are held in memory.

• The storage architecture draws on relevant techniques of Google and other databases.
• It is similar to Bigtable, but its indexing scheme is different.
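The write path in the diagram can be sketched as follows. This is a toy Python model of the log-then-memtable-then-flush flow, not DaStor code; the tiny flush threshold and all names are hypothetical, and the lifetime-based flush trigger is omitted:

```python
import pickle

COMMIT_LOG = []   # stand-in for the append-only, binary-serialized commit log file
MEMTABLES = {}    # bucket -> {key: cells}
SSTABLES = []     # flushed (bucket, rows-sorted-by-key) pairs
FLUSH_SIZE = 3    # flush trigger by data size (kept tiny for illustration)

def write(bucket, key, cell):
    # 1) append the mutation to the commit log first (durability)
    COMMIT_LOG.append(pickle.dumps((bucket, key, cell)))
    # 2) then apply it to the memtable of its own bucket
    mt = MEMTABLES.setdefault(bucket, {})
    mt.setdefault(key, []).append(cell)
    # 3) flush triggered by data size (lifetime trigger omitted here)
    if len(mt) >= FLUSH_SIZE:
        flush(bucket)

def flush(bucket):
    """Write one bucket's memtable out as an SSTable: rows sorted by key."""
    mt = MEMTABLES.pop(bucket, {})
    SSTABLES.append((bucket, sorted(mt.items())))

write("20101020", "13900000002", b"cdr1")
write("20101020", "13900000001", b"cdr2")
write("20101020", "13900000003", b"cdr3")   # third key triggers a flush
```

Per-bucket memtables are what make the "SSTable files separated by buckets" layout of the previous slide possible.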
6. Indexing Scheme
• Index Level-1: Consistent hashing.
  – h(key) places each key on the ring; a range of the hash space maps to a node
    (nodes A..F in the diagram, N=3 replicas).
• Index Level-2: Sparse block index, per SSTable.
  – A sorted key-to-position map (K0, K128, K256, K384, ...) into the data rows,
    searched B-Tree style by binary search.
  – Key interval = 128 (changeable); stored in the Index file [on disk, cachable].
  – A Bloom filter of the keys on each SSTable and a KeyCache [in memory] filter
    out needless disk lookups.
• Index Level-3: Cells index on each row.
  – A sorted map with a Bloom filter, mirroring the row's cells as 64KB
    (changeable) blocks: Block 0, Block 1, ..., Block N.
• Index Level-4: Block index.
  – Maps each cells block to its position (Cells Block 0 -> Position, Cells
    Block 1 -> Position, ...), searched by binary search.
• Totally 4 levels of indexing; the indexes are relatively small.
• Very fit for storing per-individual data, such as per-user data.
• Good for CDR data serving.
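The Level-2 sparse index can be sketched as below: only every 128th key is kept in memory, a binary search finds the nearest sampled key at or before the target, and at most one interval of keys is then scanned "on disk". Illustrative Python, with hypothetical names; real DaStor reads offsets from the Index file rather than positions in a list:

```python
from bisect import bisect_right

INDEX_INTERVAL = 128  # matches "key interval = 128" above

def build_sparse_index(sorted_keys):
    """Sample every 128th key with its position, like the in-memory sparse index."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), INDEX_INTERVAL)]

def lookup(sorted_keys, sparse_index, target):
    """Binary-search the sparse index, then scan at most 128 keys forward."""
    sampled = [k for k, _ in sparse_index]
    i = bisect_right(sampled, target) - 1
    if i < 0:
        return None                     # target sorts before the first key
    start = sparse_index[i][1]
    for pos in range(start, min(start + INDEX_INTERVAL, len(sorted_keys))):
        if sorted_keys[pos] == target:
            return pos
    return None                         # not in this SSTable

keys = ["k%06d" % n for n in range(1000)]   # 1000 keys, already sorted
idx = build_sparse_index(keys)              # only 8 sampled entries for 1000 keys
```

This is why the indexes stay relatively small: memory cost grows with (key count / interval), at the price of a short bounded scan per lookup.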
7. Benchmark for Writes
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-numbers and puts
  them into DaStor one by one.
  – Key space: 50 million
  – Size of a CDR: ~200 bytes in Thrift compact encoding

  [Charts: write throughput of one node and of the cluster (9 nodes)]

• Throughput: average ~80K ops/s; per-node average ~9K ops/s
• Latency: average ~0.5ms
• Bottleneck: network (and memory)
8. Benchmark for Writes (cluster overview)

  [Chart: cluster-wide write throughput over time]

• The wave pattern is caused by: (1) GC, (2) compaction.
9. Benchmark for Reads
• Each node runs 8 clients (threads), 72 clients in total.
• Each client randomly picks a user-id/phone-number out of the 50-million space
  and gets its most recent 20 CDRs (one page) from DaStor.
• All clients read CDRs of the same day/bucket.
------------------------------------------------------------------------------------
• The 1st run:
  – Before compaction.
  – On average 8 SSTables per node for each day.
• The 2nd run:
  – After compaction.
  – Only one SSTable per node for each day.
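The "recent 20 CDRs" query maps naturally onto the schema, since a user's cells within a bucket are already timestamp-sorted: take the tail of the cell list and reverse it. A minimal sketch in Python (hypothetical names, not the DaStor client API):

```python
PAGE_SIZE = 20  # one page, as in the benchmark

def read_recent_page(cells, page_size=PAGE_SIZE):
    """cells: list of (timestamp, cdr) sorted ascending by timestamp.
    Return the newest `page_size` CDRs, newest first (one result page)."""
    return list(reversed(cells[-page_size:]))

# one user's bucket for one day: 30 CDRs with timestamps 100..129
day_cells = [(t, "cdr-%d" % t) for t in range(100, 130)]
page = read_recent_page(day_cells)
```

Before compaction this tail read must consult up to 8 SSTables per day and merge; after compaction a single SSTable serves it, which is what drives the throughput gap between the two runs.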
10. Benchmark for Reads (before compaction)

  [Charts: read throughput of one node and of the cluster (9 nodes); histogram of
  the percentage of read ops per latency bucket, x-axis in 100ms steps]

• Throughput: average ~140 ops/s; per-node average ~16 ops/s
• Latency: average ~500ms, 97% < 2s (SLA)
• Bottleneck: disk IO (random seeks); CPU load is very low
11. Benchmark for Reads (after compaction)

  [Charts: read throughput of one node and of the cluster (9 nodes); histogram of
  the percentage of read ops per latency bucket, x-axis in 100ms steps]

• Compaction of ~8 SSTables (~200GB) took 1:40 on a 16-core node and 2:25 on an
  8-core node.
• Throughput: average ~1.1K ops/s; per-node average ~120 ops/s
• Latency: average ~60ms, 95% < 500ms (SLA)
• Bottleneck: disk IO (random seeks); CPU load is very low
13. Experiences
• A large Memtable reduces the frequency of regular compaction.
  – We found that 1GB works well.
• A large key space requires more memory, because of more key indexes.
  – Especially for the key cache.
  – Memory-mapped index files help.
• Compaction is sensitive to the number of CPU cores and the L2/L3 cache.
  – On a 16-core node, a large (e.g. 200GB) compaction may take 100 minutes.
  – On an 8-core node, a large (e.g. 200GB) compaction may take 150 minutes.
  – A long compaction may leave many small SSTables behind, which reduces read
    performance.
  – We now support concurrent compaction.
• Number of CPU cores, L2/L3 cache, disks, RAM size:
  – CPU cores, L2/L3 cache: writes and compaction
  – Disks: random seeks and reads
  – RAM: Memtables for writes, index caches for random reads
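Compaction itself is a k-way merge of key-sorted SSTables into one. The sketch below shows the idea in Python using a streaming heap merge; it is illustrative only (real compaction streams from disk, handles tombstones and versions, etc.), and all names are hypothetical:

```python
import heapq
from itertools import groupby

def compact(sstables):
    """Merge several SSTables (lists of (key, cells) sorted by key) into one.
    Rows with the same key are combined; cells stay timestamp-sorted."""
    merged = heapq.merge(*sstables)               # streaming k-way merge by key
    out = []
    for key, rows in groupby(merged, key=lambda kv: kv[0]):
        cells = []
        for _, c in rows:
            cells.extend(c)
        out.append((key, sorted(cells)))          # re-sort merged cells by timestamp
    return out

# ~8 SSTables accumulate per day before compaction; three tiny ones here
t1 = [("u1", [(1, "a")]), ("u3", [(5, "e")])]
t2 = [("u1", [(3, "c")]), ("u2", [(2, "b")])]
t3 = [("u2", [(4, "d")])]
one_table = compact([t1, t2, t3])
```

The merge is sequential IO and comparison-heavy, which is consistent with the observation above that compaction time tracks core count and cache size rather than disk seeks.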
14. Maintenance Tools
• Daily Flush Tool
  – Flushes memtables of old buckets.
  – Uses the dastor-admin tool.
• Daily Compaction Tool
  – Compacts SSTables of old buckets.
  – Uses the dastor-admin tool.
• DaStor Admin Tool (bin/dastor-admin, bin/dastor-admin-shell)
17. Developed Features
• Admin Tools
  – Configuration improvement based on config files
  – Script framework and scripts
  – Admin tools
  – CLI shell
  – WebAdmin
  – Ganglia, Jmxetric
• Compression
  – New serialization format.
  – Supports Gzip and LZO.
• Bucket mapping and reclaim
  – Mapping plug-in
  – Reclaim command and mechanism.
• Java Client API
• Concurrent Compaction
  – From a single thread to bucket-independent multi-threads.
• Scalability
  – Easy to scale out
  – More controllable
• Benchmarks
  – Writes and reads
  – Throughput and latency
• Bug fixes
18. Controllable Carrier-level Scale-out
• Existing cluster:
  (1) Available Partitioning-A
  (2) Existing buckets with data
• New machines added into the cluster, not yet online:
  (1) Available Partitioning-A
  (2) Existing buckets with data
  (3) New Partitioning-B for future buckets, but not available for now
• The added machines come online:
  (1) Available Partitioning-A
  (2) Existing buckets with data
  (3) New Partitioning-B available for service, coexisting with Partitioning-A.
      No data movement.
• As time passes, data in old buckets is reclaimed:
  (1) gone
  (2) gone
  (3) Only Partitioning-B remains, available for service
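The trick above is that the bucket (day) decides which partitioning routes a request, so old data never moves. A minimal Python sketch of that routing rule, with a naive modulo placement standing in for the real consistent-hash ring; node names and the cutover date are hypothetical:

```python
import hashlib

OLD_NODES = ["n1", "n2", "n3"]                  # Partitioning-A
NEW_NODES = ["n1", "n2", "n3", "n4", "n5"]      # Partitioning-B, after scale-out
CUTOVER_BUCKET = "20101101"                     # buckets from this day use B

def ring_node(nodes, key):
    """Toy placement: hash the key onto a node set (stand-in for the hash ring)."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[h % len(nodes)]

def route(bucket, key):
    """Old buckets keep Partitioning-A; buckets at/after the cutover use
    Partitioning-B. Both partitionings coexist, so no data is ever moved."""
    nodes = OLD_NODES if bucket < CUTOVER_BUCKET else NEW_NODES
    return ring_node(nodes, key)
```

Once every bucket older than the cutover has been reclaimed, Partitioning-A can be dropped entirely.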
19. Data Processing and Analysis for BI & DM

  [Diagram: BI & DM Apps -> Hive (QL, API, Table Meta) -> MapReduce Framework
  (InputFormat / OutputFormat) -> DaStor (Data Storage)]

• Integration with MapReduce, Hive, etc.
• Provides SQL-like and rich APIs for BI and DM.
• Built-in plug-ins for the MapReduce framework.
• Flexible data structure description and tabular management.
• The simple and flexible data model of DaStor is well suited to analysis, since
  past buckets are stable.
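The kind of batch job this enables can be sketched as a tiny map/reduce over one stable (past) bucket; this is a conceptual Python illustration, not the actual InputFormat plug-in, and the function names are hypothetical:

```python
from collections import defaultdict

def map_phase(bucket_rows):
    """Map: emit (user_id, 1) per CDR cell, as an InputFormat would feed mappers."""
    for user_id, cells in bucket_rows:
        for _ in cells:
            yield (user_id, 1)

def reduce_phase(pairs):
    """Reduce: CDR count per user for the day."""
    counts = defaultdict(int)
    for user_id, n in pairs:
        counts[user_id] += n
    return dict(counts)

# a past bucket is stable, so a batch job can scan it without racing writers
bucket = [("u1", ["c1", "c2"]), ("u2", ["c3"])]
daily_counts = reduce_phase(map_phase(bucket))
```

Because buckets are day-partitioned, each day maps naturally onto input splits, and completed days never change under the job.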
20. Further Works
• More flexible and manageable cache
  – Cache capacity/memory control.
  – Methods to load and free the cache.
• Scalability feature for operational scale
  – Version 1.6.6 + controllable scalability
• Compression improvement
  – To reduce the number of disk seeks.
• Admin tools
  – Configuration, monitoring, control ...
  – More professional and easier to use
• Client API enhancement
  – Hide the individual node to be connected.
  – More API methods.
• Flexible consistency check
• Deployment tools
  – Consider: Capistrano, PyStatus, Puppet, Chef ...
• Data Analysis
  – Hadoop
• Documents
  – API manual
  – Admin manual
• Test
  – New features
  – Performance
  – mmap ...
21. DaStor/Cassandra vs. Bigtable
• Scalability: Bigtable has better scalability.
  – The scale-out of DaStor must be controlled carefully and may affect services. It is a big trouble.
  – Bigtable scales out easily.
• Data Distribution: Bigtable's high-level partitioning/indexing scheme is more fine-grained, and therefore more effective.
  – DaStor's consistent-hash partitioning scheme is too coarse-grained, so we must cut up the bucket-level partitions. But sometimes it is not easy to make this trade-off on big data.
• Indexing: Bigtable may need less memory to hold indexes.
  – Bigtable's indexes are more general and their cost can be amortized across different users/rows, especially under data skew.
  – There is only one copy of indexes in Bigtable, even with multiple storage replicas, since Bigtable uses the GFS layer for replication (multiple copies of data, one copy of indexes).
• Local Storage Engine: Bigtable provides better read performance, with fewer disk seeks.
  – Bigtable vs. Cassandra is like InnoDB vs. MyISAM.
  – Bigtable's write/mutation performance is lower.
  – Commit Log: if GFS/HDFS supported fine-grained configuration to put an individual directory on an exclusive disk, then ...
• So, Bigtable's architecture and data model make more sense.
  – The Cassandra project is a mistake. It is a big mistake to mix Dynamo and Bigtable.
  – But in my opinion, Cassandra is only a partial Dynamo, targeted at a wrong field: data storage. It is distorted.