1. Issues and Tips for Big Data
on Cassandra
Shotaro Kamio
Architecture and Core Technology dept., DU, Rakuten, Inc. 1
2. Contents
1 Big Data Problem in Rakuten
2 Contributions to Cassandra Project
3 System Architecture
4 Details and Tips
5 Conclusion
2
4. Big Data Problem in Rakuten
[Chart: total data size by month, Jun-97 through Dec-10, growing exponentially]
• User data increases exponentially.
– More than 1 billion records.
– Doubles in size every second year (x2 every 2 years).
• We need a scalable solution to handle this big data.
4
5. Importance of Data Store in Rakuten
• Rakuten has a lot of data
– User data, item data, reviews, etc.
• Expect connectivity to Hadoop
• High-performance, fault-tolerant, scalable storage is necessary → Cassandra
[Diagram: Services A, B, C, … on top of data stores A, B, …]
5
6. Performance of New System (Cassandra)
• Store all data in 1 day
– Achieved 15,000 updates/sec with quorum.
– 50 times faster than the previous DB.
• Good read throughput
– Handles more than 100 read threads at a time.
[Chart: updates/sec, previous DB vs. new system (x50)]
6
8. Contributions to Cassandra Project
• Tested 0.7.x - 0.8.x
• Bug reports / feedback to JIRA
– CASSANDRA-2212, 2297, 2406, 2557, 2626 and more
– Bugs related to specific conditions, secondary indexes, and large datasets.
• Contributed patches
– Discussed in later slides.
8
9. JIRA: Overflow in bytesPastMark(..)
• https://issues.apache.org/jira/browse/CASSANDRA-2297
• Hit the error on a row larger than 60 GB
– The row belongs to a column family of super column type.
• The bytesPastMark method was fixed to return a long value.
9
10. JIRA: Stack overflow while compacting
• https://issues.apache.org/jira/browse/CASSANDRA-2626
• A long series of compactions causes a stack overflow.
– This occurs with large datasets.
• We helped debug the issue.
10
11. Challenges in OSS
• Not well tested with real big data.
→ Rakuten can give a lot of feedback to the community.
– Bug reports, patches, and communication.
• The OSS becomes much more stable.
11
12. Contribution of Patches
• Column name aliasing
– Encodes column names in a compact way.
– Useful for reducing data size of structured (relational) data.
– Reduces SSTable size by 15%.
• Variable-length quantity (VLQ) compression
– Reduces encoding overhead in columns.
– Reduces SSTable size by 17%.
12
13. VLQ Compression Patch
• The serializer is changed to use VLQ encoding.
• A typical column has fixed lengths of:
– 2 bytes for column name length
– 1 byte for flags
– 8 bytes for TTL / deletion time
– 8 bytes for timestamp
– 4 bytes for value length.
• These encoding overheads are reduced.
13
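The deck does not show the patch's exact byte layout, so the following is a minimal sketch of the general VLQ idea it relies on, assuming the standard base-128 scheme: each byte carries 7 payload bits and uses the high bit as a continuation flag, so small values (typical lengths and flags) need far fewer bytes than a fixed-width field.

```java
import java.io.ByteArrayOutputStream;

// Sketch of base-128 VLQ (varint) encoding: small values take fewer
// bytes than a fixed-width long. Not the patch's actual code.
public class VlqDemo {
    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift; // low-order group first
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // A column name length of 12 fits in 1 byte instead of a fixed 2.
        System.out.println(VlqDemo.encode(12).length);  // 1
        System.out.println(VlqDemo.encode(300).length); // 2
    }
}
```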
15. System Architecture
[Diagram: a data feeder batch-loads data from the service DBs into two Cassandra clusters (Cassandra 1 and Cassandra 2); services read from the clusters, and backup storage sits alongside.]
15
16. System Architecture
(Diagram repeated from slide 15.)
16
17. Planning: Schema Design
• Data modeling is a key to scalability.
• Design the schema
– Query patterns for super columns and normal columns.
• Think about queries based on use cases.
– Batch operations reduce the number of requests, because Thrift has communication overhead.
• Secondary index
– We used it to find updated data.
• Choose the partitioner appropriately.
– One partitioner per cluster.
17
18. Secondary Index
• Pros
– Useful for queries based on a column value.
– Can reduce consistency problems.
– For example, to query updated data based on update time.
• Cons
– Performance of a complex query depends on the data.
E.g., Year == 2011 and Price < 100
18
19. A Bit of Detail on Secondary Index
Works like a hash + filters:
1. Pick up a row which has a key for the index (hash).
2. Apply filters.
– Collect the row if all filters match.
3. Repeat until the requested number of rows is obtained.
E.g., Year == 2011 and Price < 100
Key1: Year = 2011
Key2: Year = 2011, Price = 1,000
Key3: Year = 2011, Price = 10
Key4: Year = 2011, Price = 10,000
Key5: Year = 2011, Price = 200
→ Many keys have Year = 2011, but only a few match the filters.
19
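The "hash + filters" behavior above can be sketched as follows. The row data mirrors the slide's Key1–Key5 example; the method names and data layout are illustrative, not Cassandra internals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of secondary-index query behavior: the index narrows rows to
// those with the indexed value (Year == 2011), then remaining predicates
// (Price < 100) are applied as filters until enough rows are collected.
public class IndexScanDemo {
    public static List<String> query(Map<String, Map<String, Integer>> rows, int limit) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : rows.entrySet()) {
            Map<String, Integer> cols = e.getValue();
            if (cols.getOrDefault("Year", 0) != 2011) continue; // index lookup
            Integer price = cols.get("Price");
            if (price == null || price >= 100) continue;        // filter
            result.add(e.getKey());
            if (result.size() >= limit) break; // stop once enough rows found
        }
        return result;
    }

    // The slide's example data: many Year=2011 keys, few filter matches.
    public static Map<String, Map<String, Integer>> sample() {
        Map<String, Map<String, Integer>> rows = new LinkedHashMap<>();
        rows.put("Key1", Map.of("Year", 2011));
        rows.put("Key2", Map.of("Year", 2011, "Price", 1000));
        rows.put("Key3", Map.of("Year", 2011, "Price", 10));
        rows.put("Key4", Map.of("Year", 2011, "Price", 10000));
        rows.put("Key5", Map.of("Year", 2011, "Price", 200));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(query(sample(), 10)); // [Key3]
    }
}
```

The loop must visit every Year=2011 row even when almost none pass the filter, which is exactly why a sparse match in a large dataset can time out (next slide).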
20. A Bit of Detail on Secondary Index (2)
• Consider how frequent the results are for the query.
– Very few results in a large dataset → the query might time out.
• Careful data/query design is necessary at this moment.
• Improvement is being discussed: CASSANDRA-2915
20
21. Planning: Data Size Estimation
• Estimate future data volume
• Serialization overhead: x 3 - 4
– Big overhead for small data.
– We improved this with custom patches (compression code).
• Cassandra 1.0 can use Snappy/Deflate compression.
• Replication: x 3 (depends on your decision)
• Compaction: x 2 or above
21
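Multiplying the slide's factors together gives a quick disk-sizing rule of thumb. The 1 TB input below is a hypothetical example; only the multipliers come from the slide.

```java
// Back-of-envelope disk sizing from the slide's multipliers:
// raw size x serialization overhead x replication factor x compaction headroom.
public class SizingDemo {
    public static double diskNeeded(double rawTB, double serialization,
                                    int replicationFactor, double compaction) {
        return rawTB * serialization * replicationFactor * compaction;
    }

    public static void main(String[] args) {
        // 1 TB raw * 3.5 serialization * RF=3 * 2x compaction headroom
        System.out.println(diskNeeded(1.0, 3.5, 3, 2.0) + " TB"); // 21.0 TB
    }
}
```

In other words, a cluster may need on the order of 20x the raw data size in disk, before counting backups.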
22. Other Factors for Data Size
• Obsolete SSTables
– Disk usage may stay high after compaction.
– Cassandra 0.8.x relies on GC to remove obsolete SSTables.
– Improved in 1.0.
• How to balance data distribution
– Disk usage can be unbalanced (ByteOrderedPartitioner).
– Partitioning, key design, initial token assignment.
– Very helpful if you know the data in advance.
• Backup scheme affects disk space
– Need backup space.
– Discussed later.
22
23. Configuration
• We adopted Cassandra 0.8.x + custom patches.
• Without mmap
– No noticeable difference in performance.
– Easier to monitor and debug memory usage and GC-related issues.
• ulimit
– Avoid file descriptor shortage: need more than the number of db files. Bug??
– “memlock unlimited” for JNA.
– Create /etc/security/limits.d/cassandra.conf (Red Hat).
23
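The limits.d file mentioned above might look like the fragment below; the user name and the nofile value are illustrative assumptions, not from the deck.

```
# /etc/security/limits.d/cassandra.conf  (illustrative values)
# File descriptors: must exceed the number of db (SSTable) files.
cassandra  -  nofile   100000
# Required so JNA can mlock memory ("memlock unlimited").
cassandra  -  memlock  unlimited
```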
24. JVM / GC
• Full GC has to be avoided at all times.
• The JVM cannot utilize a large heap over 15 GB.
– Slow GC; can be unstable.
– Don’t put too much data/cache into the heap.
– Off-heap cache is available since 0.8.1.
• Cassandra may use more memory than the heap size.
– ulimit –d 25000000 (max data segment size)
– ulimit –v 75000000 (max virtual memory size)
• Benchmarks are needed to find appropriate parameters.
24
25. Parameter Tuning for Failure Detector
• Cassandra uses the Phi Accrual Failure Detector
– “The Φ Accrual Failure Detector” [SRDS’04]
• Failure detection errors occur when a node is getting too much access and/or GC is running.
• Depends on the number of nodes:
– The larger the cluster, the larger the number.

double phi(long tnow)
{
    int size = arrivalIntervals_.size();
    double log = 0d;
    if ( size > 0 )
    {
        double t = tnow - tLast_;
        double probability = p(t);
        log = (-1) * Math.log10( probability );
    }
    return log;
}

double p(double t)
{
    double mean = mean();
    double exponent = (-1)*(t)/mean;
    return Math.pow(Math.E, exponent);
}
25
26. Hardware
• Benchmarking is important to decide on hardware.
– Requirements for performance, data size, etc.
– Cassandra is good at utilizing CPU cores.
• Network ports will be the bottleneck when scaling out…
– A large number of low-spec servers, or
– a small number of high-spec servers.
Our case:
• High-spec CPU and SSD drives
• 2 clusters (active and test cluster)
26
27. System Architecture
(Diagram repeated from slide 15.)
27
28. Customize Hector Library
• Queries can time out on Cassandra:
– When Cassandra is temporarily under high load.
– Requests for a large result set.
– Timeout of a secondary index query.
• Hector retries forever when a query times out.
– The client cannot detect the infinite loop.
• Customization:
– After 3 timeouts, return an exception to the client.
28
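The customization above amounts to bounding the retry loop. This is a hypothetical sketch of that idea, not Hector's actual API: the wrapper and exception type are stand-ins (in practice it would be Thrift's TimedOutException inside Hector's retry logic).

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a timed-out query at most a fixed number of
// times, then surface the failure to the caller instead of looping forever.
public class BoundedRetry {
    static final int MAX_ATTEMPTS = 3; // the "3 timeouts" from the slide

    public static <T> T call(Callable<T> query) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return query.call();
            } catch (Exception timeout) { // stand-in for a timeout exception
                last = timeout;
            }
        }
        throw last; // client now sees the failure instead of an infinite loop
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        try {
            call(() -> { calls[0]++; throw new RuntimeException("timeout"); });
        } catch (Exception e) {
            System.out.println("gave up after " + calls[0] + " attempts");
        }
    }
}
```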
29. System Architecture
(Diagram repeated from slide 15.)
29
30. Testing: Data Consistency Check Tool
• We wanted to make sure data is not corrupted within Cassandra.
• Made a tool to check the data consistency.
[Diagram: input data (inserts, updates, deletes; arriving periodically) is consumed by Process A, which applies the inserts, updates, and deletes to Cassandra and to another database; Process B compares the data in that database with the data in Cassandra.]
30
31. Testing: Data Consistency Check Tool (2)
• Compare only the keys of the data, not the contents.
• Useful to diagnose which part is wrong in the test phase.
• We found another team’s bug as well.
31
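The key-only comparison above can be sketched as a set difference: keys that were inserted or updated, minus deleted keys, should match exactly the keys readable from Cassandra. The method names and sample inputs are illustrative stand-ins for the tool's two data sources.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the consistency-check idea: derive the expected key set from
// the input feed, then diff it against the keys actually in Cassandra.
public class KeyCheckDemo {
    // Keys expected to exist = inserted/updated keys minus deleted keys.
    public static Set<String> expectedKeys(Set<String> upserted, Set<String> deleted) {
        Set<String> expected = new HashSet<>(upserted);
        expected.removeAll(deleted);
        return expected;
    }

    // Report keys missing from Cassandra and keys that should be gone.
    public static List<String> diff(Set<String> expected, Set<String> inCassandra) {
        List<String> problems = new ArrayList<>();
        for (String k : expected)
            if (!inCassandra.contains(k)) problems.add("missing: " + k);
        for (String k : inCassandra)
            if (!expected.contains(k)) problems.add("unexpected: " + k);
        return problems;
    }

    public static void main(String[] args) {
        // "c" was deleted in the feed but "d" still shows up in Cassandra.
        Set<String> expected = expectedKeys(Set.of("a", "b", "c"), Set.of("c"));
        System.out.println(diff(expected, Set.of("a", "b", "d")));
    }
}
```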
32. Repair
• Some types of queries don’t trigger read repair.
• Nodetool repair is tricky on big data.
– Disk usage
– Time consuming
→ Read all data afterward to trigger read repair.
• Discussion of improvements is ongoing:
– CASSANDRA-2699
32
33. System Architecture
(Diagram repeated from slide 15.)
33
34. Backup Scheme
Backup might be required to shorten recovery time.
1. Snapshot to local disk
– Plan disk size at the server estimation phase.
2. Full backup of input data
– We had full data feeds several times for various reasons:
E.g., logic changes, schema changes, data corruption, etc.
[Diagram: incoming data flows through the DBs into Cassandra; snapshots are taken to backup storage.]
34
38. We are hiring! (Mid-career positions are open!)
Rakuten’s Mission:
To empower people and society (through the Internet),
and to change and enrich society through our own success.
Rakuten’s GOAL:
To become the No. 1
Internet Service Company
in the World.
If you share Rakuten’s Mission & GOAL, please contact us!
tech-career@mail.rakuten.com
38