Issues and Tips for Big Data
       on Cassandra



                     Shotaro Kamio
Architecture and Core Technology dept., DU, Rakuten, Inc.   1
Contents


1   Big Data Problem in Rakuten


2   Contributions to Cassandra Project


3   System Architecture


4   Details and Tips


5   Conclusion




                                         2

                                      
                                                                                                         
Big Data Problem in Rakuten


• User data increases exponentially.
   – Doubles in size every two years.
• More than 1 billion records.
• We need a scalable solution to handle this big data.

[Chart: Total size by Month-Year, Jun-97 to Dec-10; annotation: x2 every 2 years]
4
Importance of Data Store in Rakuten


• Rakuten has a lot of data
   – User data, item data, reviews, etc.
• Expect connectivity to Hadoop
• High-performance, fault-tolerant, scalable
  storage is necessary → Cassandra


             Service A           Service B   Service C   …



             Data A                Data B


                                                             5
Performance of New System (Cassandra)


• Store all data in 1 day
     – Achieved 15,000 updates/sec with quorum.
     – 50 times faster than the DB.
• Good read throughput
     – Handles more than 100 read threads at a
       time.

[Chart: update throughput of DB vs. New; New reaches 15,000 updates/sec, x 50]


                                                              6
Contributions to Cassandra Project


• Tested 0.7.x - 0.8.x

• Bug reports / Feedback to JIRA
   – CASSANDRA-2212, 2297, 2406, 2557, 2626 and more
   – Bugs related to specific conditions, secondary indexes, and large
     datasets.
• Contributed patches
   – Discussed in later slides.




                                                                     8
JIRA: Overflow in bytesPastMark(..)


•   https://issues.apache.org/jira/browse/CASSANDRA-2297


• Hit the error on a row larger than 60 GB
     – The row belongs to a column family of super column type


• The bytesPastMark method was fixed to return a long value.
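
Why the return type matters: a Java int cannot hold more than 2^31 - 1 (about 2 GB), so a byte offset inside a 60 GB row wraps around when it is narrowed to int. The sketch below is not the Cassandra code. The class and method names are hypothetical and only illustrate the failure mode.

```java
// Hypothetical illustration of the overflow; not the actual Cassandra implementation.
class MarkOverflowSketch {
    // Buggy variant: narrows the difference to int, which wraps past 2^31 - 1.
    static int bytesPastMarkAsInt(long position, long mark) {
        return (int) (position - mark);
    }

    // Fixed variant: keeps the full 64-bit difference.
    static long bytesPastMarkAsLong(long position, long mark) {
        return position - mark;
    }

    public static void main(String[] args) {
        long mark = 0L;
        long position = 3L * 1024 * 1024 * 1024; // 3 GiB into a large row
        System.out.println("int:  " + bytesPastMarkAsInt(position, mark));
        System.out.println("long: " + bytesPastMarkAsLong(position, mark));
    }
}
```

With a position 3 GiB past the mark, the int variant returns a negative number while the long variant returns the true distance.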




                                                           9
JIRA: Stack overflow while compacting


•   https://issues.apache.org/jira/browse/CASSANDRA-2626


• A long series of compactions causes a stack overflow.
← This occurs with large datasets.

• Helped with debugging.




                                                           10
Challenges in OSS


• Not well tested with real big data.
→ Rakuten can give a lot of feedback to the community.
   – Bug reports, patches, and communication.
• OSS becomes much more stable.



                    Feedback




                                               11
Contribution of Patches


• Column name aliasing
  – Encodes column names in a compact way.
  – Useful for reducing data size for structured (relational)
    data.
  – Reduces SSTable size by 15%.
• Variable-length quantity (VLQ) compression
  – Reduces encoding overhead in columns.
  – Reduces SSTable size by 17%.




                                                             12
VLQ Compression Patch


• The serializer is changed to use VLQ encoding.
• A typical column has fixed-length fields of:
   –   2 bytes for column name length
   –   1 byte for flag
   –   8 bytes for TTL, deletion time
   –   8 bytes for timestamp
   –   4 bytes for length of value.
• These encoding overheads are reduced.
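
A minimal sketch of variable-length quantity encoding, to show where the saving comes from. This is not the patch itself: the class name and methods are hypothetical, and it handles non-negative values only. Each byte carries 7 bits of the value, and the high bit signals that another byte follows, so small numbers such as a name length or value length fit in 1 byte instead of 2 or 4.

```java
import java.util.Arrays;

// Hypothetical VLQ sketch (7 data bits per byte, most significant group first).
class VlqSketch {
    // Encodes a non-negative value; every byte except the last has the high bit set.
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        byte[] buf = new byte[10];
        int pos = buf.length;
        buf[--pos] = (byte) (value & 0x7F); // last byte: continuation bit clear
        value >>>= 7;
        while (value != 0) {
            buf[--pos] = (byte) ((value & 0x7F) | 0x80); // earlier bytes: continuation bit set
            value >>>= 7;
        }
        return Arrays.copyOfRange(buf, pos, buf.length);
    }

    // Decodes a value produced by encode().
    static long decode(byte[] bytes) {
        long value = 0;
        for (byte b : bytes) {
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
        }
        return value;
    }

    public static void main(String[] args) {
        long[] samples = {10, 127, 128, 300, 100000};
        for (long v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Large values such as microsecond timestamps gain little from this encoding, so the saving comes mostly from the small length and flag-like fields.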



                                               13
System Architecture


[Diagram: DB, Batch jobs, Data feeder, Cassandra 1 and Cassandra 2 clusters (each drawn as several DB nodes), Services, and Backup]

                                                   15
Planning: Schema Design


• Data modeling is key to scalability.
• Design the schema
   – Query patterns for super columns and normal columns.
• Think about queries based on use cases.
   – Use batch operations to reduce the number of requests, because Thrift has
     communication overhead.
• Secondary index
   – We used it to find updated data.
• Choose the partitioner appropriately.
   – One partitioner per cluster.




                                                                       17
Secondary Index


• Pros
   – Useful to query based on a column value.
   – It can reduce consistency problems.
   – For example, to query updated data based on update time.
• Cons
   – Performance of a complex query depends on the data.
      E.g., Year == 2011 and Price < 100




                                                                18
Secondary Index in More Detail


• Works like a hash + filters.
    1. Pick up a row which has a key for the index (hash).
    2. Apply filters.
        – Collect the row if all filters match.
    3. Repeat until the requested number of rows is obtained.

E.g., Year == 2011 and Price < 100

Key1     Year = 2011
Key2     Year = 2011       Price = 1,000
Key3     Year = 2011       Price = 10
Key4     Year = 2011       Price = 10,000
Key5     Year = 2011       Price = 200

Many keys with Year = 2011, but only a few results.
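
The hash + filter steps can be simulated on the five example rows. This is only a sketch of the behavior described on the slide, not Cassandra's implementation. The class and method names are hypothetical, and the map stands for the rows already selected by the indexed clause Year == 2011.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of "hash + filters" on the slide's example rows.
class IndexScanSketch {
    // Rows that matched the indexed clause (Year == 2011), mapped to their Price.
    // A null Price means the row has no Price column (Key1 on the slide).
    static Map<String, Integer> slideRows() {
        Map<String, Integer> rows = new LinkedHashMap<>();
        rows.put("Key1", null);
        rows.put("Key2", 1000);
        rows.put("Key3", 10);
        rows.put("Key4", 10000);
        rows.put("Key5", 200);
        return rows;
    }

    // Applies the filter Price < maxPrice row by row until `limit` rows are collected.
    // scanned[0] is incremented once for every row examined.
    static List<String> query(Map<String, Integer> priceByKey, int maxPrice, int limit, int[] scanned) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> row : priceByKey.entrySet()) {
            if (result.size() >= limit) {
                break;
            }
            scanned[0]++;
            Integer price = row.getValue();
            if (price != null && price < maxPrice) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] scanned = {0};
        List<String> hits = query(slideRows(), 100, 2, scanned);
        System.out.println("hits=" + hits + ", rows scanned=" + scanned[0]);
    }
}
```

Asking for 2 rows with Price < 100 forces a scan of every indexed row and still returns only one key, which is the cost pattern the slide warns about.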

                                                                                 19
Secondary Index in More Detail (2)


• Consider how frequently results occur for the query
     – Very few results in a large data set → the query might time
       out.
• Careful data/query design is necessary at the moment.
• An improvement is being discussed: CASSANDRA-2915




                                                             20
Planning: Data Size Estimation


• Estimate future data volume
• Serialization overhead: x 3 - 4
   – The overhead is large for small data.
   – We reduced it with custom patches and compression code.
      • Cassandra 1.0 can use Snappy/Deflate compression.
• Replication: x 3 (depends on your decision)
• Compaction: x 2 or more
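
A worked example of the estimate, assuming the factors above simply multiply (the slide does not state this explicitly). The class name and the 1 TB input are hypothetical.

```java
// Hypothetical disk estimate: raw size x serialization x replication x compaction headroom.
class DiskEstimateSketch {
    static double estimateTb(double rawTb, double serialization, int replication, double compaction) {
        return rawTb * serialization * replication * compaction;
    }

    public static void main(String[] args) {
        double rawTb = 1.0; // assumed raw data volume
        System.out.println("low:  " + estimateTb(rawTb, 3.0, 3, 2.0) + " TB");
        System.out.println("high: " + estimateTb(rawTb, 4.0, 3, 2.0) + " TB");
    }
}
```

Under these assumptions, 1 TB of raw data needs roughly 18 to 24 TB of disk across the cluster, before snapshots and backups.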




                                                            21
Other Factors for Data Size


• Obsolete SSTables
   – Disk usage may stay high after compaction.
   – Cassandra 0.8.x relies on GC to remove obsolete SSTables.
   – Improved in 1.0.

• How to balance data distribution
   – Disk usage can be unbalanced (ByteOrderedPartitioner).
   – Partitioning, key design, initial token assignment.
   – Very helpful if you know the data in advance.



• Backup scheme affects disk space
   – Backup space is needed.
   – Discussed later.
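
As an illustration of initial token assignment, the sketch below spaces tokens evenly for N nodes. It assumes RandomPartitioner, whose token space runs from 0 to 2^127. With ByteOrderedPartitioner the tokens would instead have to follow the actual key distribution, which is why knowing the data in advance helps. The class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical helper: evenly spaced initial tokens, assuming RandomPartitioner (0 .. 2^127).
class TokenSketch {
    static BigInteger[] evenTokens(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        BigInteger range = BigInteger.ONE.shiftLeft(127);
        BigInteger[] tokens = new BigInteger[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            // token_i = i * 2^127 / nodeCount
            tokens[i] = range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodeCount));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (BigInteger token : evenTokens(4)) {
            System.out.println(token);
        }
    }
}
```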
                                                                 22
Configuration


• We adopted Cassandra 0.8.x + custom patches.
• Without mmap
   – No noticeable difference in performance.
   – Easier to monitor and debug memory usage and GC-related
     issues.
• ulimit
   – Avoid file descriptor shortage. The limit needs to be higher than the
     number of db files. Bug??
   – “memlock unlimited” for JNA
   – Create /etc/security/limits.d/cassandra.conf (Red Hat)




                                                                   23
JVM / GC


• Full GC has to be avoided at all times.
• The JVM cannot make good use of a heap larger than 15 GB.
   – Slow GC. Can be unstable.
   – Don’t put too much data/cache into the heap.
   – Off-heap cache is available in 0.8.1.
• Cassandra may use more memory than the heap size.
   – ulimit -d 25000000 (max data segment size)
   – ulimit -v 75000000 (max virtual memory size)
• Benchmarks are needed to find appropriate parameters.




                                                    24
Parameter Tuning for Failure Detector


•   Cassandra uses the Phi Accrual Failure Detector
     – The Φ Accrual Failure Detector [SRDS'04]
•   Failure detection errors occur when a node is handling too much
    access and/or GC is running
•   Depends on the number of nodes:
     – Larger cluster, larger number.

    // Suspicion level: grows with the time since the last heartbeat.
    double phi(long tnow)
    {
        int size = arrivalIntervals_.size();
        double log = 0d;
        if ( size > 0 )
        {
            double t = tnow - tLast_;
            double probability = p(t);
            log = (-1) * Math.log10( probability );
        }
        return log;
    }

    // Probability that a heartbeat is still pending after t (exponential model).
    double p(double t)
    {
        double mean = mean();
        double exponent = (-1)*(t)/mean;
        return Math.pow(Math.E, exponent);
    }
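
A worked example of the formula above. Since p(t) = e^(-t/mean), phi simplifies to t / (mean × ln 10), so the silence needed to reach a given threshold is threshold × mean × ln 10. The sketch is standalone: the class name, the 1,000 ms mean heartbeat interval, and the threshold of 8 (the default of Cassandra's phi_convict_threshold setting, as far as we know) are assumptions for illustration.

```java
// Standalone sketch of the phi calculation; inputs are assumed values.
class PhiSketch {
    // phi = -log10(e^(-t/mean)) = t / (mean * ln 10)
    static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        double probability = Math.exp(-millisSinceLastHeartbeat / meanIntervalMillis);
        return -Math.log10(probability);
    }

    // Silence (in ms) after which phi reaches the given threshold.
    static double millisUntilThreshold(double threshold, double meanIntervalMillis) {
        return threshold * meanIntervalMillis * Math.log(10);
    }

    public static void main(String[] args) {
        double mean = 1000.0; // assumed mean heartbeat interval in ms
        double[] silences = {1000, 5000, 10000, 20000};
        for (double t : silences) {
            System.out.println("t=" + t + " ms -> phi=" + phi(t, mean));
        }
        System.out.println("threshold 8 reached after " + millisUntilThreshold(8.0, mean) + " ms");
    }
}
```

With these inputs a node is suspected after roughly 18 seconds of silence. A long GC pause or an overloaded node can exceed that, which is why the threshold may need raising on larger or busier clusters.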

                                                                                    25
Hardware


• Benchmarking is important for choosing hardware.
   – Requirements for performance, data size, etc.
   – Cassandra is good at utilizing CPU cores.
• Network ports will become a bottleneck when scaling out…
   – A large number of low-spec servers, or
   – A small number of high-spec servers.



     Our case:
     • High-spec CPU and SSD drives
     • 2 clusters (active and test cluster)



                                                     26
Customize Hector Library


• A query can time out on Cassandra:
   – When Cassandra is temporarily under high load.
   – When a large result set is requested.
   – When a secondary index query times out.
• Hector retries forever when a query times out.
• The client cannot detect the infinite loop.
• Customization:
   – After 3 timeouts, return an exception to the client.
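
The idea of the customization can be sketched as a bounded retry loop. This is not Hector's API or our actual patch. `BoundedRetrySketch`, `QueryTimeoutException`, and `callWithRetries` are hypothetical names, and the query is modeled as a plain `Supplier`.

```java
import java.util.function.Supplier;

// Hypothetical bounded-retry wrapper; not part of the Hector library.
class BoundedRetrySketch {
    // Stand-in for whatever exception the client raises on a timed-out query.
    static class QueryTimeoutException extends RuntimeException {
        QueryTimeoutException(String message) {
            super(message);
        }
    }

    // Runs the query, retrying on timeout, and gives up after maxTimeouts timeouts.
    static <T> T callWithRetries(Supplier<T> query, int maxTimeouts) {
        if (maxTimeouts <= 0) {
            throw new IllegalArgumentException("maxTimeouts must be positive");
        }
        QueryTimeoutException last = null;
        for (int attempt = 1; attempt <= maxTimeouts; attempt++) {
            try {
                return query.get();
            } catch (QueryTimeoutException e) {
                last = e; // remember the timeout and try again
            }
        }
        throw last; // surface the failure to the caller instead of looping forever
    }

    public static void main(String[] args) {
        try {
            BoundedRetrySketch.<String>callWithRetries(() -> {
                throw new QueryTimeoutException("simulated timeout");
            }, 3);
        } catch (QueryTimeoutException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

Returning the exception lets the service decide what to do, for example to show an error or fall back, rather than blocking on a query that may never finish.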




                                                     28
Testing: Data Consistency Check Tool


   • We wanted to make sure data is not corrupted within
      Cassandra.
   • Made a tool to check the data consistency.

[Diagram: input data (insert, update, delete) comes in periodically. Process A inserts, updates, and deletes data in Cassandra. Process B compares the data in another database with that in Cassandra.]
                                                                    30
Testing: Data Consistency Check Tool (2)


• Compares only the keys of the data, not the contents.
• Useful for diagnosing which part is wrong during the test phase.
• We found another team’s bug as well.
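
A key-only comparison can be sketched as two set differences. This is not the actual tool: the class and method names are hypothetical, and both key sets are assumed to fit in memory, which would not hold for the full data set.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical key-only consistency check between a reference database and Cassandra.
class KeyDiffSketch {
    // Keys in the reference database that are missing from Cassandra (lost inserts).
    static Set<String> missing(Set<String> referenceKeys, Set<String> cassandraKeys) {
        Set<String> diff = new TreeSet<>(referenceKeys);
        diff.removeAll(cassandraKeys);
        return diff;
    }

    // Keys in Cassandra that the reference database does not have (e.g. deletes not applied).
    static Set<String> unexpected(Set<String> referenceKeys, Set<String> cassandraKeys) {
        Set<String> diff = new TreeSet<>(cassandraKeys);
        diff.removeAll(referenceKeys);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> reference = new TreeSet<>(Arrays.asList("a", "b", "c"));
        Set<String> cassandra = new TreeSet<>(Arrays.asList("a", "c", "d"));
        System.out.println("missing:    " + missing(reference, cassandra));
        System.out.println("unexpected: " + unexpected(reference, cassandra));
    }
}
```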




                                                            31
Repair


• Some types of queries don’t trigger read repair.
• Nodetool repair is tricky on big data.
   – Disk usage
   – Time consuming
→ Read all data afterward to trigger read repair

• Discussion of improvements is ongoing:
   – CASSANDRA-2699




                                                     32
Backup Scheme

• Backup might be required to shorten recovery time.
1. Snapshot to local disk
    – Plan disk size at the server estimation phase.
2. Full backup of input data
    – We had to feed the full data several times for various reasons:
       e.g., logic change, schema change, data corruption, etc.

[Diagram: incoming data is kept as a Backup and loaded into Cassandra; Cassandra keeps multiple Snapshots.]

                                                                  34
Conclusion


• Rakuten uses Cassandra for big data.
• We’ll continue contributing to OSS.




                                          36
Finally...




Allow me a short advertisement...




                   37
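We are hiring! We are actively recruiting mid-career hires!

Rakuten's Mission

Empower people and society (through the Internet), and
transform and enrich society through our own success
Rakuten's GOAL
              To become No.1
         Internet Service Company
                in the World
If you share Rakuten's Mission & GOAL, please contact us!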

       tech-career@mail.rakuten.com
                                         38

Contenu connexe

En vedette

[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...
[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...
[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...Rakuten Group, Inc.
 
第4回楽天研究開発シンポジウム.開会挨拶
第4回楽天研究開発シンポジウム.開会挨拶第4回楽天研究開発シンポジウム.開会挨拶
第4回楽天研究開発シンポジウム.開会挨拶Rakuten Group, Inc.
 
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product DescriptionsRakuten Group, Inc.
 
RIT (Rakuten Institute of Technology) presentation about UI/UX
RIT (Rakuten Institute of Technology) presentation about UI/UXRIT (Rakuten Institute of Technology) presentation about UI/UX
RIT (Rakuten Institute of Technology) presentation about UI/UXRakuten Group, Inc.
 
Case Analysis Rakuten Ichiba
Case Analysis  Rakuten IchibaCase Analysis  Rakuten Ichiba
Case Analysis Rakuten IchibaEddie Lee
 

En vedette (6)

[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...
[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...
[Rakuten TechConf2014] [Sendai] Little look inside Global Ichiba: Ichiba Busi...
 
第4回楽天研究開発シンポジウム.開会挨拶
第4回楽天研究開発シンポジウム.開会挨拶第4回楽天研究開発シンポジウム.開会挨拶
第4回楽天研究開発シンポジウム.開会挨拶
 
Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06Hadoop at Rakuten, 2011/07/06
Hadoop at Rakuten, 2011/07/06
 
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
[RakutenTechConf2013] [C-4_2] Building Structured Data from Product Descriptions
 
RIT (Rakuten Institute of Technology) presentation about UI/UX
RIT (Rakuten Institute of Technology) presentation about UI/UXRIT (Rakuten Institute of Technology) presentation about UI/UX
RIT (Rakuten Institute of Technology) presentation about UI/UX
 
Case Analysis Rakuten Ichiba
Case Analysis  Rakuten IchibaCase Analysis  Rakuten Ichiba
Case Analysis Rakuten Ichiba
 

Similaire à Cassandra conference

art of presentation Map of Jamies Yam
art of presentation Map of Jamies Yamart of presentation Map of Jamies Yam
art of presentation Map of Jamies YamJamies Yam
 
Brocade Migration Example
Brocade Migration ExampleBrocade Migration Example
Brocade Migration Examplenigelwakefield
 
UBD Media Kit 2012
UBD Media Kit 2012UBD Media Kit 2012
UBD Media Kit 2012UnBuenDoctor
 
Report: HSE in the Oilfield
Report: HSE in the OilfieldReport: HSE in the Oilfield
Report: HSE in the OilfieldDoug Sheridan
 
International Trade Compliance Strategy Responsibility Matrix
International Trade Compliance Strategy Responsibility MatrixInternational Trade Compliance Strategy Responsibility Matrix
International Trade Compliance Strategy Responsibility MatrixGHY International
 
High stakes world of Mobile Payments
High stakes world of Mobile PaymentsHigh stakes world of Mobile Payments
High stakes world of Mobile PaymentstxtNation
 
High stakes-world-of-mobile-payments-infographic
High stakes-world-of-mobile-payments-infographicHigh stakes-world-of-mobile-payments-infographic
High stakes-world-of-mobile-payments-infographicTyson Hackwood
 
9 18 Part 2
9 18 Part 29 18 Part 2
9 18 Part 2burgerja
 
3AMIGAS - Keynote: Pjotr Van Schothorst, VStep
3AMIGAS - Keynote: Pjotr Van Schothorst, VStep3AMIGAS - Keynote: Pjotr Van Schothorst, VStep
3AMIGAS - Keynote: Pjotr Van Schothorst, VStepFOCUS K3D
 
The Content Creation Workflow of the Ship Simulator Game - A Case Study
The Content Creation Workflow of the Ship Simulator Game - A Case StudyThe Content Creation Workflow of the Ship Simulator Game - A Case Study
The Content Creation Workflow of the Ship Simulator Game - A Case StudyWolfgang Hürst
 
Crompton Way Traffic Proposal Map
Crompton Way Traffic Proposal MapCrompton Way Traffic Proposal Map
Crompton Way Traffic Proposal Mapguestf8bf20
 

Similaire à Cassandra conference (20)

art of presentation Map of Jamies Yam
art of presentation Map of Jamies Yamart of presentation Map of Jamies Yam
art of presentation Map of Jamies Yam
 
Tvr new map 2012
Tvr new map 2012Tvr new map 2012
Tvr new map 2012
 
Brocade Migration Example
Brocade Migration ExampleBrocade Migration Example
Brocade Migration Example
 
UBD Media Kit 2012
UBD Media Kit 2012UBD Media Kit 2012
UBD Media Kit 2012
 
Webster City Enterprise Zone Map
Webster City Enterprise Zone MapWebster City Enterprise Zone Map
Webster City Enterprise Zone Map
 
Report: HSE in the Oilfield
Report: HSE in the OilfieldReport: HSE in the Oilfield
Report: HSE in the Oilfield
 
Jun05 A01 Bct
Jun05 A01 BctJun05 A01 Bct
Jun05 A01 Bct
 
International Trade Compliance Strategy Responsibility Matrix
International Trade Compliance Strategy Responsibility MatrixInternational Trade Compliance Strategy Responsibility Matrix
International Trade Compliance Strategy Responsibility Matrix
 
High stakes world of Mobile Payments
High stakes world of Mobile PaymentsHigh stakes world of Mobile Payments
High stakes world of Mobile Payments
 
High stakes-world-of-mobile-payments-infographic
High stakes-world-of-mobile-payments-infographicHigh stakes-world-of-mobile-payments-infographic
High stakes-world-of-mobile-payments-infographic
 
9 18 Part 2
9 18 Part 29 18 Part 2
9 18 Part 2
 
3AMIGAS - Keynote: Pjotr Van Schothorst, VStep
3AMIGAS - Keynote: Pjotr Van Schothorst, VStep3AMIGAS - Keynote: Pjotr Van Schothorst, VStep
3AMIGAS - Keynote: Pjotr Van Schothorst, VStep
 
The Content Creation Workflow of the Ship Simulator Game - A Case Study
The Content Creation Workflow of the Ship Simulator Game - A Case StudyThe Content Creation Workflow of the Ship Simulator Game - A Case Study
The Content Creation Workflow of the Ship Simulator Game - A Case Study
 
Seo in-singapore
Seo in-singaporeSeo in-singapore
Seo in-singapore
 
Seo conferences-2011
Seo conferences-2011Seo conferences-2011
Seo conferences-2011
 
Are you paying attention
Are you paying attentionAre you paying attention
Are you paying attention
 
Brentwood Park Disc Golf Course Map
Brentwood Park Disc Golf Course MapBrentwood Park Disc Golf Course Map
Brentwood Park Disc Golf Course Map
 
Timeline 1
Timeline 1Timeline 1
Timeline 1
 
Crompton Way Traffic Proposal Map
Crompton Way Traffic Proposal MapCrompton Way Traffic Proposal Map
Crompton Way Traffic Proposal Map
 
Hse Product Promo
Hse Product PromoHse Product Promo
Hse Product Promo
 

Cassandra conference

  • 1. Issues and Tips for Big Data on Cassandra. Shotaro Kamio, Architecture and Core Technology dept., DU, Rakuten, Inc.
  • 2. Contents: 1. Big Data Problem in Rakuten; 2. Contributions to Cassandra Project; 3. System Architecture; 4. Details and Tips; 5. Conclusion
  • 4. Big Data Problem in Rakuten • User data increases exponentially. – More than 1 billion records. – The total size doubles every two years. • We need a scalable solution to handle this big data. [Figure: total size by month-year, Jun-97 through Dec-10, annotated "x2" over 2 years]
  • 5. Importance of Data Store in Rakuten • Rakuten has a lot of data – User data, item data, reviews, etc. • Connectivity to Hadoop is expected. • High-performance, fault-tolerant, scalable storage is necessary → Cassandra. [Diagram: Service A, Service B, Service C, …; Data A, Data B]
  • 6. Performance of New System (Cassandra) • Stores all data in 1 day – Achieved 15,000 updates/sec with quorum. – 50 times faster than the DB. • Good read throughput – Handles more than 100 read threads at a time. [Chart: DB vs. New, 15,000 updates/sec, x50]
  • 8. Contributions to Cassandra Project • Tested 0.7.x to 0.8.x • Bug reports / feedback to JIRA – CASSANDRA-2212, 2297, 2406, 2557, 2626 and more – Bugs related to specific conditions, secondary indexes, and large datasets. • Contributed patches – Discussed in later slides.
  • 9. JIRA: Overflow in bytesPastMark(..) • https://issues.apache.org/jira/browse/CASSANDRA-2297 • We hit the error on a row larger than 60 GB – The row has column families of super column type • The bytesPastMark method was fixed to return a long value.
  • 10. JIRA: Stack overflow while compacting • https://issues.apache.org/jira/browse/CASSANDRA-2626 • A long series of compactions causes a stack overflow. ← This occurs with large datasets. • We helped with debugging.
  • 11. Challenges in OSS • Not well tested with real big data. → Rakuten can give a lot of feedback to the community. – Bug reports, patches, and communication. • OSS becomes much more stable.
  • 12. Contribution of Patches • Column name aliasing – Encodes column names in a compact way. – Useful for reducing data size for structured (relational) data. – Reduces SSTable size by 15%. • Variable-length quantity (VLQ) compression – Reduces encoding overhead in columns. – Reduces SSTable size by 17%.
  • 13. VLQ Compression Patch • The serializer is changed to use VLQ encoding. • A typical column has fixed-length fields of: – 2 bytes for column name length – 1 byte for flag – 8 bytes for TTL, deletion time – 8 bytes for timestamp – 4 bytes for length of value. • These encoding overheads are reduced.
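The slides do not show the patch's wire format, so the following is only a minimal sketch of a standard unsigned VLQ (7 data bits per byte, continuation bit in the high bit), not the actual patch. The class and method names are hypothetical. It illustrates why small length fields shrink: a value length below 128 takes 1 byte instead of a fixed 4.

```java
import java.util.Arrays;

/** Sketch of unsigned variable-length quantity (VLQ) encoding; not the actual Cassandra patch. */
final class VlqSketch {
    /** Encodes a non-negative value, 7 bits per byte, most significant group first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        byte[] tmp = new byte[10];                 // 9 bytes are enough for 63 bits
        int pos = tmp.length;
        tmp[--pos] = (byte) (value & 0x7F);        // last byte: continuation bit clear
        value >>>= 7;
        while (value != 0) {
            tmp[--pos] = (byte) ((value & 0x7F) | 0x80); // earlier bytes: continuation bit set
            value >>>= 7;
        }
        return Arrays.copyOfRange(tmp, pos, tmp.length);
    }

    /** Decodes a byte sequence produced by encode(). */
    static long decode(byte[] bytes) {
        long value = 0;
        for (byte b : bytes) {
            value = (value << 7) | (b & 0x7F);     // append the next 7-bit group
        }
        return value;
    }

    public static void main(String[] args) {
        long valueLength = 300;                    // hypothetical column value length
        byte[] encoded = encode(valueLength);
        System.out.println(encoded.length + " byte(s) instead of a fixed 4");
    }
}
```

Large values such as 64-bit timestamps gain little from plain VLQ, so the overall saving depends on how the fields are distributed in the data.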
  • 15. System Architecture [Diagram: DB … DB, Batch, Data feeder, Cassandra 1, Cassandra 2, Services, Backup]
  • 17. Planning: Schema Design • Data modeling is a key to scalability. • Design the schema – Query patterns for super columns and normal columns. • Think about queries based on use cases. – Use batch operations to reduce the number of requests, because Thrift has communication overhead. • Secondary index – We used it to find updated data. • Choose the partitioner appropriately. – One partitioner per cluster.
  • 18. Secondary Index • Pros – Useful for querying based on a column value. – It can reduce consistency problems. – For example, querying updated data based on update time. • Cons – Performance of a complex query depends on the data. E.g., Year == 2011 and Price < 100
  • 19. A Bit of Detail on Secondary Index • Works like a hash + filters. 1. Pick up a row which has a key for the index (hash). 2. Apply filters. – Collect the result if all filters are matched. 3. Repeat until the requested number of rows is obtained. E.g., Year == 2011 and Price < 100: Key1: Year = 2011; Key2: Year = 2011, Price = 1,000; Key3: Year = 2011, Price = 10; Key4: Year = 2011, Price = 10,000; Key5: Year = 2011, Price = 200. Many keys have Year = 2011, but only a few are results.
  • 20. A Bit of Detail on Secondary Index (2) • Consider the frequency of results for the query – Very few results in a large data set → the query might time out. • Careful data/query design is necessary at this moment. • An improvement is being discussed: CASSANDRA-2915
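The "hash + filters" behavior described on slides 19 and 20 can be sketched as follows. This is a toy in-memory model under stated assumptions, not Cassandra's implementation: the class, fields, and method names are hypothetical, and the rows are the slide's Key2 to Key5 (Key1 has no price on the slide, so it is left out). The point is that the work done is proportional to the number of index candidates examined, not to the number of results returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an index lookup followed by filtering; not Cassandra's implementation. */
final class IndexScanSketch {
    static final class Row {
        final String key;
        final int year;
        final int price;

        Row(String key, int year, int price) {
            this.key = key;
            this.year = year;
            this.price = price;
        }
    }

    // The "hash" part: indexed column value -> rows carrying that value.
    private final Map<Integer, List<Row>> yearIndex = new HashMap<>();
    // Number of index candidates examined by queries so far.
    int examined = 0;

    void insert(Row row) {
        yearIndex.computeIfAbsent(row.year, y -> new ArrayList<>()).add(row);
    }

    /** Returns up to 'limit' keys with the given year and price below 'priceBelow'. */
    List<String> query(int year, int priceBelow, int limit) {
        List<String> matches = new ArrayList<>();
        List<Row> candidates = yearIndex.getOrDefault(year, Collections.<Row>emptyList());
        for (Row candidate : candidates) {
            if (matches.size() >= limit) {
                break;                              // enough rows collected
            }
            examined++;                             // every candidate costs a read
            if (candidate.price < priceBelow) {     // the "filters" part
                matches.add(candidate.key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        IndexScanSketch index = new IndexScanSketch();
        index.insert(new Row("Key2", 2011, 1000));
        index.insert(new Row("Key3", 2011, 10));
        index.insert(new Row("Key4", 2011, 10000));
        index.insert(new Row("Key5", 2011, 200));
        System.out.println(index.query(2011, 100, 2) + " after examining " + index.examined + " rows");
    }
}
```

When matches are rare, a query that asks for more rows than exist has to walk every candidate for the indexed value, which is the situation in which the slide warns about timeouts.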
  • 21. Planning: Data Size Estimation • Estimate future data volume. • Serialization overhead: x 3 to 4 – The overhead is large for small data. – We improved it with custom patches and compression code. • Cassandra 1.0 can use Snappy/Deflate compression. • Replication: x 3 (depends on your decision) • Compaction: x 2 or above
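As a rough worked example of how the multipliers on slide 21 compound, the sketch below simply multiplies them together. The 1 TB raw figure and all names are hypothetical; the factors are the ones quoted on the slide, and real planning would also add the snapshot and backup space mentioned later in the deck.

```java
/** Back-of-the-envelope disk estimate from the multipliers on the slide; names are hypothetical. */
final class CapacityEstimate {
    /**
     * @param rawTb               raw data volume in TB (before Cassandra serialization)
     * @param serializationFactor serialization overhead, x 3 to 4 on the slide
     * @param replicationFactor   number of replicas, x 3 on the slide
     * @param compactionHeadroom  free-space factor for compaction, x 2 or above on the slide
     */
    static double requiredDiskTb(double rawTb, double serializationFactor,
                                 int replicationFactor, double compactionHeadroom) {
        return rawTb * serializationFactor * replicationFactor * compactionHeadroom;
    }

    public static void main(String[] args) {
        double low = requiredDiskTb(1.0, 3.0, 3, 2.0);   // optimistic end of the range
        double high = requiredDiskTb(1.0, 4.0, 3, 2.0);  // pessimistic end of the range
        System.out.println("1 TB of raw data needs roughly " + low + " to " + high + " TB of disk");
    }
}
```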
  • 22. Other Factors for Data Size • Obsolete SSTables – Disk usage may stay high after compaction. – Cassandra 0.8.x relies on GC to remove obsolete SSTables. – Improved in 1.0. • How to balance data distribution – Disk usage can be unbalanced (ByteOrderedPartitioner). – Partitioning, key design, initial token assignment. – It is very helpful to know the data in advance. • The backup scheme affects disk space – Backup space is needed. – Discussed later.
  • 23. Configuration • We adopted Cassandra 0.8.x + custom patches. • Without mmap – No noticeable difference in performance – Easier to monitor and debug memory usage and GC-related issues • ulimit – Avoid file descriptor shortage. The limit needs to exceed the number of db files. Bug?? – “memlock unlimited” for JNA – Create /etc/security/limits.d/cassandra.conf (Red Hat)
  • 24. JVM / GC • Full GC must be avoided at all times. • The JVM cannot make good use of a heap larger than 15 GB. – GC becomes slow and can be unstable. – Don’t put too much data/cache into the heap. – Off-heap cache is available in 0.8.1 • Cassandra may use more memory than the heap size. – ulimit -d 25000000 (max data segment size) – ulimit -v 75000000 (max virtual memory size) • Benchmarking is needed to find appropriate parameters.
  • 25. Parameter Tuning for Failure Detector • Cassandra uses the Phi Accrual Failure Detector – "The φ Accrual Failure Detector" [SRDS'04] • Failure detection errors occur when a node is under heavy access and/or running GC. • The tuning depends on the number of nodes: – Larger cluster, larger number. Code shown on the slide:
        // phi grows with the time elapsed since the last heartbeat arrived
        double phi(long tnow)
        {
            int size = arrivalIntervals_.size();
            double log = 0d;
            if ( size > 0 )
            {
                double t = tnow - tLast_;
                double probability = p(t);
                log = (-1) * Math.log10( probability );
            }
            return log;
        }

        // exponential model: probability that a heartbeat arrives later than t
        double p(double t)
        {
            double mean = mean();
            double exponent = (-1)*(t)/mean;
            return Math.pow(Math.E, exponent);
        }
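The slide's two methods reduce to phi = t / (mean × ln 10), where t is the time since the last heartbeat and mean is the mean arrival interval. The standalone sketch below restates that calculation so the effect of a threshold can be tried out; the class name, the method names, and the threshold of 8 are assumptions for illustration and are not taken from the slide.

```java
/** Standalone restatement of the slide's phi calculation; names and threshold are hypothetical. */
final class PhiSketch {
    /** phi = -log10(exp(-t / mean)), which simplifies to t / (mean * ln 10). */
    static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        double probability = Math.exp(-millisSinceLastHeartbeat / meanIntervalMillis);
        return -Math.log10(probability);
    }

    /** A node is suspected once phi exceeds the configured threshold. */
    static boolean suspect(double phi, double threshold) {
        return phi > threshold;
    }

    public static void main(String[] args) {
        double mean = 1000.0;       // hypothetical mean heartbeat interval in ms
        double threshold = 8.0;     // hypothetical threshold
        for (double pause : new double[] {1000.0, 10000.0, 20000.0}) {
            double value = phi(pause, mean);
            System.out.println(pause + " ms silent: phi=" + value + " suspect=" + suspect(value, threshold));
        }
    }
}
```

Under this model a node is only suspected after it has been silent for about threshold × ln 10 mean intervals, which is why a GC pause or a burst of load that delays heartbeats can trigger false detections when the threshold is low.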
  • 26. Hardware • Benchmarking is important for deciding on hardware. – Requirements for performance, data size, etc. – Cassandra is good at utilizing CPU cores. • Network ports will be the bottleneck when scaling out… – A large number of low-spec servers, or – A small number of high-spec servers. Our case: • High-spec CPUs and SSD drives • 2 clusters (active and test cluster)
  • 28. Customize Hector Library • A query can time out on Cassandra: – When Cassandra is temporarily under high load. – When a large result set is requested. – When a secondary index query times out. • Hector retries forever when a query times out. • The client cannot detect the infinite loop. • Customization: – After 3 timeouts, return an exception to the client.
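The customization on slide 28 amounts to bounding the number of retries. The sketch below shows that idea in plain Java; it deliberately does not use Hector's API, whose retry hooks are not shown in the slides, and the class and method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Generic bounded-retry wrapper; illustrates the idea only and is not Hector code. */
final class BoundedRetry {
    /**
     * Runs the query, retrying on timeout. After maxTimeouts timeouts the last
     * TimeoutException is rethrown so the caller can react instead of waiting forever.
     */
    static <T> T call(Callable<T> query, int maxTimeouts) throws Exception {
        if (maxTimeouts < 1) {
            throw new IllegalArgumentException("maxTimeouts must be at least 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxTimeouts; attempt++) {
            try {
                return query.call();
            } catch (TimeoutException e) {
                last = e;               // remember the timeout and try again
            }
        }
        throw last;                     // give up: surface the timeout to the client
    }

    public static void main(String[] args) {
        try {
            // Hypothetical query that always times out.
            BoundedRetry.call(() -> { throw new TimeoutException("query timed out"); }, 3);
        } catch (Exception e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

Only timeouts are retried here; any other exception propagates immediately, which keeps genuine errors from being masked by the retry loop.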
  • 30. Testing: Data Consistency Check Tool • We wanted to make sure data is not corrupted within Cassandra. • We made a tool to check data consistency. [Diagram: input data (inserts, updates, deletes; comes in periodically) → Process A inserts, updates, and deletes data in the Cassandra database; another Process B compares the data with that in Cassandra]
  • 31. Testing: Data Consistency Check Tool (2) • Compares only the keys of the data, not the contents. • Useful for diagnosing which part is wrong in the test phase. • We found another team’s bug as well.
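The tool itself is not shown in the slides. A key-only comparison like the one described could look roughly like the sketch below, assuming both key sets fit in memory; the class and method names are hypothetical, and how the keys are read from the input feed and from Cassandra is left out.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Key-only consistency check between the input feed and the store; names are hypothetical. */
final class KeyConsistencyCheck {
    /** Keys that the input says should exist but the store does not have. */
    static Set<String> missingInStore(Set<String> expectedKeys, Set<String> storedKeys) {
        Set<String> missing = new TreeSet<>(expectedKeys);
        missing.removeAll(storedKeys);
        return missing;
    }

    /** Keys present in the store that the input does not account for (e.g. lost deletes). */
    static Set<String> unexpectedInStore(Set<String> expectedKeys, Set<String> storedKeys) {
        Set<String> unexpected = new TreeSet<>(storedKeys);
        unexpected.removeAll(expectedKeys);
        return unexpected;
    }

    public static void main(String[] args) {
        Set<String> expected = new TreeSet<>(Arrays.asList("a", "b", "c"));
        Set<String> stored = new TreeSet<>(Arrays.asList("b", "c", "d"));
        System.out.println("missing: " + missingInStore(expected, stored));
        System.out.println("unexpected: " + unexpectedInStore(expected, stored));
    }
}
```

Comparing keys only cannot detect a row whose contents are wrong, but it keeps the check cheap and still separates lost inserts from lost deletes.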
  • 32. Repair • Some types of queries don’t trigger read repair. • Nodetool repair is tricky on big data. – Disk usage – Time-consuming → Read all data afterward: read repair • Discussion of improvements is ongoing: – CASSANDRA-2699
  • 34. Backup Scheme • Backup might be required to shorten recovery time. 1. Snapshot to local disk – Plan disk size at the server estimation phase. 2. Full backup of input data – We ran a full data feed several times for various reasons, e.g., logic change, schema change, data corruption, etc. [Diagram: incoming data from DB … DB → Cassandra, with Backup and Snapshots]
  • 36. Conclusion • Rakuten uses Cassandra with big data. • We’ll continue contributing to OSS.
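  • 38. We are hiring! We are actively recruiting mid-career engineers! Rakuten's Mission: Empower people and society (through the Internet), and transform and enrich society through our own success. Rakuten's Goal: To become the No.1 Internet Service Company in the World. If you share Rakuten's Mission & Goal, please contact us! tech-career@mail.rakuten.com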