Cassandra FTW
                           Andrew Byde
                           Principal Scientist




Monday, 15 August 2011
Menu

                   • Introduction
                   • Data model + storage architecture
                   • Partitioning + replication
                   • Consistency
                   • De-normalisation

History + design




History

                   • 2007: Started at Facebook for inbox search
                   • July 2008: Open sourced by Facebook
                   • March 2009: Apache Incubator
                   • February 2010: Apache top-level project
                   • May 2011: Version 0.8
What it’s good for

                   • Horizontal scalability
                   • No single point of failure
                   • Multi-data centre support
                   • Very high write workloads
                   • Tuneable consistency

What it’s not so good for

                   • Transactions
                   • Read heavy workloads
                   • Low latency applications
                         •   compared to in-memory dbs




Data model




Keyspaces and Column Families
                     SQL                 Cassandra

                     Database            Keyspace
                     Table               Column Family

                           Keyspaces & CFs have different
                            sets of configuration settings
Column Family

                         key: {
                            column: value,
                            column: value,
                            ...
                          }



Rows and columns
                         col1   col2   col3   col4   col5   col6   col7
                 row1            x                    x      x
                 row2     x      x      x      x      x
                 row3            x      x             x      x      x
                 row4            x      x      x             x
                 row5            x             x      x      x
                 row6            x
                 row7     x      x             x



Reads
               • get
               • get_slice          One row, some cols
                • name predicate
                • slice range
               • multiget_slice     Multiple rows
               • get_range_slices
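
The read operations listed above can be sketched over a plain dictionary of rows. This is a toy model under stated assumptions, not the Thrift API: the function signatures, the `CF` contents and the made-up cell values are hypothetical, and `get_range_slices` assumes row keys are kept in sorted order.

```python
# Toy model of a column family: row key -> {column name: value}.
# The layout mirrors the "Rows and columns" grid; values are made up.
GRID = {
    "row1": ["col2", "col5", "col6"],
    "row2": ["col1", "col2", "col3", "col4", "col5"],
    "row3": ["col2", "col3", "col5", "col6", "col7"],
    "row4": ["col2", "col3", "col4", "col6"],
    "row5": ["col2", "col4", "col5", "col6"],
    "row6": ["col2"],
    "row7": ["col1", "col2", "col4"],
}
CF = {row: {col: f"{row}/{col}" for col in cols} for row, cols in GRID.items()}


def get(cf, key, column):
    """One column of one row."""
    return cf[key][column]


def get_slice(cf, key, names=None, start="", finish=""):
    """One row, some columns, chosen by a name predicate or a slice range."""
    row = cf.get(key, {})
    if names is not None:  # name predicate: an explicit list of column names
        return {c: row[c] for c in names if c in row}
    # slice range: columns between start and finish in sorted order;
    # an empty finish means "unbounded" in this sketch
    return {
        c: row[c]
        for c in sorted(row)
        if c >= start and (finish == "" or c <= finish)
    }


def multiget_slice(cf, keys, **predicate):
    """The same column selection applied to several named rows."""
    return {k: get_slice(cf, k, **predicate) for k in keys}


def get_range_slices(cf, start_key, end_key, **predicate):
    """The same column selection applied to a contiguous range of row keys."""
    keys = sorted(k for k in cf if start_key <= k <= end_key)
    return {k: get_slice(cf, k, **predicate) for k in keys}
```

For example, `get_slice(CF, "row2", start="col2", finish="col4")` selects three adjacent columns of one row, while `multiget_slice(CF, ["row1", "row7"], names=["col2"])` picks the same named column from two rows.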
get
                         col1   col2   col3   col4   col5   col6   col7
                 row1            x                    x      x
                 row2     x      x      x      x      x
                 row3            x      x             x      x      x
                 row4            x      x      x             x
                 row5            x             x      x      x
                 row6            x
                 row7     x      x             x



get_slice: name predicate
                         col1   col2   col3   col4   col5   col6   col7
                 row1            x                    x      x
                 row2     x      x      x      x      x
                 row3            x      x             x      x      x
                 row4            x      x      x             x
                 row5            x             x      x      x
                 row6            x
                 row7     x      x             x



get_slice: slice range
                          col1   col2   col3   col4   col5   col6   col7
                 row1             x                    x      x
                 row2      x      x      x      x      x
                 row3             x      x             x      x      x
                 row4             x      x      x             x
                 row5             x             x      x      x
                 row6             x
                 row7      x      x             x



multiget_slice: name predicate
                          col1   col2   col3   col4   col5   col6   col7
                 row1             x                    x      x
                 row2      x      x      x      x      x
                 row3             x      x             x      x      x
                 row4             x      x      x             x
                 row5             x             x      x      x
                 row6             x
                 row7      x      x             x


get_range_slices: slice range
                         col1   col2   col3   col4   col5   col6   col7
                 row1            x                    x      x
                 row2     x      x      x      x      x
                 row3            x      x             x      x      x
                 row4            x      x      x             x
                 row5            x             x      x      x
                 row6            x
                 row7     x      x             x



Storage architecture



Data Layout
                                     writes
                   • key-value insert goes to both:
                         •   on-disk, un-ordered commit log
                         •   in-memory, (key,col)-sorted memtable
                   • flush: memtable is written out as on-disk,
                         (key,col)-sorted SSTables
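
The write path on this slide can be sketched as follows. This is a minimal in-memory illustration, not Cassandra's implementation: the `ToyStore` class, its `memtable_limit` threshold and the choice to clear the whole commit log on flush are assumptions made for the example.

```python
class ToyStore:
    """Commit log + memtable + SSTables, all held in Python lists and dicts."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []  # append-only and un-ordered (on disk in the real system)
        self.memtable = {}    # (key, col) -> value, held in memory
        self.sstables = []    # each entry is an immutable list sorted by (key, col)
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def insert(self, key, col, value):
        self.commit_log.append((key, col, value))  # sequential append, no seek
        self.memtable[(key, col)] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out in (key, col) order, then start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs to be replayed


store = ToyStore(memtable_limit=2)
store.insert("b", "c1", 1)
store.insert("a", "c2", 2)  # second insert reaches the limit and triggers a flush
store.insert("a", "c1", 3)  # lands in the new, empty memtable
```

Neither step of an insert needs a disk seek, which is one way to read the earlier claim that very high write workloads are a good fit.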
Data Layout
                            SSTables

                   • Each SSTable consists of:
                         •   Bloom Filter
                         •   Index
                         •   Data




Data Layout
                                       reads
                   [Figure: a read (?) is checked against three SSTables;
                    two of them are marked X]
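
One way to picture how a read can rule out SSTables is a Bloom filter check before each lookup. The sketch below is illustrative only: the filter size, the MD5-based hash positions and the "later SSTable wins" rule are assumptions, and the real system merges columns by timestamp.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(sstables, key):
    """Return (value, number of SSTables actually searched)."""
    value, searched = None, 0
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # definitely absent: skip this SSTable without touching its data
        searched += 1
        if key in table.rows:
            value = table.rows[key]  # simplification: later SSTables win
    return value, searched


tables = [SSTable({"k1": "v1"}), SSTable({"k2": "v2"}), SSTable({"k3": "v3"})]
value, searched = read(tables, "k3")
```

Because a Bloom filter can report false positives, `searched` may be larger than the number of SSTables that really hold the key, but an SSTable that holds it is never skipped.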




Distribution: Partitioning + Replication


Partitioning + Replication



           (k, v)
                         ?




Partitioning + Replication
                   • Partitioning data on to nodes
                    • load balancing
                    • row-based
                   • Replication
                    • to protect against failure
                    • better availability
Partitioning
                   • Random: take hash of row key
                         •   good for load balancing

                         •   bad for range queries

                   • Ordered: subdivide key space
                         •   bad for load balancing

                         •   good for range queries

                   • Or build your own...
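
The trade-off between the two partitioners can be sketched by how each turns a row key into a token. This is a simplified illustration: the helper names and ring tokens are hypothetical, and the full 128-bit MD5 value is used as the token for brevity.

```python
import hashlib
from bisect import bisect_left


def random_token(key):
    """Random partitioning: the token is a hash of the row key."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


def ordered_token(key):
    """Ordered partitioning: the key itself is the token, so order is preserved."""
    return key


def owner(ring_tokens, token):
    """ring_tokens is sorted; each node owns the range that ends at its token."""
    i = bisect_left(ring_tokens, token)
    return ring_tokens[i % len(ring_tokens)]  # wrap around past the last token


ordered_ring = ["g", "n", "t", "z"]              # hypothetical node tokens
hashed_ring = [i * 2**126 for i in range(1, 5)]  # four evenly spaced tokens

alice_node = owner(ordered_ring, ordered_token("alice"))
bob_node = owner(ordered_ring, ordered_token("bob"))
hashed_node = owner(hashed_ring, random_token("alice"))
```

With the ordered ring, "alice" and "bob" land on the same node, which keeps a range scan local but concentrates load; with hashed tokens, neighbouring keys scatter around the ring.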
Simple Replication

                   • Nodes arranged on a ‘ring’
                   • (k, v) is written to its primary location
                   • Extra copies are successors on the ring
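
A minimal sketch of this placement rule, assuming integer tokens and a hypothetical four-node ring; the function name `replicas` and the replication factor argument `rf` are illustrative.

```python
from bisect import bisect_left


def replicas(ring, token, rf):
    """ring: sorted list of (token, node). Primary first, then ring successors."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)  # primary location, with wrap-around
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


ring = [(0, "n1"), (25, "n2"), (50, "n3"), (75, "n4")]  # hypothetical tokens
placement = replicas(ring, token=30, rf=3)
```

Here token 30 falls in the range owned by `n3`, so the three copies go to `n3`, `n4` and, wrapping around the ring, `n1`.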
Topology-aware Replication
                   • Snitch: node IP → (DataCenter, rack)

                   • EC2Snitch
                         •   Region → DC; availability_zone → rack

                   • PropertyFileSnitch
                         •   Configured from a file



Topology-aware Replication
                   [Figure: (k, v) replicated across two data centers, DC 1 and DC 2,
                    each with racks r1 and r2]
                   • extra copies to different data center
                   • spread across racks within a data center
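
The two placement goals above can be sketched with a snitch-style lookup table. Everything here is hypothetical: the node names, the `SNITCH` table, the `place` function and its two-pass rack preference are one simple way to meet those goals, not a description of Cassandra's own strategy code.

```python
# Hypothetical snitch output: node -> (data center, rack).
SNITCH = {
    "a1": ("DC1", "r1"), "a2": ("DC1", "r1"), "a3": ("DC1", "r2"),
    "b1": ("DC2", "r1"), "b2": ("DC2", "r1"), "b3": ("DC2", "r2"),
}


def place(ring_order, snitch, per_dc):
    """ring_order: nodes in ring order, starting at the primary location.
    per_dc: {data center: number of replicas wanted there}."""
    chosen = []
    for dc, wanted in per_dc.items():
        nodes = [n for n in ring_order if snitch[n][0] == dc]
        picked, racks = [], set()
        for n in nodes:  # first pass: at most one replica per rack
            if len(picked) < wanted and snitch[n][1] not in racks:
                picked.append(n)
                racks.add(snitch[n][1])
        for n in nodes:  # second pass: reuse racks only if there are too few
            if len(picked) < wanted and n not in picked:
                picked.append(n)
        chosen += picked
    return chosen


ring_order = ["a1", "a2", "b1", "a3", "b2", "b3"]
placement = place(ring_order, SNITCH, {"DC1": 2, "DC2": 2})
```

With two replicas per data center, `a2` and `b2` are passed over because their racks already hold a copy.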

Distribution: Consistency



Consistency Level

                   • How many replicas must respond in order to
                         declare success
                   • W of the N replicas must succeed for a write to succeed
                         •   write with client-generated timestamp

                   • R of the N replicas must succeed for a read to succeed
                         •   return most recent, by timestamp
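
A toy version of these rules, assuming each replica is a plain dictionary and a replica that is down is represented by `None`. The function names, the `Unavailable` exception and the choice to read from the first `r` live replicas are assumptions of this sketch.

```python
class Unavailable(Exception):
    """Raised when fewer replicas respond than the consistency level needs."""


def write(replicas, key, value, timestamp, w):
    """replicas: one dict per replica, or None for a replica that is down."""
    acks = 0
    for rep in replicas:
        if rep is None:
            continue
        current = rep.get(key)
        if current is None or current[0] < timestamp:
            rep[key] = (timestamp, value)  # the newer client timestamp wins
        acks += 1
    return acks >= w  # success only if at least W replicas acknowledged


def read(replicas, key, r):
    responses = [rep.get(key) for rep in replicas if rep is not None]
    if len(responses) < r:
        raise Unavailable(f"needed {r} responses, got {len(responses)}")
    found = [resp for resp in responses[:r] if resp is not None]
    if not found:
        return None
    return max(found, key=lambda resp: resp[0])[1]  # most recent, by timestamp


n1, n2, n3 = {}, {}, {}
first_ok = write([n1, None, n3], "k", "v", timestamp=1, w=2)     # n2 is down
second_ok = write([n1, None, None], "k", "v2", timestamp=2, w=2)  # only n1 is up
latest = read([n1, n2, n3], "k", r=3)
```

Note that the second write fails its consistency level yet still lands on `n1`; there is no rollback, which fits the earlier point that transactions are not a strength.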


Consistency Level

                   • 1, 2, 3 responses
                   • Quorum (more than half)
                   • Quorum in local data center
                   • Quorum in each data center
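
The quorum levels reduce to simple arithmetic on the replication factor. The helper names below are illustrative, and `overlaps` adds the standard observation that a read is guaranteed to see the latest successful write when R + W > N.

```python
def quorum(n):
    """More than half of n replicas."""
    return n // 2 + 1


def local_quorum(rf_by_dc, local_dc):
    """Quorum counted only among replicas in the local data center."""
    return quorum(rf_by_dc[local_dc])


def each_quorum(rf_by_dc):
    """A separate quorum is required in every data center."""
    return {dc: quorum(rf) for dc, rf in rf_by_dc.items()}


def overlaps(n, r, w):
    """True if every read set must share at least one replica with every write set."""
    return r + w > n


rf_by_dc = {"DC1": 3, "DC2": 3}  # hypothetical replication factors
```

With six replicas in total, a plain quorum needs 4 responses from anywhere, while a local quorum needs only 2 from the nearby data center.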

Maintaining consistency

                   • Read repair
                   • Hinted handoff
                   • Anti-entropy


Read repair
                   • If the replicas disagree on read, send most
                         recent data back

                           read k?           n1   v, t1
                                             n2   not found!
                                             n3   v’, t2

                           then:             write (k, v’, t2)
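
The read repair sequence can be sketched as below, assuming each replica stores `(timestamp, value)` pairs and that t2 is later than t1. The function name and the returned list of repaired nodes are illustrative additions.

```python
def read_with_repair(replicas, key):
    """replicas: {node name: {key: (timestamp, value)}}.
    Returns (value, names of the replicas that were repaired)."""
    answers = {name: data.get(key) for name, data in replicas.items()}
    found = [a for a in answers.values() if a is not None]
    if not found:
        return None, []
    newest = max(found, key=lambda tv: tv[0])  # most recent, by timestamp
    stale = [name for name, a in answers.items() if a != newest]
    for name in stale:
        replicas[name][key] = newest  # send the most recent data back
    return newest[1], stale


replicas = {
    "n1": {"k": (1, "v")},   # older value, t1
    "n2": {},                # not found
    "n3": {"k": (2, "v'")},  # newer value, t2
}
value, repaired = read_with_repair(replicas, "k")
```

After the call all three replicas hold `(2, "v'")`, so the next read finds them in agreement.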


Hinted handoff

                   • When a node is unavailable, writes destined for it
                         can be written to any other node as a hint
                   • The hint is delivered when the node comes back
                         online
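
A minimal sketch of hinted handoff, assuming a hypothetical `Node` class with an `up` flag and an explicit `deliver_hints` call standing in for whatever mechanism notices that the node has returned.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, key, value) held for other nodes


def write(target, key, value, fallback):
    """Write to the target replica, or leave a hint on another node if it is down."""
    if target.up:
        target.data[key] = value
    else:
        fallback.hints.append((target.name, key, value))


def deliver_hints(holder, nodes_by_name):
    """Replay stored hints to every target that is back online; keep the rest."""
    remaining = []
    for name, key, value in holder.hints:
        node = nodes_by_name[name]
        if node.up:
            node.data[key] = value
        else:
            remaining.append((name, key, value))
    holder.hints = remaining


a, b = Node("a"), Node("b")
nodes = {"a": a, "b": b}
b.up = False
write(b, "k", "v", fallback=a)  # b is down, so a keeps a hint
hints_while_down = list(a.hints)
b.up = True
deliver_hints(a, nodes)         # b is back, so the hint is replayed
```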




Anti-entropy

                   • Equivalent to ‘read repair all’
                   • Requires reading all data (woah)
                         •   (Although only hashes are sent to calculate diffs)

                   • Manual process
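
The idea that only hashes are exchanged can be sketched with per-bucket digests. This is a simplification under stated assumptions: keys are grouped into a flat list of hypothetical buckets rather than a hash tree, and the function names are illustrative.

```python
import hashlib


def bucket_hashes(data, buckets=4):
    """Hash the full contents of each bucket; this is the 'read all data' part."""
    groups = [[] for _ in range(buckets)]
    for key in sorted(data):
        b = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
        groups[b].append((key, data[key]))
    return [hashlib.md5(repr(g).encode()).hexdigest() for g in groups]


def differing_buckets(a, b, buckets=4):
    """Compare two replicas by exchanging only their bucket hashes."""
    ha, hb = bucket_hashes(a, buckets), bucket_hashes(b, buckets)
    return [i for i in range(buckets) if ha[i] != hb[i]]


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = dict(replica_a)
in_sync = differing_buckets(replica_a, replica_b)
replica_b["k2"] = "stale"
out_of_sync = differing_buckets(replica_a, replica_b)
```

Only the buckets listed in `out_of_sync` would need their data streamed between the replicas; matching buckets cost a hash comparison and nothing more.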




De-normalisation




De-normalisation

                   • Disk space is much cheaper than disk seeks
                   • Read at 100 MB/s, seek at 100 IO/s
                   • => copy data to avoid seeks
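
The arithmetic behind this argument, using the slide's two figures. The message count and message size in the example are hypothetical, and the model ignores caching and everything else except seeks and sequential reads.

```python
READ_BYTES_PER_S = 100 * 10**6  # 100 MB/s sequential read (from the slide)
SEEKS_PER_S = 100               # 100 IO/s random seeks (from the slide)

# Amount of data that could be read sequentially in the time of one seek.
bytes_per_seek = READ_BYTES_PER_S // SEEKS_PER_S


def fetch_time(seeks, bytes_read):
    """Seconds spent seeking plus seconds spent reading."""
    return seeks / SEEKS_PER_S + bytes_read / READ_BYTES_PER_S


# Hypothetical inbox: 50 messages of 2 kB each.
normalised = fetch_time(seeks=50, bytes_read=50 * 2000)   # one seek per message
denormalised = fetch_time(seeks=1, bytes_read=50 * 2000)  # one contiguous row
```

One seek costs as much time as reading 1 MB, so under these assumptions the scattered layout takes about 0.5 s against roughly 0.011 s for the contiguous copy.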


Inbox
                                         user2

                         user1   msg1
                                         user3
                                 msg2


                                 msg3    user4
                                  ...




Data-centric model
                         m1: {
                           sender: user1
                           content: “Mary had a little lamb”
                           recipients: user2, user3
                         }


               • but how to do ‘recipients’ for Inbox?
               • one-to-many modelled by a join table

To join
          m1: {
            sender: user1
            subject: “A rhyme”
            content: “Mary had a little lamb”
          }
          m2: {
            sender: user1
            subject: “colours”
            content: “Its fleece was white as snow”
          }
          m3: {
            sender: user1
            subject: “loyalty”
            content: “And everywhere that Mary went”
          }

          user2: {
            m1: true
          }
          user3: {
            m1: true
            m2: true
          }
          user4: {
            m2: true
            m3: true
          }


.. or not to join
                 • Joins are expensive, so de-normalise to trade
                         off space for time
                 • We can have lots of columns, so think BIG:
                 • Make message id a time-typed super-column.
                 • This makes get_slice an efficient way of
                         searching for messages in a time window



Super Column Family
                         user2: {
                           m1: {
                             sender: user1
                             subject: “A rhyme”
                           }
                         }
                         user3: {
                           m1: {
                             sender: user1
                             subject: “A rhyme”
                           }
                           m2: {
                             sender: user1
                             subject: “colours”
                           }
                         }
                         ...



De-normalisation + Cassandra
                 • have to write a copy of the record for each
                         recipient ... but writes are very cheap
                 • get_slice fetches columns for a particular
                         row, so gets received messages for a user
                 • on-disk column order is optimal for this
                         query
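
The de-normalised inbox can be sketched as one row per recipient, with columns kept in time order. The integer timestamps, the in-memory `inbox` dictionary and the function names are assumptions of this sketch; the senders and subjects come from the earlier slides.

```python
from bisect import bisect_left, bisect_right

inbox = {}  # recipient -> list of (timestamp, message summary), sorted by timestamp


def send(timestamp, sender, subject, recipients):
    """Write one copy of the record into every recipient's row."""
    for user in recipients:
        row = inbox.setdefault(user, [])
        row.append((timestamp, {"sender": sender, "subject": subject}))
        row.sort(key=lambda col: col[0])  # keep columns in time order


def messages_between(user, start, finish):
    """A get_slice-style range read over one row: messages in a time window."""
    row = inbox.get(user, [])
    times = [t for t, _ in row]
    return row[bisect_left(times, start):bisect_right(times, finish)]


send(1, "user1", "A rhyme", ["user2", "user3"])
send(2, "user1", "colours", ["user3", "user4"])
send(3, "user1", "loyalty", ["user4"])

window = messages_between("user3", 1, 2)
subjects = [msg["subject"] for _, msg in window]
```

Each send costs one write per recipient, but reading a user's messages for a time window touches a single row and a contiguous run of its columns.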



Conclusion




What it’s good for

                   • Horizontal scalability
                   • No single point of failure
                   • Multi-data centre support
                   • Very high write workloads
                   • Tuneable consistency

Q?



