Cassandra
chengxiaojun
Backend
  • Amazon: Dynamo
  • Facebook: Cassandra ("Dynamo 2.0"), built for Inbox search
  • Apache: Cassandra as an open-source project
Cassandra
  Dynamo-like features
    • Symmetric, P2P architecture
    • Gossip-based cluster management
    • DHT
    • Eventual consistency
  Bigtable-like features
    • Column family
    • SSTable disk storage
    • Commit log
    • Memtable
    • Immutable SSTable files
Data Model (1/2)
  A table is a distributed multi-dimensional map indexed by a key.
    • Keyspace
    • Column
    • Super Column
    • Column Family
    • Types
Data Model (2/2)
APIs
  Paper:
    • insert(table, key, rowMutation)
    • get(table, key, columnName)
    • delete(table, key, columnName)
  Wiki: http://wiki.apache.org/cassandra/API
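
The three calls from the paper can be pictured against the nested-map data model. The following is a minimal in-memory sketch, not Cassandra's API: the class name `ToyTable`, the explicit `family` argument, and the timestamp rule are assumptions made for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for one table:
    row key -> column family -> column name -> (value, timestamp)."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def insert(self, key, row_mutation, timestamp):
        # row_mutation is assumed to look like {column_family: {column_name: value}}.
        for family, columns in row_mutation.items():
            for name, value in columns.items():
                current = self.rows[key][family].get(name)
                # Keep the write that carries the newest timestamp.
                if current is None or timestamp >= current[1]:
                    self.rows[key][family][name] = (value, timestamp)

    def get(self, key, family, column_name):
        cell = self.rows.get(key, {}).get(family, {}).get(column_name)
        return None if cell is None else cell[0]

    def delete(self, key, family, column_name):
        self.rows.get(key, {}).get(family, {}).pop(column_name, None)
```

A call such as `table.insert("user1", {"profile": {"name": "Ann"}}, timestamp=1)` followed by `table.get("user1", "profile", "name")` reads the value back.
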
Architecture Layers
Partition (1/3)
  Consistent Hash Table
Partition (2/3)
  Problems:
    • The random position assignment of each node on the ring leads to non-uniform data and load distribution.
    • The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
  Two ways to address this:
    • Dynamo: one node is assigned to multiple positions in the circle.
    • Cassandra: analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
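
A consistent-hash ring with several positions per node (the Dynamo approach) might be sketched as follows. The names `Ring`, `ring_position`, and `positions_per_node`, and the use of MD5 over a `node#i` label, are assumptions for illustration rather than either system's implementation.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a label (node position or key) to a point on the ring."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")


class Ring:
    def __init__(self, nodes, positions_per_node=1):
        # Each node is placed at `positions_per_node` points on the ring.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(positions_per_node)
        )
        self._positions = [pos for pos, _ in self._points]

    def owner(self, key: str) -> str:
        # The owner is the first position clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]
```

Counting `Ring(nodes, 1).owner(k)` against `Ring(nodes, 64).owner(k)` over many keys is one way to observe how extra positions affect the spread of load.
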
Partition (3/3)
  Each Cassandra server (node) is assigned a unique Token that determines which keys it is the first replica for.
  Choices:
    • InitialToken: assigned manually.
    • RandomPartitioner: Tokens are integers from 0 to 2**127. Keys are converted to this range by MD5 hashing for comparison with Tokens.
    • NetworkTopologyStrategy: calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on.
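
The token arithmetic can be sketched like this. Evenly spaced tokens of the form i * 2**127 / N and the per-datacenter offset follow the slide; the helper names and the modulo used to fold the MD5 digest into the token range are assumptions, not the partitioner's actual code.

```python
import hashlib

RING_SIZE = 2 ** 127


def balanced_tokens(node_count: int, dc_index: int = 0):
    """Evenly spaced tokens for one datacenter.

    dc_index (0 for the first DC, 1 for the second, ...) is added to every
    token so that tokens stay unique across datacenters.
    """
    return [i * RING_SIZE // node_count + dc_index for i in range(node_count)]


def key_token(key: bytes) -> int:
    # Assumption: reduce the 128-bit MD5 digest into [0, 2**127) with a modulo.
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE
```

For a four-node datacenter, `balanced_tokens(4)` yields 0, 2**125, 2**126, and 3 * 2**125; a second datacenter of the same size would use `balanced_tokens(4, dc_index=1)`.
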
Replication (1/4)
  • High availability and durability
  • replication_factor: N
Replication (2/4)
  Strategy
    • Rack Unaware
    • Rack Aware
    • Datacenter Aware
    • …
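
For the Rack Unaware strategy, the Cassandra paper describes choosing the replicas as the N-1 successors of the first replica on the ring. A minimal sketch, with a hypothetical function name and a plain list standing in for the ring:

```python
def rack_unaware_replicas(ring_nodes, first_replica_index, n):
    """Return the first replica followed by its N-1 successors.

    ring_nodes is assumed to be the list of nodes in token order.
    """
    count = min(n, len(ring_nodes))
    return [
        ring_nodes[(first_replica_index + i) % len(ring_nodes)]
        for i in range(count)
    ]
```

With nodes A, B, C, D in token order and N = 3, a key whose first replica is D is also stored on A and B.
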
Replication (3/4)
  • The Cassandra system elects a leader amongst its nodes using a system called Zookeeper.
  • All nodes, on joining the cluster, contact the leader, who tells them what ranges they are replicas for.
  • The leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring.
  • The metadata about the ranges a node is responsible for is cached locally at each node and, in a fault-tolerant manner, inside Zookeeper.
  • This way a node that crashes and comes back up knows what ranges it was responsible for.
Replication (4/4)
  Cassandra provides durability guarantees in the presence of node failures and network partitions by relaxing the quorum requirements.
Data Versioning
  Vector clocks
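
A vector clock can be modelled as a map from node name to counter. The sketch below shows the comparison that decides whether one version supersedes another or the two are concurrent; the function names and the dict representation are illustrative assumptions.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node_name: counter} maps."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version descends from the other
    if a_ahead:
        return "after"       # a supersedes b
    if b_ahead:
        return "before"      # b supersedes a
    return "equal"


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: the element-wise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```
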
Consistency
  W + R > N
  (N replicas per key, W replicas that must acknowledge a write, R replicas that must answer a read)
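
Why W + R > N matters can be checked by brute force: when the inequality holds, every possible write set shares at least one replica with every possible read set. The function name below is hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-subset of the N replicas intersects every R-subset."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N = 3, the choice W = 2, R = 2 always overlaps, while W = 1, R = 1 lets a read miss the only replica that took the write.
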
Consistency
  put():
    • The coordinator generates the vector clock for the new version and writes the new version locally.
    • The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes.
    • If at least W-1 nodes respond, the write is considered successful.
  get():
    • The coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and waits for R responses before returning the result to the client.
    • If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated.
    • The divergent versions are then reconciled, and the reconciled version superseding the current versions is written back.
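
The put() and get() flow can be sketched with in-process replicas. This is a simplified model under stated assumptions: the `Replica` class and function names are hypothetical, writes are applied synchronously, and get() consults every reachable replica instead of stopping after the first R responses.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has, and the two differ."""
    return a != b and all(a.get(node, 0) >= count for node, count in b.items())


class Replica:
    def __init__(self, up=True):
        self.up = up
        self.versions = {}  # key -> list of (clock, value)


def put(replicas, w, key, clock, value):
    """Write to every reachable replica; succeed if at least W accepted.

    The "W-1 other nodes" on the slide plus the coordinator's own local
    write add up to W acknowledgements in total.
    """
    acks = 0
    for rep in replicas:
        if rep.up:
            kept = [
                (c, v) for c, v in rep.versions.get(key, [])
                if not dominates(clock, c) and c != clock
            ]
            rep.versions[key] = kept + [(clock, value)]
            acks += 1
    return acks >= w


def get(replicas, r, key):
    """Return every version that no other gathered version supersedes."""
    responses = [rep.versions.get(key, []) for rep in replicas if rep.up]
    if len(responses) < r:
        raise RuntimeError("not enough replicas responded")
    seen = []
    for versions in responses:
        for version in versions:
            if version not in seen:
                seen.append(version)
    return [(c, v) for c, v in seen if not any(dominates(o, c) for o, _ in seen)]
```

When two writes land on different replicas during a partition, `get` returns both values, leaving reconciliation to the caller.
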
Handling Temporary Failures
  Hinted handoff
    • If node A is temporarily down or unreachable during a write operation, then a replica that would normally have lived on A will now be sent to node D.
    • The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A).
    • Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically.
    • Upon detecting that A has recovered, D will attempt to deliver the replica to A.
    • Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
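
The hinted-handoff steps can be sketched as follows. The `Node` class, the single fallback node, and the tuple layout of a hint are assumptions chosen to keep the example small.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # separate local store: (intended_node_name, key, value)


def write(key, value, preferred, fallback):
    """Write to each preferred node; park the replica on `fallback` if one is down."""
    for node in preferred:
        if node.up:
            node.data[key] = value
        else:
            fallback.hints.append((node.name, key, value))


def deliver_hints(holder, nodes_by_name):
    """Periodic scan: hand hinted replicas back to recovered nodes."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = nodes_by_name[target_name]
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining  # delivered hints are dropped from the local store
```
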
Handling Permanent Failures
  Replica synchronization: anti-entropy
  Goal: detect inconsistencies between replicas faster and minimize the amount of transferred data.
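
The Dynamo paper uses Merkle trees for this kind of replica synchronization: two replicas compare root hashes and descend only into subtrees that differ. A minimal sketch, assuming a power-of-two number of leaves, SHA-256 as the hash, and hypothetical function names:

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Build hash levels bottom-up; leaves is a list of bytes, one per key range.

    Assumes len(leaves) is a power of two. The root is the last level.
    """
    levels = [[_h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(a, b):
    """Indices of leaves that differ, following only mismatching subtrees."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # equal roots: the replicas agree, nothing to transfer
    suspects = [0]
    for level in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[level][child] != b[level][child]
        ]
    return suspects
```

Only the key ranges returned by `differing_leaves` need to be exchanged between the two replicas.
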
Cassandra Consistency For Read
Cassandra Consistency For Write
Cassandra Read Repair
  Cassandra repairs data in two ways:
    • Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.
    • Anti-Entropy: when nodetool repair is run, Cassandra computes a Merkle tree for each range of data on that node and compares it with the versions on other replicas, to catch any out-of-sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly), since computing the Merkle tree is relatively expensive in disk I/O and CPU: it scans ALL the data on the machine (but it is very network efficient).
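
Read repair can be sketched with replicas modelled as plain dicts. The function name and the `(timestamp, value)` cell layout are assumptions, and the repair here happens inline rather than in the background.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica and push the newest cell to stale ones.

    replicas: list of dicts mapping key -> (timestamp, value).
    """
    answers = [(rep.get(key), rep) for rep in replicas]
    present = [cell for cell, _ in answers if cell is not None]
    if not present:
        return None
    newest = max(present, key=lambda cell: cell[0])
    for cell, rep in answers:
        if cell is None or cell[0] < newest[0]:
            rep[key] = newest  # bring the out-of-date replica up to date
    return newest[1]
```
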
Bootstrapping
  New node position:
    • specify an InitialToken, or
    • pick a Token that will give it half the keys from the node with the most disk space used
  Notes:
    1. You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap.
    2. Relating to point 1, one can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster.
    3. As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node.
    4. When bootstrapping a new node, existing nodes have to divide the key space before beginning replication.
    5. During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool.
    6. Bootstrap can take many hours when a lot of data is involved.
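
Automatic token picking can be approximated as bisecting the token range of the most loaded node. This is a sketch under assumptions: the function name is hypothetical, and the midpoint of the range is used as a stand-in for "half the keys", which only holds when keys are spread evenly over the range.

```python
RING_SIZE = 2 ** 127


def pick_bootstrap_token(tokens, load_by_token):
    """Pick a token that splits the range owned by the most loaded node.

    tokens: tokens of the existing nodes.
    load_by_token: token -> disk space used by the node holding that token.
    """
    ordered = sorted(tokens)
    heaviest = max(ordered, key=lambda t: load_by_token[t])
    idx = ordered.index(heaviest)
    predecessor = ordered[idx - 1]  # index -1 wraps around to the last token
    # A node owns the range (predecessor, own token]; a lone node owns the ring.
    span = (heaviest - predecessor) % RING_SIZE or RING_SIZE
    return (predecessor + span // 2) % RING_SIZE
```
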
Moving or Removing Nodes
  Remove nodes
    • Live node: nodetool decommission (the data will stream from the decommissioned node)
    • Dead node: nodetool removetoken (the data will stream from the remaining replicas)
  Move nodes
    • nodetool move: decommission + bootstrap
  Load balancing
    • If you add nodes to your cluster, your ring will be unbalanced, and the only way to get perfect balance is to compute new tokens for every node and assign them to each node manually using the nodetool move command.
Membership
  Scuttlebutt
    • Based on Gossip
    • Efficient CPU utilization
    • Efficient utilization of the gossip channel
    • Anti-entropy gossip
  Paper: Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
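
The core idea of version-based anti-entropy gossip is that peers exchange digests of the highest version they have seen per node and then send only newer entries. The sketch below is a heavily simplified illustration, not Scuttlebutt itself: the class and method names are hypothetical, flow control is omitted, and versions are assumed to grow monotonically per node.

```python
class GossipState:
    def __init__(self):
        self.entries = {}  # node -> {key: (value, version)}

    def digest(self):
        """Highest version known for each node."""
        return {
            node: max((ver for _, ver in kv.values()), default=0)
            for node, kv in self.entries.items()
        }

    def deltas_for(self, peer_digest):
        """Entries the peer has not seen yet, judged by its digest."""
        out = []
        for node, kv in self.entries.items():
            known = peer_digest.get(node, 0)
            out.extend(
                (node, key, value, ver)
                for key, (value, ver) in kv.items()
                if ver > known
            )
        return out

    def apply(self, deltas):
        for node, key, value, version in deltas:
            current = self.entries.setdefault(node, {}).get(key)
            if current is None or version > current[1]:
                self.entries[node][key] = (value, version)


def gossip_round(a, b):
    """One exchange: swap digests, then send each side only what it lacks."""
    a_digest, b_digest = a.digest(), b.digest()
    b.apply(a.deltas_for(b_digest))
    a.apply(b.deltas_for(a_digest))
```
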
Failure Detection
  The φ Accrual Failure Detector
  Idea: the failure detection module doesn't emit a Boolean value stating that a node is up or down. Instead, the failure detection module emits a value which represents a suspicion level for each of the monitored nodes.
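
A suspicion level can be computed from heartbeat inter-arrival times. The sketch below assumes an exponential distribution of inter-arrival times, so that the probability of the next heartbeat arriving later than t is exp(-t / mean) and φ = -log10 of that probability; the class name and the unbounded interval list are simplifications for illustration.

```python
import math


class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []       # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer a heartbeat is overdue."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

The application then compares φ against a threshold of its choosing instead of receiving a fixed up/down verdict.
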
Local Persistence (1/4)
  Write operation:
    1. Write into a commit log.
    2. Apply the update to an in-memory data structure.
    3. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk.
  Read operation:
    1. Query the in-memory data structure.
    2. Look into the files on disk in order from newest to oldest.
    3. Combine the results.
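
The write and read paths can be sketched in memory. This is a toy model under assumptions: `ToyStore` is a hypothetical name, the flush threshold counts keys only, "files on disk" are tuples held in a list, and a read returns the newest whole value instead of combining columns.

```python
class ToyStore:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of unflushed writes
        self.memtable = {}     # in-memory structure
        self.sstables = []     # immutable sorted tables, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log first
        self.memtable[key] = value             # 2. then the in-memory structure
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                       # 3. dump to "disk" past the threshold

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:               # 1. in-memory structure
            return self.memtable[key]
        for table in reversed(self.sstables):  # 2. files, newest to oldest
            for k, v in table:
                if k == key:
                    return v
        return None
```
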
Local Persistence (2/4)
  Commit log
    • All writes into the commit log are sequential
    • Fixed size
    • Create/delete
    • Durability and recoverability
Local Persistence (3/4)
  Memtable
    • Per column family
    • A write-back cache of data rows that can be looked up by key
    • Sorted by key
Local Persistence (4/4)
  SSTable
    Flushing
      • Once flushed, SSTable files are immutable; no further writes may be done.
    Compaction
      • Merging multiple old SSTable files into a single new one.
      • Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random I/O.
      • Once compaction is finished, the old SSTable files may be deleted.
      • Discard tombstones.
    Index
      • All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file.
      • In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory.
      • In order to prevent scanning of every column on disk, we maintain column indices which allow us to jump to the right chunk on disk for column retrieval.
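
Compaction and the bloom filter can each be sketched in a few lines. Both parts are illustrations under assumptions: `TOMBSTONE`, `compact`, and `BloomFilter` are hypothetical names, tables are in-memory lists, and the filter derives its bit positions from SHA-256 with arbitrary default sizes.

```python
import hashlib
import heapq

TOMBSTONE = object()  # marker for a deleted key


def compact(sstables):
    """Merge sorted tables (oldest first) in one sequential pass.

    Each table is a list of (key, value) sorted by key. The newest value
    per key wins, and tombstones are discarded from the output.
    """
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    out, last_key = [], object()
    # For equal keys, the entry from the newest table sorts first.
    for key, _, value in heapq.merge(*streams, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))
    return out


class BloomFilter:
    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means the key is definitely absent; True means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A reader would consult `might_contain` before opening a data file and skip the file whenever it returns False.
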
Facebook inbox search
  Key: userN
Reference
  • http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  • http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
  • http://wiki.apache.org/cassandra/FrontPage

Dynamo cassandra
