This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.
C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves Inconsistent Data by Jason Brown
1. When Bad Things
Happen to Good Data:
Understanding Anti-Entropy in
Cassandra
Jason Brown
@jasobrown jasedbrown@gmail.com
2. About me
• Senior Software Engineer @ Netflix
• Apache Cassandra committer
• E-Commerce Architect, Major League
Baseball Advanced Media
• Wireless developer (J2ME and BREW)
4. Inconsistencies creep in
• Node is down
• Network partition
• Dropped mutations
• Process crash before commit log flush
• File corruption
Cassandra trades C for AP
5. Anti-Entropy Overview
• write time
o tunable consistency
o atomic batches
o hinted handoff
• read time
o consistent reads
o read repair
• maintenance time
o node repair
7. Cassandra Writes Basics
• determine all replica nodes in all DCs
• send to replicas in local DC
• send to one replica node in each remote DC,
o it will forward to its peers
• all respond back to original coordinator
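The fan-out described above can be sketched as follows (a simplified model; the function name and the `(node, dc)` pair format are illustrative, not Cassandra's internals):

```python
from collections import defaultdict

def plan_write_targets(replicas, local_dc):
    """Given (node, dc) pairs, send directly to local-DC replicas and
    pick a single forwarding replica per remote DC."""
    by_dc = defaultdict(list)
    for node, dc in replicas:
        by_dc[dc].append(node)

    direct = list(by_dc.pop(local_dc, []))  # local replicas get the write directly
    # one forwarder per remote DC; it relays the mutation to its DC-local peers
    forwarders = {dc: nodes[0] for dc, nodes in by_dc.items()}
    return direct, forwarders

replicas = [("n1", "us-east"), ("n2", "us-east"),
            ("n3", "eu-west"), ("n4", "eu-west")]
direct, forwarders = plan_write_targets(replicas, "us-east")
# direct: both us-east replicas; forwarders: one node for eu-west
```

All replicas, local and remote, still ack back to the original coordinator; the forwarding only saves cross-DC bandwidth.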
10. Writes - Tunable consistency
Coordinator blocks until the specified count of
replicas respond
• consistency level
o ALL
o EACH_QUORUM
o LOCAL_QUORUM
o ONE / TWO / THREE
o ANY
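The replica counts these levels translate to can be sketched as follows (a simplified model; the function and `rf_by_dc` mapping are illustrative, not Cassandra's API — quorum is floor(RF/2) + 1):

```python
def blocked_for(cl, rf_by_dc, local_dc):
    """How many replica acks the coordinator waits for, per consistency level."""
    total_rf = sum(rf_by_dc.values())
    quorum = lambda rf: rf // 2 + 1  # floor(RF/2) + 1
    if cl == "ALL":
        return total_rf
    if cl == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if cl == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if cl in ("ONE", "TWO", "THREE"):
        return {"ONE": 1, "TWO": 2, "THREE": 3}[cl]
    if cl == "ANY":
        return 1  # even a stored hint can satisfy ANY
    raise ValueError(cl)

rf = {"us-east": 3, "eu-west": 3}
# LOCAL_QUORUM -> 2, EACH_QUORUM -> 4, ALL -> 6
```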
11. Hinted handoff
Save a copy of the write for down nodes, and
replay later
hint = target replica + mutation data
12. Hinted handoff - storing
• on coordinator, store a hint for any nodes not
currently 'up'
• if a replica doesn't respond within
write_request_timeout_in_ms, store a hint
• max_hint_window_in_ms - maximum
amount of time a dead host will have hints
generated.
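The two conditions above can be sketched as one decision (a simplified model; the function and its parameters are illustrative — the default window shown assumes the common 3-hour setting):

```python
def should_store_hint(replica_alive, responded_in_time, down_since_ms,
                      max_hint_window_ms=3 * 60 * 60 * 1000):
    """A hint is stored when the replica is down (but still within the hint
    window) or when it failed to ack within write_request_timeout_in_ms."""
    if not replica_alive:
        # stop generating hints for hosts dead longer than the window
        return down_since_ms < max_hint_window_ms
    return not responded_in_time
```

This is why a long-dead node gets no hints at all: replaying hours of backlog would be worse than simply running repair when it returns.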
13. Hinted handoff - replay
• try to send hints to nodes
• runs every ten minutes
• multithreaded (as of 1.2)
• throttleable (KB per second)
17. Atomic Batches
• coordinator stores incoming mutation to two
peers in same DC
o deletes from peers on successful completion
• peers will replay the batch if not deleted
o runs every 60 seconds
• as of 1.2, all mutations use atomic batches
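The store/delete/replay cycle can be sketched as follows (an illustrative model of a batchlog peer; the class and method names are hypothetical, not Cassandra's internals):

```python
class BatchlogPeer:
    """Coordinator stores a batch here before applying it, deletes it on
    success; the peer replays anything left after the replay interval."""
    def __init__(self):
        self.batches = {}  # batch_id -> (mutations, written_at seconds)

    def store(self, batch_id, mutations, written_at):
        self.batches[batch_id] = (mutations, written_at)

    def delete(self, batch_id):
        # called by the coordinator once the batch completed successfully
        self.batches.pop(batch_id, None)

    def replay_due(self, now, interval=60.0):
        # batches still present after the interval get replayed
        return [m for m, t in self.batches.values() if now - t >= interval]
```

If the coordinator dies mid-batch, the delete never happens, so the peer's next 60-second sweep replays the mutations, which is what makes the batch atomic from the client's perspective.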
19. Cassandra Reads - setup
• determine endpoints to invoke
o consistency level vs. read repair
• first data node sends back the full data set,
other nodes return only a digest
• wait for the CL number of nodes to respond
21. Consistent reads
• compare the digests of returned data sets
• if any mismatches, send request again to
same CL data nodes.
o this time no digests, full data set
• compare the full data sets, send updates to
out of date replicas
• block until those fixes are responded to
• return data to caller
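The digest-compare-then-merge flow above can be sketched as follows (a simplified model; the digest function and the `(value, timestamp)` row format are illustrative stand-ins for Cassandra's MD5 digests and cell timestamps):

```python
import hashlib

def digest(rows):
    # stand-in for the MD5 digest of a result set
    return hashlib.md5(repr(sorted(rows.items())).encode()).hexdigest()

def consistent_read(replica_data, cl_count):
    """Compare digests from the first cl_count replicas; on mismatch, use
    the full data sets, merge newest-timestamp-wins, and report which
    replicas need the merged result written back."""
    contacted = replica_data[:cl_count]
    digests = {digest(rows) for _, rows in contacted}
    if len(digests) == 1:
        return contacted[0][1], []          # all agree, nothing to fix

    merged = {}
    for _, rows in contacted:               # second round: full data sets
        for key, (value, ts) in rows.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)   # newest timestamp wins
    stale = [node for node, rows in contacted if rows != merged]
    return merged, stale

replica_data = [("n1", {"k": ("v2", 2)}), ("n2", {"k": ("v1", 1)})]
merged, stale = consistent_read(replica_data, 2)
# n2 holds the older value, so it is the replica that gets repaired
```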
22. Read Repair
• synchronizes the client-requested data
amongst all replicas
• piggy-backs on normal reads, but waits for
all replicas to respond asynchronously
• then, just like consistent reads, compares
the digests, and fixes them if needed
24. Read Repair - configuration
• setting per column family
• percentage of all calls to CF
• Local DC vs. Global chance
25. Read repair fixes data that is actually
requested,
... but what about data that isn't requested?
26. Node Repair - introduction
• repairs inconsistencies across all replicas for
a given range
• nodetool repair
o repairs the ranges the node contains
o one or more column families (within the same
keyspace)
o can choose local datacenter only (c* 1.2)
27. Node Repair - cautions
• should be part of standard operations
maintenance for C*, especially if you delete data
o ensures tombstones are propagated, and avoids
resurrected data
• repair is IO and CPU intensive
28. Node Repair - details 1
• determine peer nodes with matching ranges
• triggers a major (validation) compaction on
peer nodes
o read and generate hash for every row in CF
o add result to a Merkle Tree
o return tree to initiator
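The per-row hashing and tree building can be sketched as follows (a simplified bottom-up binary Merkle tree; the function names and MD5 choice are illustrative, not Cassandra's exact implementation):

```python
import hashlib

def leaf_hash(row_key, row_value):
    # validation compaction: read each row, hash it
    return hashlib.md5(f"{row_key}:{row_value}".encode()).digest()

def build_merkle_tree(hashes):
    """Hash adjacent pairs level by level; the root ends up at
    levels[-1][0], so two replicas agree iff their roots match."""
    levels = [hashes]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            pair = prev[i] + (prev[i + 1] if i + 1 < len(prev) else b"")
            nxt.append(hashlib.md5(pair).digest())
        levels.append(nxt)
    return levels

rows = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]
tree = build_merkle_tree([leaf_hash(k, v) for k, v in rows])
```

The point of the tree is compression: the initiator receives a small tree per peer instead of every row hash, yet can still localize exactly which subranges disagree.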
29. Node Repair - details 2
• initiator awaits trees from all nodes
• compares each tree to every other tree
• if any differences exist, the two nodes
exchange the conflicting ranges
o these ranges get written out as new, local sstables
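Finding the ranges to exchange amounts to comparing leaf hashes range by range (a simplified sketch; real repair walks the trees top-down, and the names here are illustrative):

```python
def conflicting_ranges(leaves_a, leaves_b, ranges):
    """Given per-token-range leaf hashes from two replicas' Merkle trees,
    return only the ranges whose hashes disagree; those are streamed
    between the nodes and written out as new local sstables."""
    return [rng for rng, ha, hb in zip(ranges, leaves_a, leaves_b)
            if ha != hb]

ranges = [(0, 100), (100, 200), (200, 300)]
a = [b"h1", b"h2", b"h3"]
b = [b"h1", b"xx", b"h3"]
# only the (100, 200) range differs, so only it is exchanged
```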
35. Anti-Entropy wrap-up
• CAP Theorem lives, tradeoffs must be made
• C* contains processes to make diverging
data sets consistent
• Tunable controls exist at write and read
times, as well as on demand