3. Who am I?
I’m Mark Miller
I’m a Lucene junkie (2006)
I’m a Lucene committer (2008)
And a Solr committer (2009)
And a member of the ASF (2011)
And a former Lucene PMC Chair (2014-2015)
I’ve done a lot of core Solr work and co-created SolrCloud
4. This talk is about how SolrCloud tries to protect your data.
And about some things that should change.
6. Failure Cases (shards of an index can be treated independently)
• A Leader dies (loses its ZK connection)
• A Replica dies, or an update from the leader to a replica fails
• A Replica is partitioned (e.g. it can talk to ZK, but not to the shard leader)
[diagram: replica (R), leader (L), ZooKeeper (ZK)]
7. Replica Recovery
• A replica will recover from the leader on startup.
• A replica will recover if an update from the leader to the replica fails.
• A replica may recover from the leader in the leader election sync-up dance.
8. Replica Recovery Dance
• Start buffering updates from the leader
• Publish RECOVERING to ZK
• Wait for the leader to see the RECOVERING state
• On the first recovery try, PeerSync
• Otherwise, full index replication:
  • Commit on the leader
  • Replicate the index
  • Replay the buffered documents
Implementation: RecoveryStrategy (the sequence is sketched below)
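A minimal compilable sketch of that sequence, assuming hypothetical helper names rather than Solr's actual RecoveryStrategy internals:

// Illustrative sketch only; these method names are hypothetical,
// not Solr's real RecoveryStrategy API.
public class RecoveryDanceSketch {

    void recover(boolean firstTry) {
        startBufferingUpdatesFromLeader();   // incoming updates land in the tlog buffer
        publishState("RECOVERING");          // announce the state in ZK
        waitForLeaderToSeeRecoveringState();

        // Cheap path first: on the first attempt, try to PeerSync the
        // missing updates from the leader.
        if (firstTry && peerSyncWithLeader()) {
            publishState("ACTIVE");
            return;
        }

        // Fallback: full index replication.
        commitOnLeader();            // flush the leader's index so it can be copied
        replicateIndexFromLeader();  // pull the index files
        replayBufferedUpdates();     // apply updates buffered during the copy
        publishState("ACTIVE");
    }

    // Stubs so the sketch compiles; the real work happens inside Solr.
    void startBufferingUpdatesFromLeader() {}
    void publishState(String state) {}
    void waitForLeaderToSeeRecoveringState() {}
    boolean peerSyncWithLeader() { return false; }
    void commitOnLeader() {}
    void replicateIndexFromLeader() {}
    void replayBufferedUpdates() {}
}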
9. A Replica is Partitioned
• In the early days we half punted on this.
• Now, when a leader cannot reach a replica, it will put it into LIR (leader-initiated recovery) in ZK.
• A replica in LIR will realize that it must recover before clearing its LIR status.
• We worked through some bugs, but this is very solid now. (The flow is sketched below.)
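A sketch of the LIR idea, assuming hypothetical names; the ZK path is only indicative of the real layout:

// Illustrative sketch; not Solr's actual classes (the real logic lives
// around the distributed update path and ZkController).
public class LirSketch {

    // Leader side: an update forwarded to a replica failed.
    void onForwardFailure(Zk zk, String replica) {
        // Mark the replica as needing recovery in ZK. Keeping the mark in
        // ZK (not in the leader's memory) means it survives leader changes.
        zk.write(lirPath(replica), "down");
    }

    // Replica side: before publishing itself ACTIVE, a replica must check
    // its LIR status and actually recover before it may clear it.
    void beforePublishActive(Zk zk, String replica) {
        if ("down".equals(zk.read(lirPath(replica)))) {
            recoverFromLeader();                  // the full recovery dance above
            zk.write(lirPath(replica), "active"); // only now clear LIR
        }
    }

    String lirPath(String replica) {
        // Indicative only: a per-replica node under the collection's state.
        return "/collections/c1/leader_initiated_recovery/shard1/" + replica;
    }

    void recoverFromLeader() {}

    // Minimal stand-in for a ZooKeeper client.
    interface Zk {
        void write(String path, String value);
        String read(String path);
    }
}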
10. Leader Recovery
• The ‘best effort’ leader recovery dance:
• If it’s after startup and the replica’s last published state is not ACTIVE, it can’t be leader.
• Otherwise, try to peer sync with the shard.
• If that succeeds, try to peer sync from the replicas to the leader.
• If any of those syncs fails, ask the replicas to recover from the leader.
Implementation: SyncStrategy / ElectionContext (sketched below)
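A compilable sketch of that dance, under the same caveat: the names are illustrative, not Solr's actual SyncStrategy / ElectionContext code:

import java.util.List;

public class LeaderSyncSketch {

    enum State { ACTIVE, RECOVERING, DOWN }

    boolean tryToBecomeLeader(List<String> replicas, State lastPublished, boolean afterStartup) {
        // A replica whose last published state was not ACTIVE may have
        // missed updates, so after startup it must not take leadership.
        if (afterStartup && lastPublished != State.ACTIVE) {
            return false;
        }
        // Best effort: peer sync with the rest of the shard.
        if (!peerSyncWithShard(replicas)) {
            return false; // stand down and let the election move on
        }
        // Then sync the replicas up against the would-be leader; any replica
        // that fails is asked to do a full recovery from the new leader.
        for (String replica : replicas) {
            if (!peerSyncToMe(replica)) {
                requestRecovery(replica);
            }
        }
        return true;
    }

    // Stubs so the sketch compiles.
    boolean peerSyncWithShard(List<String> replicas) { return true; }
    boolean peerSyncToMe(String replica) { return true; }
    void requestRecovery(String replica) {}
}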
11. Leader Election Forward Progress Stall…
• Each replica decides for itself whether it thinks it should be leader.
• Everyone may think they are unfit.
• Only replicas that last published ACTIVE will attempt to become leader after the first election. (See the check sketched below.)
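Condensing those two bullets into a hypothetical check (not Solr's actual code): if this returns false for every replica in a shard, nobody volunteers and the election stalls.

import java.util.List;

public class LeaderFitnessSketch {

    enum State { ACTIVE, RECOVERING, DOWN }

    // After the first election, only a replica whose last published state
    // was ACTIVE volunteers for leadership.
    static boolean shouldTryToBeLeader(State lastPublished, boolean firstElection) {
        return firstElection || lastPublished == State.ACTIVE;
    }

    // The shard makes progress only if at least one replica qualifies.
    static boolean shardCanElect(List<State> lastPublishedStates, boolean firstElection) {
        return lastPublishedStates.stream()
                .anyMatch(s -> shouldTryToBeLeader(s, firstElection));
    }
}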
12. Leader Election Forward Progress Stall…
• While rare, if all replicas in a shard lose their connection to ZK at the same time, no replica will become leader without intervention.
• There is a manual API to intervene (example below), but this should be done automatically.
• In practice, this tends to happen for reasons that can be ‘tuned’ out of a deployment.
• Still needs to be improved.
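The manual escape hatch is the FORCELEADER action of the Collections API. A sketch of invoking it through SolrJ's generic request; the host, collection, and shard names are placeholders:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ForceLeaderExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "FORCELEADER"); // force a leader for a stuck shard
            params.set("collection", "mycollection");
            params.set("shard", "shard1");
            System.out.println(client.request(
                new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params)));
        }
    }
}

FORCELEADER is a last resort for a shard stuck without a leader, which is exactly the stall described above; as the slide says, this really ought to happen automatically.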
13. User chooses durability requirements
• You can specify how many replicas you want to see success from to consider an update successful: the minRf param.
• The update won’t fail based on that criterion, though; Solr simply flags you in the response. (See the sketch below.)
• If your replication factor is not achieved, that also does not mean the update is rolled back.
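A SolrJ sketch of using this; the HTTP parameter is min_rf, and the achieved factor is reported back in the response. The ZK host and collection name here are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost("localhost:2181").build()) {
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            req.setParam("min_rf", "2"); // ask to see success from at least 2 replicas

            UpdateResponse rsp = req.process(client);

            // The update is NOT failed or rolled back if it falls short; the
            // achieved factor just comes back in the response for the client
            // to act on (retry, log, reconcile later, ...).
            int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
            if (achieved < 2) {
                System.out.println("update reached fewer replicas than requested: rf=" + achieved);
            }
        }
    }
}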
14. User chooses durability requirements
• If we improve some of this…
• We can stop trying so hard.
• And put it on the user to specify a replication factor that controls how ‘safe’ updates are.
16. Handling Cluster Shutdown / Startup
• What if an old replica returns?
• How do we ensure every replica participates in the election?
• What if no replica thinks it should be leader?
• Staggered shutdowns?
• Explicit cluster commands might help.