3. Who am I?
I’m Mark Miller
I’m a Lucene junkie (2006)
I’m a Lucene committer (2008)
And a Solr committer (2009)
And a member of the ASF (2011)
And a former Lucene PMC Chair (2014-2015)
I’ve done a lot of core Solr work and co-created SolrCloud
4. This talk is about how SolrCloud tries to protect your data.
And about some things that should change.
6. Failure Cases (shards of an index can be treated independently)
• A Leader dies (loses its ZK connection)
• A Replica dies, or an update from the leader to a replica fails
• A Replica is partitioned (e.g. it can talk to ZK, but not to the shard leader)
[diagram: replica (R), leader (L), ZooKeeper (ZK)]
7. Replica Recovery
• A replica will recover from the leader on startup.
• A replica will recover if an update from the leader to the replica fails.
• A replica may recover from the leader in the leader election sync-up dance.
8. Replica Recovery Dance
• Start buffering updates from the leader
• Publish RECOVERING to ZK
• Wait for the leader to see the RECOVERING state
• On the first recovery try, PeerSync
• Otherwise, full index replication:
  • Commit on the leader
  • Replicate the index
  • Replay the buffered documents
Implementation: RecoveryStrategy (the sequence is sketched below)
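A minimal compilable sketch of that sequence, assuming hypothetical helper names rather than Solr's actual RecoveryStrategy internals:

// Illustrative sketch only; these method names are hypothetical,
// not Solr's real RecoveryStrategy API.
public class RecoveryDanceSketch {

    void recover(boolean firstTry) {
        startBufferingUpdatesFromLeader();   // incoming updates land in the tlog buffer
        publishState("RECOVERING");          // announce the state in ZK
        waitForLeaderToSeeRecoveringState();

        // Cheap path first: on the first attempt, try to PeerSync the
        // missing updates from the leader.
        if (firstTry && peerSyncWithLeader()) {
            publishState("ACTIVE");
            return;
        }

        // Fallback: full index replication.
        commitOnLeader();            // flush the leader's index so it can be copied
        replicateIndexFromLeader();  // pull the index files
        replayBufferedUpdates();     // apply updates buffered during the copy
        publishState("ACTIVE");
    }

    // Stubs so the sketch compiles; the real work happens inside Solr.
    void startBufferingUpdatesFromLeader() {}
    void publishState(String state) {}
    void waitForLeaderToSeeRecoveringState() {}
    boolean peerSyncWithLeader() { return false; }
    void commitOnLeader() {}
    void replicateIndexFromLeader() {}
    void replayBufferedUpdates() {}
}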
9. A Replica is Partitioned
• In the early days we half punted on this.
• Now, when a leader cannot reach a replica, it will put it into LIR (leader-initiated recovery) in ZK.
• A replica in LIR will realize that it must recover before clearing its LIR status.
• We worked through some bugs, but this is very solid now. (The flow is sketched below.)
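A sketch of the LIR idea, assuming hypothetical names; the ZK path is only indicative of the real layout:

// Illustrative sketch; not Solr's actual classes (the real logic lives
// around the distributed update path and ZkController).
public class LirSketch {

    // Leader side: an update forwarded to a replica failed.
    void onForwardFailure(Zk zk, String replica) {
        // Mark the replica as needing recovery in ZK. Keeping the mark in
        // ZK (not in the leader's memory) means it survives leader changes.
        zk.write(lirPath(replica), "down");
    }

    // Replica side: before publishing itself ACTIVE, a replica must check
    // its LIR status and actually recover before it may clear it.
    void beforePublishActive(Zk zk, String replica) {
        if ("down".equals(zk.read(lirPath(replica)))) {
            recoverFromLeader();                  // the full recovery dance above
            zk.write(lirPath(replica), "active"); // only now clear LIR
        }
    }

    String lirPath(String replica) {
        // Indicative only: a per-replica node under the collection's state.
        return "/collections/c1/leader_initiated_recovery/shard1/" + replica;
    }

    void recoverFromLeader() {}

    // Minimal stand-in for a ZooKeeper client.
    interface Zk {
        void write(String path, String value);
        String read(String path);
    }
}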
10. Leader Recovery
• The ‘best effort’ leader recovery dance:
• If it’s after startup and the replica’s last published state is not ACTIVE, it can’t be leader.
• Otherwise, try to peer sync with the shard.
• If that succeeds, try to peer sync from the replicas to the leader.
• If any of those syncs fails, ask the replicas to recover from the leader.
Implementation: SyncStrategy / ElectionContext (sketched below)
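A compilable sketch of that dance, under the same caveat: the names are illustrative, not Solr's actual SyncStrategy / ElectionContext code:

import java.util.List;

public class LeaderSyncSketch {

    enum State { ACTIVE, RECOVERING, DOWN }

    boolean tryToBecomeLeader(List<String> replicas, State lastPublished, boolean afterStartup) {
        // A replica whose last published state was not ACTIVE may have
        // missed updates, so after startup it must not take leadership.
        if (afterStartup && lastPublished != State.ACTIVE) {
            return false;
        }
        // Best effort: peer sync with the rest of the shard.
        if (!peerSyncWithShard(replicas)) {
            return false; // stand down and let the election move on
        }
        // Then sync the replicas up against the would-be leader; any replica
        // that fails is asked to do a full recovery from the new leader.
        for (String replica : replicas) {
            if (!peerSyncToMe(replica)) {
                requestRecovery(replica);
            }
        }
        return true;
    }

    // Stubs so the sketch compiles.
    boolean peerSyncWithShard(List<String> replicas) { return true; }
    boolean peerSyncToMe(String replica) { return true; }
    void requestRecovery(String replica) {}
}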
11. Leader Election Forward Progress Stall…
• Each replica decides for itself whether it thinks it should be leader.
• Everyone may think they are unfit.
• Only replicas that last published ACTIVE will attempt to become leader after the first election. (See the check sketched below.)
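Condensing those two bullets into a hypothetical check (not Solr's actual code): if this returns false for every replica in a shard, nobody volunteers and the election stalls.

import java.util.List;

public class LeaderFitnessSketch {

    enum State { ACTIVE, RECOVERING, DOWN }

    // After the first election, only a replica whose last published state
    // was ACTIVE volunteers for leadership.
    static boolean shouldTryToBeLeader(State lastPublished, boolean firstElection) {
        return firstElection || lastPublished == State.ACTIVE;
    }

    // The shard makes progress only if at least one replica qualifies.
    static boolean shardCanElect(List<State> lastPublishedStates, boolean firstElection) {
        return lastPublishedStates.stream()
                .anyMatch(s -> shouldTryToBeLeader(s, firstElection));
    }
}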
12. Leader Election Forward Progress Stall…
• While rare, if all replicas in a shard lose their connection to ZK at the same time, no replica will become leader without intervention.
• There is a manual API to intervene (example below), but this should be done automatically.
• In practice, this tends to happen for reasons that can be ‘tuned’ out of a deployment.
• Still needs to be improved.
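The manual escape hatch is the FORCELEADER action of the Collections API. A sketch of invoking it through SolrJ's generic request; the host, collection, and shard names are placeholders:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ForceLeaderExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "FORCELEADER"); // force a leader for a stuck shard
            params.set("collection", "mycollection");
            params.set("shard", "shard1");
            System.out.println(client.request(
                new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params)));
        }
    }
}

FORCELEADER is a last resort for a shard stuck without a leader, which is exactly the stall described above; as the slide says, this really ought to happen automatically.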
13. User chooses durability requirements
• You can specify how many replicas you want to see success from to consider an update successful: the minRf param.
• The update won’t fail based on that criterion, though; Solr simply flags you in the response. (See the sketch below.)
• If your replication factor is not achieved, that also does not mean the update is rolled back.
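A SolrJ sketch of using this; the HTTP parameter is min_rf, and the achieved factor is reported back in the response. The ZK host and collection name here are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder().withZkHost("localhost:2181").build()) {
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            req.setParam("min_rf", "2"); // ask to see success from at least 2 replicas

            UpdateResponse rsp = req.process(client);

            // The update is NOT failed or rolled back if it falls short; the
            // achieved factor just comes back in the response for the client
            // to act on (retry, log, reconcile later, ...).
            int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
            if (achieved < 2) {
                System.out.println("update reached fewer replicas than requested: rf=" + achieved);
            }
        }
    }
}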
14. User chooses durability requirements
• If we improve some of this…
• We can stop trying so hard.
• And put it on the user to specify a replication factor that controls how ‘safe’ updates are.
16. Handling Cluster Shutdown / Startup
• What if an old replica returns?
• How do we ensure every replica participates in the election?
• What if no replica thinks it should be leader?
• Staggered shutdowns?
• Explicit cluster commands might help.