Introduction To SolrCloud 
Varun Thacker
Apache Solr has a huge install base and tremendous momentum
Solr is both established & growing
• The most widely used search solution on the planet
• 8M+ total downloads
• 250,000+ monthly downloads
• You use Solr every day.
• Solr has tens of thousands of applications in production.
• 2500+ open Solr jobs.
Activity Summary 
30 Day summary 
Aug 18 - Sep 17 2014 
• 128 Commits 
• 18 Contributors 
12 Month Summary 
Sep 17, 2013 - Sep 17, 2014 
• 1351 Commits 
• 29 Contributors 
via https://www.openhub.net/p/solr
Solr scalability is unmatched. 
• 10TB+ Index Size 
• 10 Billion+ Documents 
• 100 Million+ Daily Requests
What is Solr? 
• A system built to search text 
• A specialized type of database management 
system 
• A platform to build search applications on 
• Customizable, open source software
Where does Solr fit?
What is SolrCloud? 
A subset of optional features in Solr that enable and
simplify horizontally scaling a search index using
sharding and replication.
Goals 
scalability, performance, high-availability, 
simplicity, and elasticity
Terminology 
• ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine
• Collection: Search index distributed across multiple nodes that share the same configuration
• Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Each document is assigned to exactly one shard per collection using a hash-based document routing strategy
• Replica: A copy of a shard in a collection
• Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper. The Overseer role fails over automatically through leader election.
Collection == Distributed Index 
• A collection is a distributed index defined by: 
• named configuration stored in ZooKeeper 
• number of shards: documents are distributed across N partitions of the index 
• document routing strategy: how documents get assigned to shards 
• replication factor: how many copies of each document in the collection 
• Collections API: 
• curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&collection.configName=myonf"
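The same CREATE request can be assembled programmatically. The sketch below only builds the URL and does not send it; the helper name `collection_create_url` is made up for illustration, while the parameter names and values are the ones shown in the curl command above.

```python
from urllib.parse import urlencode


def collection_create_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL (hypothetical helper, not part of Solr)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    # urlencode escapes values and joins the pairs with "&"
    return f"{base_url}/admin/collections?{urlencode(params)}"


url = collection_create_url("http://localhost:8983/solr", "punemeetup", 2, 2, "myonf")
print(url)
```

Building the query string with `urlencode` avoids the line-wrapping and quoting mistakes that are easy to make when the URL is typed by hand.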
DEMO
Document Routing 
• Each shard covers a hash-range 
• Default: Hash ID into 32-bit integer, map to range 
• leads to balanced (roughly) shards 
• Custom-hashing 
• Tri-level: app!user!doc 
• Implicit: no hash-range set for shards
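The default routing idea (hash the ID into a 32-bit integer, then map it to a range) can be sketched as follows. This is a simplified illustration, not Solr's implementation: `zlib.crc32` stands in for Solr's actual hash function, the ranges are unsigned and evenly split, and the function names are assumptions.

```python
import zlib

HASH_SPACE = 2 ** 32  # number of distinct 32-bit hash values


def shard_ranges(num_shards):
    """Split [0, 2**32) into num_shards contiguous, near-equal inclusive ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash fell outside every range")


ranges = shard_ranges(4)
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", route(doc_id, ranges))
```

Because the ranges are contiguous and cover the whole hash space, every document lands in exactly one shard, and a reasonably uniform hash keeps the shards roughly balanced.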
Replication 
• Why replicate? 
• High-availability 
• Load balancing 
• How does it work in SolrCloud? 
• Near-real-time, not master-slave 
• Leader forwards to replicas in parallel, waits for response 
• Error handling during indexing is tricky
Distributed Indexing 
• Get cluster state from ZK 
• Route document directly to leader (hash on doc ID) 
• Persist document on durable storage (tlog) 
• Forward to healthy replicas 
• Acknowledge write success to the client
Shard Leader 
• Additional responsibilities during indexing only! 
Not a master node 
• Leader is a replica (handles queries) 
• Accepts update requests for the shard 
• Increments the _version_ on the new or updated 
doc 
• Sends updates (in parallel) to all replicas
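The indexing flow and the leader's responsibilities described above can be modeled with a small in-memory simulation. Everything here is a sketch under stated assumptions: the class names are invented, the transaction log is a plain list, `_version_` is a simple counter (Solr's real version values are not sequential integers), and replicas are updated one after another rather than in parallel.

```python
class Replica:
    """In-memory stand-in for one replica of a shard."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []   # stand-in for the durable transaction log
        self.docs = {}   # stand-in for the searchable index

    def apply(self, doc):
        self.tlog.append(doc)        # persist first
        self.docs[doc["id"]] = doc   # then make it available


class ShardLeader(Replica):
    """A replica that additionally accepts updates and forwards them."""

    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas
        self.version = 0  # simplification: a plain counter

    def index(self, doc):
        self.version += 1
        versioned = dict(doc, _version_=self.version)
        self.apply(versioned)
        forwarded = []
        for replica in self.replicas:
            if replica.healthy:      # only healthy replicas receive the update
                replica.apply(versioned)
                forwarded.append(replica.name)
        # Acknowledge only after the local write and the forwarding are done.
        return {"status": "ok", "_version_": self.version, "forwarded_to": forwarded}


r1 = Replica("replica1")
r2 = Replica("replica2", healthy=False)
leader = ShardLeader("leader", [r1, r2])
print(leader.index({"id": "doc1", "title": "Introduction to SolrCloud"}))
```

The leader is itself a `Replica` in this model, which mirrors the point that it still handles queries and only takes on extra work during indexing.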
Distributed Queries 
1. Query client can be ZK aware or just query via a load balancer
2. Client can send query to any node in the cluster
3. Controller node distributes the query to a replica for each shard to identify documents matching query
4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
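The two-phase query described above (first collect matching IDs and scores from every shard, then fetch full fields only for the final page) can be sketched as below. This is an illustrative model only: shards are plain dictionaries, the score is a naive term count, and the function names are assumptions rather than Solr APIs.

```python
def shard_top_ids(shard, term, rows):
    """Phase 1 on one shard: return up to `rows` (score, doc_id) pairs."""
    hits = []
    for doc_id, doc in shard.items():
        score = doc["text"].lower().split().count(term)  # naive term-count score
        if score > 0:
            hits.append((score, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties by ID
    return hits[:rows]


def distributed_query(shards, term, rows):
    """Controller logic: scatter to all shards, merge, then fetch one page."""
    candidates = []
    for shard_index, shard in enumerate(shards):
        for score, doc_id in shard_top_ids(shard, term, rows):
            candidates.append((score, doc_id, shard_index))
    candidates.sort(key=lambda c: (-c[0], c[1]))
    page = candidates[:rows]
    # Phase 2: fetch all fields, but only for documents on the final page.
    return [dict(shards[si][doc_id], id=doc_id, score=score)
            for score, doc_id, si in page]


shards = [
    {"a": {"text": "solr solr cloud"}, "b": {"text": "zookeeper"}},
    {"c": {"text": "solr"}, "d": {"text": "solr solr solr"}},
]
for doc in distributed_query(shards, "solr", 2):
    print(doc["id"], doc["score"])
```

Returning only IDs and scores in the first phase keeps the merge step cheap; full documents are read only for the handful of results that are actually shown.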
Scalability / Stability Highlights 
• All nodes in cluster perform indexing and execute 
queries; no master node 
• Distributed indexing: No SPoF, high throughput via 
direct updates to leaders, automated failover to new 
leader 
• Distributed queries: Add replicas to scale-out qps; 
parallelize complex query computations; fault-tolerance 
• Indexing / queries continue so long as there is 1 healthy 
replica per shard
Zookeeper 
• Is a very good thing ... clusters are a zoo! 
• Centralized configuration management 
• Cluster state management 
• Leader election (shard leader and overseer) 
• Overseer distributed work queue 
• Live Nodes 
• Ephemeral znodes used to signal a server is gone 
• Needs at least 3 nodes for quorum in production
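The reason for three nodes comes from majority arithmetic: a ZooKeeper ensemble stays available only while more than half of its servers are up. The helper names below are made up for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

One or two servers tolerate no failures, three tolerate one, and a fourth server adds nothing over three, which is why odd ensemble sizes are the usual choice.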
Zookeeper: State Management 
• Keep track of live nodes in the /live_nodes znode
• ephemeral nodes 
• ZooKeeper client timeout 
• Collection metadata and replica state in /clusterstate.json 
• Every core has watchers for /live_nodes and /clusterstate.json
• Leader election 
• ZooKeeper sequence number on ephemeral znodes
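Leader election with sequence numbers on ephemeral znodes can be illustrated without a ZooKeeper server. The simulation below is a sketch under simplifying assumptions: the class and method names are invented, a dictionary stands in for the election znodes, and the candidate holding the lowest sequence number is treated as leader.

```python
class ElectionSim:
    """Toy model of an election using ephemeral, sequentially numbered entries."""

    def __init__(self):
        self.next_seq = 0
        self.candidates = {}  # sequence number -> node name

    def join(self, node):
        """Register a candidate; each registration gets the next sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.candidates[seq] = node
        return seq

    def session_expired(self, node):
        """Ephemeral entries vanish when their owner's session ends."""
        self.candidates = {s: n for s, n in self.candidates.items() if n != node}

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        if not self.candidates:
            return None
        return self.candidates[min(self.candidates)]


election = ElectionSim()
for name in ("node1", "node2", "node3"):
    election.join(name)
print(election.leader())           # node1 registered first
election.session_expired("node1")
print(election.leader())           # leadership moves to the next in line
```

A node that rejoins after its session expired receives a new, higher sequence number, so it queues behind the current leader instead of displacing it.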
Other Features/Highlights 
• Near-Real-Time Search: Documents are visible within a second or so after 
being indexed 
• Partial Document Update: Just update the fields you need to change on 
existing documents 
• Optimistic Locking: Ensure updates are applied to the correct version of a 
document 
• Transaction log: Better recoverability; peer-sync between nodes after 
hiccups 
• HTTPS 
• Use HDFS for storing indexes 
• Use MapReduce for building index (SOLR-1301)
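Partial document updates and optimistic locking fit together naturally: a client reads a document's `_version_`, then submits only the changed fields along with the version it expects. The in-memory store below sketches that idea. The class names are assumptions, the version is a simple counter, and Solr's special `_version_` values are not modeled.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version does not match the stored one."""


class VersionedStore:
    def __init__(self):
        self.docs = {}
        self.clock = 0  # simplification: versions are a global counter

    def update(self, doc_id, fields, expected_version=None):
        """Merge `fields` into the stored document, optionally checking the version."""
        current = self.docs.get(doc_id)
        if expected_version is not None:
            current_version = current["_version_"] if current else None
            if current_version != expected_version:
                raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self.clock += 1
        merged = {**(current or {}), **fields}  # partial update: untouched fields survive
        merged["_version_"] = self.clock
        self.docs[doc_id] = merged
        return self.clock


store = VersionedStore()
v1 = store.update("doc1", {"title": "Intro", "views": 1})
v2 = store.update("doc1", {"views": 2}, expected_version=v1)  # succeeds
try:
    store.update("doc1", {"views": 3}, expected_version=v1)   # stale version
except VersionConflict as err:
    print("rejected:", err)
print(store.docs["doc1"])
```

The stale writer is rejected instead of silently overwriting the newer value, and can retry after re-reading the current version.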
Solr on YARN 
• Run the SolrClient application:
• Allocate container to run SolrMaster 
• SolrMaster requests containers to run SolrCloud 
nodes 
• Solr containers allocated across cluster 
• SolrCloud node connects to ZooKeeper
More Information on Solr on YARN 
• https://lucidworks.com/blog/solr-yarn/ 
• https://github.com/LucidWorks/yarn-proto 
• https://issues.apache.org/jira/browse/SOLR-6743
Our users are also pushing the limits 
https://twitter.com/bretthoerner/status/476830302430437376
Up, up and away! 
https://twitter.com/bretthoerner/status/476838275106091008
Connect @ 
https://twitter.com/varunthacker 
http://in.linkedin.com/in/varunthacker 
varun.thacker@lucidworks.com
