Scaling Through Partitioning and Shard Splitting in Solr 4
Bigger, Better, Faster, Stronger, Safer
• Agenda
- Next set of problems
- Shard splitting
- Data partitioning using custom hashing
About me
• Independent consultant specializing in search, machine learning, and big data analytics
• Co-author of Solr In Action from Manning Publications, with Trey Grainger
- Use code 12mp45 for 42% off your MEAP
• Previously, lead architect, developer, and dev-ops engineer for a large-scale Solr cloud implementation at Dachis Group
• Coming soon! Big Data Jumpstart: 2-day intensive hands-on workshop covering Hadoop (Hive, Pig, HDFS), Solr, and Storm
A nice problem to have …
• Solr cloud can scale!
- 18 shards / ~900M docs
- Indexing rate of 6-8k docs / sec
- Millions of updates, deletes, and new docs per day
• Some lessons learned
- Eventually, your shards will outgrow the hardware they are hosted on
- Search is addictive to organizations; expect broader usage and greater query complexity
- Beware of the urge to do online analytics with Solr
Sharding review
• Split a large index into multiple “shards”, each containing a unique slice of the index
- Documents are assigned to one and only one shard using a hash of the unique ID field
- Hash function designed to distribute documents evenly across N shards (default: MurmurHash3 32-bit; sketch below)
- Every shard must have at least one active host
• Benefits
- Distribute the cost of indexing across multiple nodes
- Parallelize complex query computations such as sorting and facets across multiple nodes
- Smaller index == less JVM heap == fewer GC headaches
- Smaller index == more of it fits in OS cache (MMapDirectory)
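To make the hash-range idea concrete, here is a minimal, hypothetical sketch of how a document id maps onto a shard range, assuming the org.apache.solr.common.util.Hash utility that ships with SolrJ 4.x. The document id is illustrative, and the real compositeId router adds more logic (e.g. shard keys):

```java
import org.apache.solr.common.util.Hash;

public class WhichShard {
    public static void main(String[] args) {
        // MurmurHash3 (32-bit) over the unique id field, seed 0.
        String docId = "doc-12345"; // hypothetical id
        int hash = Hash.murmurhash3_x86_32(docId, 0, docId.length(), 0);

        // With 2 shards: shard1 covers 80000000-ffffffff (negative ints),
        // shard2 covers 0-7fffffff (non-negative ints).
        String shard = (hash < 0) ? "shard1" : "shard2";
        System.out.printf("id=%s hash=%08x -> %s%n", docId, hash, shard);
    }
}
```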
Distributed indexing
[Diagram: 2 shards with 1 replica each across Node 1 and Node 2; shard1 range: 80000000-ffffffff, shard2 range: 0-7fffffff. The CloudSolrServer “smart client” gets the URLs of the current leaders from ZooKeeper, a hash on the docID determines the target shard, the document is sent to that shard’s leader, and the leader writes it to its update log (tlog) and forwards it to the replicas. (SolrJ sketch below.)]
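A minimal indexing sketch using the SolrJ 4.x “smart client”; the ZooKeeper address is an assumption for illustration, and the collection name reuses "logmill" from the shard-splitting examples later in the deck:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexOneDoc {
    public static void main(String[] args) throws Exception {
        // ZooKeeper-aware client; address is hypothetical.
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("logmill");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-12345");
        doc.addField("text_t", "hello SolrCloud");

        // The receiving node hashes the id and forwards the doc to the
        // right shard leader (client-side leader routing is SOLR-4816,
        // listed later under "What's next?").
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```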
Document routing
• Composite (default)
- numShards is specified when the collection is bootstrapped
- Each shard has a hash range (32-bit)
• Implicit
- numShards is unknown when the collection is bootstrapped
- No “range” for a shard; the indexing client is responsible for sending documents to the correct shard
- Feels a little old school to me ;-)
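For reference, a sketch of bootstrapping a compositeId-routed collection through the Collections API; the collection name and sizes are illustrative. For implicit routing you would instead name the shards yourself and omit numShards:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // numShards is fixed up front; each shard gets a slice of the
        // 32-bit hash range, as shown on the previous slide.
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=logmill&numShards=2&replicationFactor=2");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```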
Distributed Search
• Send the query request to any node
- A dedicated query controller, or just a load balancer, works too
• Two-stage process
1. Query controller sends the query to all shards and merges the results
» One host per shard must be online or queries fail
2. Query controller sends a 2nd query to all shards with documents in the merged result set to get the requested fields
• Solr client applications built for 3.x do not need to change (our query code still uses SolrJ 3.6; SolrJ 4.x sketch below)
• Limitations
- JOINs / Grouping need custom hashing
[Diagram: CloudSolrServer sends q=*:* to a query controller node, which gets the URLs of all live nodes from ZooKeeper, fans the query out to one host per shard, merges the results, and issues the second “get fields” request.]
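As noted above, 3.x clients keep working, but with SolrJ 4.x the same ZooKeeper-aware client can query too. A minimal sketch, with the address and collection name assumed as before:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MatchAllQuery {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("logmill");

        // Any node can act as the query controller: it fans the query
        // out to one host per shard and merges the results.
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("numFound=" + rsp.getResults().getNumFound());
        server.shutdown();
    }
}
```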
Shard splitting: overview
• Bit of history, before shard splitting ...
- Re-index to create more shards (difficult at scale), or
- Over-shard up front (inefficient and costly)
- Better to scale up organically
• Shard splitting (SOLR-3755)
- Split an existing shard into 2 sub-shards
» May need to double your node count
» Custom hashing may create hotspots in your cluster
Shard splitting: when to split?
• Can you purge some docs?
• No cut-and-dried answer, but you might want to split shards when:
- Avg. query performance begins to slow down
» (hint: this means you have to keep track of this)
- Indexing throughput degrades
- Out Of Memory errors appear when querying
» And you’ve done as much cache, GC, and query tuning as possible
• You don’t necessarily have to add more nodes
- May just want to split shards to add more parallelism
Shard splitting mechanics (in Solr 4.3.1)
• Split an existing shard into 2 sub-shards
- SPLITSHARD action in the collections API:
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=logmill&shard=shard1
- The actual “splitting” occurs using a low-level Lucene API; see org.apache.solr.update.SolrIndexSplitter
• Send a hard commit after the split completes (fixed in 4.4)
• Unload the “core” of the original shard (on all nodes):
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=logmill&deleteIndex=true
• Migrate one of the splits to another host (optional; driver sketch below)
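Putting the steps together, a hypothetical driver that issues the same HTTP calls shown above; per the speaker notes, the SPLITSHARD request blocks until the split finishes. Host, collection, and core names are taken from this slide:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SplitShard1 {
    // Hypothetical helper: GET a Solr admin URL and return the HTTP status.
    static int get(String u) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(u).openConnection();
        int code = c.getResponseCode(); // response body ignored for brevity
        c.disconnect();
        return code;
    }

    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8983/solr";

        // 1. Split shard1 into two sub-shards (blocks until done).
        get(base + "/admin/collections?action=SPLITSHARD&collection=logmill&shard=shard1");

        // 2. In 4.3.1, send an explicit hard commit afterwards (fixed in 4.4).
        get(base + "/logmill/update?commit=true");

        // 3. Unload the core of the original, now-inactive shard (repeat per node).
        get(base + "/admin/cores?action=UNLOAD&core=logmill&deleteIndex=true");
    }
}
```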
Interesting questions about the process
• What happens to the replicas when a shard is split?
- New sub-shards are replicated automatically using Solr replication
• What happens to update requests sent to the original shard leader during the split operation?
- Updates are sent to the correct sub-split during “construction”
• The original shard remains in the cluster – doesn’t that create duplicate docs in queries?
- No, the original shard enters the “inactive” state and no longer participates in distributed queries for the collection
Split shard in action: cluster with 2 shards
• Graph view before the split
• Graph view after the split
• Graph view after unloading the original shard
Before and After shard splitting
[Diagram, before: 2 nodes hosting Shard 1 and Shard 2, each with a leader and a replica; shard1 range: 80000000-ffffffff, shard2 range: 0-7fffffff.]
[Diagram, after: Shard 1 is replaced by Shard 1_0 and Shard 1_1, each with a leader and a replica; shard1_0 range: 80000000-bfffffff, shard1_1 range: c0000000-ffffffff, shard2 range: 0-7fffffff.]
Shard splitting: Limitations
• Both splits end up on the same node … but it’s easy to migrate one to another node
- Assuming you have replication, you can unload the core of one of the new sub-shards, making the replica the leader, and then bring up another replica for that shard on another node.
- It would be nice to be able to specify the disk location of the new sub-shard indexes (splitting 50GB using 1 disk can take a while)
• No control over where the replicas end up
- Possible future enhancement
• Not a collection-wide rebalancing operation
- You can’t grow your cluster from 16 nodes to 24 nodes and end up with an even distribution of documents per shard
On to data partitioning …
Data Partitioning: Sports Aggregator
• Collection containing all things sports related: blogs, tweets, news, photos, etc.
- Bulk of queries are for one sport at a time
- Sports have a seasonality aspect to them
• Use custom hashing to route documents to specific shards based on the sport
• If you only need docs about “baseball”, you can query the “baseball” shard(s)
- Allows you to do JOINs and Grouping as if you are not distributed
- Replicate specific shards based on query volume to that shard
Cluster state
• /clusterstate.json shows the hash range and document router (excerpt below)
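For reference, a heavily abridged clusterstate.json sketch for a 2-shard "sports" collection like the one used in the following slides. The exact field layout varies by Solr version, so treat the structure as illustrative rather than definitive:

```json
{
  "sports": {
    "router": "compositeId",
    "shards": {
      "shard1": {
        "range": "80000000-ffffffff",
        "state": "active",
        "replicas": { "core_node1": { "state": "active", "leader": "true" } }
      },
      "shard2": {
        "range": "0-7fffffff",
        "state": "active",
        "replicas": { "core_node2": { "state": "active", "leader": "true" } }
      }
    }
  }
}
```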
Custom hashing: Indexing
[Diagram: same indexing flow as before — the “smart client” asks ZooKeeper for the current leaders, but the document is routed by a hash on shardKey!docID. shard1 range: 80000000-ffffffff; shard2 range: 0-7fffffff.]
Example document:
{
  "id" : "football!2",
  "sport_s" : "football",
  "type_s" : "post",
  "lang_s" : "en",
  ...
}
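A minimal SolrJ sketch of routing by sport with composite ids, mirroring the example document above; the ZooKeeper address is assumed as before:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexBySport {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("sports");

        // "shardKey!docID": every doc with the "football!" prefix hashes
        // to the same shard, co-locating all football content.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "football!2");
        doc.addField("sport_s", "football");
        doc.addField("type_s", "post");
        doc.addField("lang_s", "en");
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}
```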
Custom hashing: Query side
[Diagram: the client sends q=*:*&shard.keys=golf! to the query controller, which restricts the distributed query to the shard(s) covering the “golf” key before the usual merge and “get fields” steps. (SolrJ sketch below.)]
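And the query side: restricting a distributed query to the shard(s) that cover one key via the shard.keys parameter. A sketch, with the address and collection name assumed as before:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class GolfOnlyQuery {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("sports");

        SolrQuery q = new SolrQuery("*:*");
        // Only the shard(s) owning the hash range for "golf!" are queried,
        // so grouping/joins behave as if the index were not distributed.
        q.set("shard.keys", "golf!");
        System.out.println("numFound=" + server.query(q).getResults().getNumFound());
        server.shutdown();
    }
}
```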
Custom hashing key points
• Co-locate documents having a common property in the same shard
- e.g. golf!10 and golf!22 will be in the same shard
• Scale up the replicas for specific shards to address high query volume – e.g. golf in summer
• Not as much control over the distribution of keys
- golf, baseball, and tennis all in shard1 in my example
• Can split unbalanced shards when using custom hashing
What’s next?
• Improvements to the splitting feature coming in 4.4
• Client-side routing
- Smart client will decide the best leader to send a document to
- SOLR-4816
• Re-balance the collection after adding N more nodes
- SOLR-5025
• Splitting optimizations
- Control the path where sub-shards create their index (similar to the path option when doing a snapshot backup)
Thank you for attending!
• Keeping in touch
- Solr mailing list: solr-user@lucene.apache.org
- Solr In Action book: http://www.manning.com/grainger/
- Twitter: @thelabdude
- Email: thelabdude@gmail.com
- LinkedIn: linkedin.com/in/thelabdude
• Questions?
Speaker notes
  1. This is my first webinar and I’m used to asking questions to the audience and taking polls and that sort of thing so we’ll see how it works.
  2. Some of this content will be released in our chapter on Solr cloud in the next MEAP, hopefully within a week or so.
  3. There should be no doubt anymore whether Solr cloud can scale. This gives rise to a new set of problems. My focus today is on dealing with unbounded growth and complexity. I became interested in these types of problems after having developed a large-scale Solr cloud cluster. My problems went from understanding sharding and replication and operating a cluster to how to manage unbounded growth of the cluster, as well as the urge from the business side to do online, near real-time analytics with Solr, e.g. complex facets, sorting, huge lists of boolean clauses, large page sizes.
  4. Why shard? Distribute indexing and query processing to multiple nodes, and parallelize some of the complex query processing stages, such as facets and sorting. Ask yourself this question: is it faster to sort 10M docs on 10 nodes, each having to sort 1M, or on one node having to sort all 10M? Probably the former. You also achieve a smaller index per node (faster sorting, smaller filters). How to shard? Each shard covers a hash range; the default is a hash of the unique document ID field (diagram). Custom hashing: custom document routing using a composite ID.
  5. Smart client knows the current leaders by asking Zk, but doesn’t know which leader to assign the doc to (that is planned though). The node accepting the new document computes a hash on the document ID to determine which shard to assign the doc to. The new document is sent to the appropriate shard leader, which sets the _version_ and indexes it. The leader saves the update request to durable storage in the update log (tlog). The leader sends the update to all active replicas in parallel. Commits are sent around the cluster. An interesting question is how batches of updates from the client are handled.
  6. Shard splitting is more efficient: re-indexing is expensive and time-consuming, and over-sharding initially is inefficient and wasteful. https://issues.apache.org/jira/browse/SOLR-3755. Signs your shards are too big for your hardware: OutOfMemory errors begin to appear (you could throw more RAM at it), query performance begins to slow, new searcher warming slows down. Consider whether just adding replicas is what you really need. Bottom line – take hardware into consideration when considering shard splitting.
  7. Diagram of the process. The split request blocks until the split finishes. When you split a shard, two sub-shards are created in new cores on the same node (can you target another node?). Replication factor is maintained. The need to call commit afterwards is fixed in 4.4 - https://issues.apache.org/jira/browse/SOLR-4997. IndexWriter can take an IndexReader and do a fast copy of it.
  8. New sub-shards are replicated automatically using Solr replication. Updates are buffered and sent to the correct sub-split during “construction”. No, the original shard enters the “inactive” state and no longer participates in distributed queries for the collection.
  9. Simple cluster with 2 shards (shard1 has a replica). We’ll split shard1 using the collections API. A couple of things to notice: the original shard1 core is still active in the cluster, each new sub-shard has a replica, and shard1 is actually inactive in Zookeeper, so queries are not going to go to shard1 anymore.
  10. On this last point, I would approach shard splitting as a maintenance operation and not something you just do willy-nilly in production. The actual work gets done in the background, and it’s designed to accept incoming update requests while the split is processing. Note: the original shard remains intact but in the inactive state. This means you can re-activate it by updating clusterstate.json if need be.
  11. Index about sports - all things sports related: news, blogs, tweets, photos, you name it. Some people care about many sports, but most of your users care about one sport at a time, especially given the seasonality of sports. We’ll use custom document routing to send football-related docs to the football shard and baseball-related docs to the baseball shard. One of the concerns with this is that you lose some of the benefits of sharding: distributing indexing across multiple nodes. However, there’s nothing that prevents you from splitting the football shard onto several nodes as needed. Of course, you then lose JOINs and ngroups.
  12. You’ll keep docs for the same sport in the same shard, but could end up with an uneven distribution of docs!
  13. You can restrict which shards are considered in the distributed query using the shard.keys parameter
  14. curl -i -v "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=sports&shard=shard1"