Over the past several months, Solr has reached a critical milestone: the ability to elastically scale out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we've scaled our largest Solr 4 index to nearly 900M documents, and it's still growing. As our index grows, so does our need to manage that growth.
In practice, it's common for indexes to continue to grow as organizations acquire new data. Over time, even the best-designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this webinar, you'll learn about new features in Solr that help manage large-scale clusters. Specifically, we'll cover data partitioning and shard splitting.
Partitioning helps you organize subsets of the index based on values contained in your documents, such as a date or customer ID. We'll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into two smaller shards to increase parallelism during query execution.
Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and production ready.
This is my first webinar and I’m used to asking questions to the audience and taking polls and that sort of thing so we’ll see how it works.
Some of this content will be released in our chapter on SolrCloud in the next MEAP, hopefully within a week or so.
There should be no doubt anymore about whether SolrCloud can scale. That gives rise to a new set of problems, and my focus today is on dealing with unbounded growth and complexity. I became interested in these types of problems after having developed a large-scale SolrCloud cluster. My problems went from understanding sharding and replication and operating a cluster to managing unbounded growth of the cluster, along with pressure from the business side to do online, near-real-time analytics with Solr, e.g. complex facets, sorting, huge lists of boolean clauses, and large page sizes.
Why shard?
- Distribute indexing and query processing across multiple nodes
- Parallelize some of the complex query processing stages, such as facets and sorting. Ask yourself this question: is it faster to sort 10M docs on 10 nodes, each sorting 1M, or on one node sorting all 10M? Probably the former.
- Achieve a smaller index per node (faster sorting, smaller filters)
How to shard?
- Each shard covers a hash range (see the sketch after this list)
- Default – hash of the unique document ID field (diagram)
- Custom hashing – custom document routing using a composite ID
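To make the "each shard covers a hash range" idea concrete, here is a minimal Java sketch of the range arithmetic for a two-shard collection. This is an illustration only: it uses String.hashCode() as a stand-in, whereas Solr's compositeId router actually applies MurmurHash3 to the uniqueKey field; the IDs and shard count are made up for the example.

```java
// Illustration only: mapping a document's hash into one of N equal ranges
// of the 32-bit hash space. Solr uses MurmurHash3 on the uniqueKey field;
// String.hashCode() here is just a stand-in for the demo.
public class HashRangeDemo {
    public static void main(String[] args) {
        int numShards = 2;
        long rangeSize = (1L << 32) / numShards;      // split the full 32-bit space evenly
        String[] ids = {"doc-1", "doc-2", "doc-3"};
        for (String id : ids) {
            long hash = id.hashCode() & 0xFFFFFFFFL;  // treat the hash as an unsigned 32-bit value
            int shard = (int) (hash / rangeSize);     // which range (shard) the hash falls into
            System.out.println(id + " -> shard" + (shard + 1));
        }
    }
}
```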
- The smart client knows the current leaders by asking ZooKeeper, but doesn't know which leader to assign the doc to (that is planned, though)
- The node accepting the new document computes a hash on the document ID to determine which shard to assign the doc to
- The new document is sent to the appropriate shard leader, which sets the _version_ and indexes it
- The leader saves the update request to durable storage in the update log (tlog)
- The leader sends the update to all active replicas in parallel
- Commits are sent around the cluster
- An interesting question is how batches of updates from the client are handled
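A minimal SolrJ 4.x sketch of the client side of this flow: the "smart" client reads cluster state from ZooKeeper and sends the update into the cluster, which then routes the doc to its shard leader as described above. The ZooKeeper hosts, collection name, and field names are assumptions for illustration.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: indexing a document into SolrCloud with the ZooKeeper-aware client.
public class IndexingExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181/solr");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-12345");
        doc.addField("title_s", "hello solrcloud");

        server.add(doc);     // routed to the correct shard leader, then to its replicas
        server.commit();     // commits are distributed around the cluster
        server.shutdown();
    }
}
```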
Shard splitting is more efficient: re-indexing is expensive and time-consuming, and over-sharding up front is inefficient and wasteful (https://issues.apache.org/jira/browse/SOLR-3755).
Signs your shards are too big for your hardware:
- OutOfMemory errors begin to appear – you could throw more RAM at it
- Query performance begins to slow
- New searcher warming slows down
Consider whether just adding replicas is what you really need. Bottom line – take your hardware into account when deciding whether to split shards.
Diagram of the process
- The split request blocks until the split finishes
- When you split a shard, two sub-shards are created in new cores on the same node (can you target another node?)
- The replication factor is maintained
- Need to call commit after the split; fixed in 4.4 – https://issues.apache.org/jira/browse/SOLR-4997
- IndexWriter can take an IndexReader and do a fast copy of it
New sub-shards are replicated automatically using Solr replication. Updates are buffered and sent to the correct sub-shard while it is under "construction". The original shard enters the "inactive" state and no longer participates in distributed queries for the collection.
Simple cluster with 2 shards (shard1 has a replica). We'll split shard1 using the Collections API (see the sketch below). A couple of things to notice:
- The original shard1 core is still active in the cluster
- Each new sub-shard has a replica
- shard1 is actually inactive in ZooKeeper, so queries are not going to go to shard1 anymore
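Here is a sketch of issuing the SPLITSHARD call through the Collections API with SolrJ 4.x; it is equivalent to hitting http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1 in a browser or with curl. The host, port, collection, and shard names are assumptions for the demo cluster described above.

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: calling the Collections API SPLITSHARD action from SolrJ.
public class SplitShardExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "SPLITSHARD");
        params.set("collection", "collection1");
        params.set("shard", "shard1");

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);   // the split request blocks until the split finishes

        server.shutdown();
    }
}
```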
On this last point, I would approach shard splitting as a maintenance operation and not something you just do willy-nilly in production. The actual work gets done in the background, and it's designed to accept incoming update requests while the split is processing. Note: the original shard remains intact but in the inactive state. This means you can re-activate it by updating clusterstate.json if need be.
Imagine an index about sports – all things sports related: news, blogs, tweets, photos, you name it. Some people care about many sports, but most of your users care about one sport at a time, especially given the seasonality of sports. We'll use custom document routing to send football-related docs to the football shard and baseball-related docs to the baseball shard (see the sketch below). One of the concerns with this is that you lose some of the benefits of sharding, such as distributing indexing across multiple nodes. However, there's nothing that prevents you from splitting the football shard onto several nodes as needed. Of course, you then lose JOINs and ngroups.
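A minimal sketch of what that routing looks like with the compositeId router: prefixing the document ID with "football!" or "baseball!" makes Solr hash the prefix, so all docs for a given sport land on the same shard. The collection name, field names, and ZooKeeper host are assumptions for the example.

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: custom document routing via composite IDs ("shardkey!uniqueid").
public class SportsRoutingExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181/solr");
        server.setDefaultCollection("sports");

        SolrInputDocument football = new SolrInputDocument();
        football.addField("id", "football!article-1001");   // shard key prefix + unique suffix
        football.addField("sport_s", "football");

        SolrInputDocument baseball = new SolrInputDocument();
        baseball.addField("id", "baseball!article-2001");
        baseball.addField("sport_s", "baseball");

        server.add(football);
        server.add(baseball);
        server.commit();
        server.shutdown();
    }
}
```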
You'll keep docs for the same sport in the same shard, but you could end up with an uneven distribution of docs!
You can restrict which shards are considered in the distributed query using the shard.keys parameter
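A short sketch of the query side, continuing the sports example: passing shard.keys with the "football!" key restricts the distributed query to the shard(s) covering that key, so the other shards are never touched. The query text, collection, and ZooKeeper host are assumptions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: limiting a distributed query to the shards that own a given shard key.
public class ShardKeysQueryExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181/solr");
        server.setDefaultCollection("sports");

        SolrQuery query = new SolrQuery("touchdown");
        query.set("shard.keys", "football!");   // only shards covering this key are searched

        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
        server.shutdown();
    }
}
```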