2. Apache Solr has a huge install base and tremendous momentum
Solr is both established & growing
• The most widely used search solution on the planet
• 8M+ total downloads
• 250,000+ monthly downloads
• You use Solr every day.
• Solr has tens of thousands of applications in production.
• 2,500+ open Solr jobs
Activity Summary
30-Day Summary (Aug 18 – Sep 17, 2014)
• 128 Commits
• 18 Contributors
12-Month Summary (Sep 17, 2013 – Sep 17, 2014)
• 1351 Commits
• 29 Contributors
via https://www.openhub.net/p/solr
3. Solr scalability is unmatched.
• 10TB+ Index Size
• 10 Billion+ Documents
• 100 Million+ Daily Requests
5. What is Solr?
• A system built to search text
• A specialized type of database management system
• A platform to build search applications on
• Customizable, open source software
7. What is SolrCloud?
A subset of optional features in Solr that enables and simplifies horizontal scaling of a search index using sharding and replication.
Goals: scalability, performance, high availability, simplicity, and elasticity
8. Terminology
• ZooKeeper: Distributed coordination service that provides centralised configuration, cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine
• Collection: Search index distributed across multiple nodes that share the same configuration
• Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
• Replica: A copy of a shard in a collection
• Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper; handles automatic failover and leader election
10. Collection == Distributed Index
• A collection is a distributed index defined by:
• named configuration stored in ZooKeeper
• number of shards: documents are distributed across N partitions of the index
• document routing strategy: how documents get assigned to shards
• replication factor: how many copies of each document in the collection
• Collections API:
• curl "http://localhost:8983/solr/admin/collections?
action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&coll
ection.configName=myonf
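For comparison, the same CREATE call can be issued from plain Java; a minimal JDK-only sketch (no SolrJ required), reusing the endpoint and parameters from the curl example above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Same request as the curl command above
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=punemeetup"
                + "&replicationFactor=2&numShards=2"
                + "&collection.configName=myconf";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // status response from Solr
        }
        in.close();
    }
}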
12. Document Routing
• Each shard covers a hash-range
• Default: Hash ID into 32-bit integer, map to range
• leads to (roughly) balanced shards
• Custom-hashing
• Tri-level: app!user!doc
• Implicit: no hash-range set for shards
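A minimal SolrJ sketch of composite-ID routing (SolrJ 4.x, contemporary with this deck; the client class was renamed CloudSolrClient in Solr 5+; the zkHost, collection, and field names here are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdRouting {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("localhost:2181");
        solr.setDefaultCollection("punemeetup");

        // compositeId router: the prefix(es) before '!' drive the hash, so all
        // of user42's documents land in the same shard's hash range.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "myapp!user42!doc7"); // tri-level: app!user!doc
        solr.add(doc);
        solr.commit();

        // At query time, _route_ restricts the query to the shard(s) owning that prefix.
        SolrQuery q = new SolrQuery("*:*");
        q.set("_route_", "myapp!user42!");
        System.out.println(solr.query(q).getResults().getNumFound() + " docs for user42");
        solr.shutdown();
    }
}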
13. Replication
• Why replicate?
• High-availability
• Load balancing
• How does it work in SolrCloud?
• Near-real-time, not master-slave
• Leader forwards to replicas in parallel, waits for response
• Error handling during indexing is tricky
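The leader's fan-out-and-wait can be pictured with a toy sketch; this is illustrative only, not Solr's internal code (sendUpdate is a hypothetical placeholder):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LeaderFanOut {
    public static void main(String[] args) throws Exception {
        final String update = "{\"id\":\"doc1\"}";
        List<String> replicas = Arrays.asList("http://replica1:8983", "http://replica2:8983");
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());

        // Forward the update to every replica in parallel...
        List<Future<Boolean>> acks = new ArrayList<Future<Boolean>>();
        for (final String replica : replicas) {
            acks.add(pool.submit(new Callable<Boolean>() {
                public Boolean call() { return sendUpdate(replica, update); }
            }));
        }
        // ...then wait for every response before answering the client.
        boolean allOk = true;
        for (Future<Boolean> ack : acks) {
            allOk &= ack.get();
        }
        pool.shutdown();
        System.out.println(allOk ? "ack client" : "handle failed replica (the tricky part)");
    }

    static boolean sendUpdate(String replicaUrl, String update) {
        return true; // placeholder for the actual HTTP forward
    }
}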
14. Distributed Indexing
• Get cluster state from ZK
• Route document directly to leader (hash on doc ID)
• Persist document on durable storage (tlog)
• Forward to healthy replicas
• Acknowledge write success to the client
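A minimal indexing sketch with the ZooKeeper-aware SolrJ client, which performs the first two steps for you (SolrJ 4.x; the zkHost and collection name are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DistributedIndexing {
    public static void main(String[] args) throws Exception {
        // Reads cluster state from ZooKeeper and sends each document
        // straight to its shard leader (no extra hop through another node).
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("punemeetup");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i); // hashed to pick the target shard
            doc.addField("title", "document " + i);
            batch.add(doc);
        }
        // Each leader persists to its tlog, forwards to healthy replicas,
        // and add() returns once the write is acknowledged.
        solr.add(batch);
        solr.commit();
        solr.shutdown();
    }
}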
15. Shard Leader
• Additional responsibilities during indexing only! Not a master node
• Leader is a replica (handles queries)
• Accepts update requests for the shard
• Increments the _version_ on the new or updated doc
• Sends updates (in parallel) to all replicas
16. Distributed Queries
• Query client can be ZK-aware or just query via a load balancer
• Client can send the query to any node in the cluster
• The receiving node acts as controller: it distributes the query to a replica of each shard to identify the documents matching the query
• The controller sorts those per-shard results and issues a second query to fetch all fields for the requested page of results
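Querying looks the same either way; a minimal SolrJ sketch with the ZK-aware client, which picks a node itself so no external load balancer is needed (the query string and field names are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("localhost:2181");
        solr.setDefaultCollection("punemeetup");

        SolrQuery q = new SolrQuery("title:document");
        q.setStart(0).setRows(10); // the page fetched in the second phase
        QueryResponse rsp = solr.query(q); // scatter to one replica per shard, then merge
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}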
17. Scalability / Stability Highlights
• All nodes in the cluster perform indexing and execute queries; no master node
• Distributed indexing: no SPoF, high throughput via direct updates to leaders, automated failover to a new leader
• Distributed queries: add replicas to scale out QPS; parallelize complex query computations; fault tolerance
• Indexing and queries continue as long as there is at least one healthy replica per shard
18. ZooKeeper
• Is a very good thing ... clusters are a zoo!
• Centralized configuration management
• Cluster state management
• Leader election (shard leader and overseer)
• Overseer distributed work queue
• Live Nodes: ephemeral znodes used to signal that a server is gone
• Needs 3 nodes for quorum in production
19. ZooKeeper: State Management
• Keep track of live nodes in the /live_nodes znode
• Ephemeral znodes
• ZooKeeper client timeout
• Collection metadata and replica state in /clusterstate.json
• Every core has watchers for /live_nodes and /clusterstate.json
• Leader election: ZooKeeper sequence numbers on ephemeral znodes
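Both mechanisms rest on two ZooKeeper primitives. A minimal sketch with the plain ZooKeeper Java client (the paths mimic Solr's layout but are illustrative; parent znodes must already exist, and watch/error handling is omitted):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkPatterns {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) { /* session events */ }
        });

        // Live node: an EPHEMERAL znode disappears when the session times out,
        // which is how the rest of the cluster learns a Solr node is gone.
        zk.create("/live_nodes/host1:8983_solr", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Leader election: EPHEMERAL_SEQUENTIAL znodes; the candidate with the
        // lowest sequence number is the leader, the others watch and wait.
        String election = "/collections/punemeetup/leader_elect/shard1/election";
        String me = zk.create(election + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> candidates = zk.getChildren(election, false);
        Collections.sort(candidates);
        System.out.println(me.endsWith(candidates.get(0)) ? "I am leader" : "I wait");
        zk.close();
    }
}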
20. Other Features/Highlights
• Near-Real-Time Search: Documents are visible within a second or so after being indexed
• Partial Document Update: Just update the fields you need to change on existing documents
• Optimistic Locking: Ensure updates are applied to the correct version of a document
• Transaction log: Better recoverability; peer-sync between nodes after hiccups
• HTTPS
• Use HDFS for storing indexes
• Use MapReduce for building the index (SOLR-1301)
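Two of these, Partial Document Update and Optimistic Locking, compose naturally; a short SolrJ sketch (the _version_ value is a placeholder you would read from a prior query; the zkHost and field names are assumptions):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("localhost:2181");
        solr.setDefaultCollection("punemeetup");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        // Partial document update: "set" rewrites one field, leaving the rest intact.
        doc.addField("title", Collections.singletonMap("set", "updated title"));
        // Optimistic locking: supply the _version_ read earlier; if the document
        // changed in the meantime, Solr rejects the update with a 409 conflict.
        doc.addField("_version_", 1478217239716886528L); // placeholder value
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}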
22. Solr on YARN
• Run the SolrClient application:
• Allocate a container to run the SolrMaster
• SolrMaster requests containers to run SolrCloud nodes
• Solr containers allocated across the cluster
• Each SolrCloud node connects to ZooKeeper
23. More Information on Solr on YARN
• https://lucidworks.com/blog/solr-yarn/
• https://github.com/LucidWorks/yarn-proto
• https://issues.apache.org/jira/browse/SOLR-6743
24. Our users are also pushing the limits
https://twitter.com/bretthoerner/status/476830302430437376
25. Up, up and away!
https://twitter.com/bretthoerner/status/476838275106091008