Apache Solr is an open-source enterprise search platform that provides fast, scalable, and reliable full-text search functionality. It powers the search capabilities of many large websites and applications. Some key features of Solr include fast indexing and search, faceted search, autocomplete, geospatial search, and integration with various databases and applications.
2. Enterprise Search Server
The criteria ...
• Fast
• Flexible
• Powerful
• Scalable
• Relevant results
• Production ready and easy to deploy
3. Why SOLR
• Greater control over your website search
• Caching, Replication, Distributed search
• Really fast indexing/searching; indexes can be merged/optimized (index compaction)
• Great admin interface that can be used over HTTP
• Awesome community support
• Support for integration with various other products
5. What is SOLR?
• Very fast full text search engine
http://lucene.apache.org/solr/
• Based on Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java.
In brief, Apache Solr exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
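As a rough illustration of "called over HTTP from any language", the sketch below only builds the URL of a search request; it does not contact a server. The host, the port 8983, and the core name "posts" are assumptions made for the example, not values from the slides.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, params):
    """Return the URL of a Solr-style select request (no network call is made)."""
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical local install and core name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr",
    "posts",
    {"q": "title:solr", "wt": "json"},
)
print(url)
```

Any HTTP client in any language can then fetch that URL and parse the JSON or XML response.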
6. Features
• Full Text Search
• Faceted navigation
• More items like this (recommendation) / related searches
• Spell Suggest/Auto-Complete
• Custom document ranking/ordering
• Snippet generation/highlighting
• Geospatial Search
10. More Features ...
• Database integration
• Rich document (Word, PDF) handling
• REST-like HTTP/XML, JSON APIs (so you can code in virtually any language)
• Flexible configuration
• Extensive plugin architecture for advanced customization
• Scalable distributed search, dynamic clustering, index replication
11. App Server Support
• Apache Tomcat
• Jetty
• Resin
• WebLogic™
• WebSphere™
• GlassFish
• dmServer™
• JBoss™ ... and many more
12. SOLR History
• Developed at CNET Networks by Yonik Seeley
• Donated to ASF (Apache Software Foundation) in early 2006
• Incubation period ended in January 2007 (v1.2 released later in 2007)
• Solr is now maintained as a subproject of Lucene
13. Solr
• Only one table (documents). No joins.
• Each row is a document
• A document can have multiple fields and
fields can have multiple values
– e.g. Tags, Categories, ...
• Fast for search (finding the documents)
• Slow when returning large sets of data
• Can scale to many millions of documents
14. Solr Architecture
• Servlet container: Jetty, Tomcat ... any :)
– Handles HTTP
• Solr
– Connectivity between the servlet container and Lucene
• Lucene
– Full Text Search Framework
16. How Lucene Works
• Regular indexes repeat index data for each row
    key      ID
    banana   1
    banana   2
    banana   3
    cat      2
    cat      3
    dog      1
    dog      3
• Inverted indexes reference the term once and then the matching documents
    Term     IDs
    banana   1,2,3
    cat      2,3
    dog      1,3
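A minimal sketch of building such an inverted index. The three toy documents are made up so that the result matches the term/ID table on this slide; real Lucene analysis and storage are far more involved.

```python
from collections import defaultdict

# Toy documents (ID -> text), chosen to reproduce the slide's table.
docs = {1: "banana dog", 2: "banana cat", 3: "banana cat dog"}


def build_inverted_index(documents):
    """Map each term once to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```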
17. Inverted Index Matching
• Lucene uses bit vectors to quickly find all documents with terms
    Term     IDs
    banana   1,2,3
    cat      2,3
    dog      1,3
cat banana
    Document   1 2 3
    cat        0 1 1
    banana     1 1 1
    Match      0 1 1
dog cat
    Document   1 2 3
    dog        1 0 1
    cat        0 1 1
    Match      0 0 1
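The matching idea on this slide can be sketched with integer bitmasks: one bit per document, ANDed across the query terms. This is only an illustration of the principle under that assumption, not Lucene's actual data structures.

```python
DOC_IDS = [1, 2, 3]
postings = {"banana": [1, 2, 3], "cat": [2, 3], "dog": [1, 3]}


def to_bits(ids):
    """Encode a list of document IDs as an integer bit vector (doc 1 -> bit 0)."""
    bits = 0
    for doc_id in ids:
        bits |= 1 << (doc_id - 1)
    return bits


def match_all(terms):
    """Return the IDs of documents that contain every one of the given terms."""
    result = to_bits(DOC_IDS)
    for term in terms:
        result &= to_bits(postings.get(term, []))
    return [d for d in DOC_IDS if result & (1 << (d - 1))]


print(match_all(["cat", "banana"]))
print(match_all(["dog", "cat"]))
```

The two calls correspond to the two "Match" rows in the tables above.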
18. Scoring
• Now that the documents are found, in what order should they be shown?
• Lucene uses TF-IDF (Term Frequency-Inverse Document Frequency) to score the documents
Term IDs
banana {1.28} 1 {2}, 2 {5}, 3 {1}
cat {1.60} 2 {4}, 3 {2}
dog {1.60} 1 {1}, 3 {6}
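A small sketch of TF-IDF ranking using the per-document term counts from the table above. The formula here (square root of the term frequency times 1 + ln(N/df)) is a common textbook variant chosen for illustration; it is an assumption, it does not reproduce the weights shown on the slide, and it is not Lucene's exact scoring formula.

```python
import math

# Term -> {document ID: number of occurrences}, taken from the slide's table.
postings = {
    "banana": {1: 2, 2: 5, 3: 1},
    "cat": {2: 4, 3: 2},
    "dog": {1: 1, 3: 6},
}
NUM_DOCS = 3


def idf(term):
    """Rarer terms (found in fewer documents) get a larger weight."""
    return 1.0 + math.log(NUM_DOCS / len(postings[term]))


def score(doc_id, query_terms):
    """Sum the TF-IDF contribution of each query term present in the document."""
    total = 0.0
    for term in query_terms:
        tf = postings.get(term, {}).get(doc_id, 0)
        if tf:
            total += math.sqrt(tf) * idf(term)
    return total


def rank(query_terms):
    """Return matching document IDs, best score first."""
    doc_ids = {d for t in query_terms for d in postings.get(t, {})}
    return sorted(doc_ids, key=lambda d: score(d, query_terms), reverse=True)


print(rank(["dog", "cat"]))
```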
19. Scoring Notes
The goal of scoring is:
• To boost the importance of documents where the word is mentioned often
• To boost the importance of rare words (that don’t appear in many documents)
Solr also supports term boosts to increase the importance of one term over another
20. Stemming, Stopwords, Synonyms
• Stemming trims suffixes from terms
trimmed -> trim
stemming -> stem
• Stopword filtering removes common words that are not important
the, and, for, it, ...
• This is done with both the words in the document and the query terms
• Solr supports searching by a predefined synonym list
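The three steps can be sketched as a tiny analysis function applied to both documents and queries. The stemmer below is a deliberately naive suffix stripper and the synonym map is hypothetical; Solr's real analyzers are configured in the schema and are much more thorough.

```python
STOPWORDS = {"the", "and", "for", "it"}
SYNONYMS = {"tv": "television"}  # hypothetical synonym list


def stem(word):
    """Naive stemmer: strip "ing"/"ed", then undouble a final consonant."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]  # "trimm" -> "trim"
            return word
    return word


def analyze(text):
    """Lowercase, drop stopwords, map synonyms, and stem each token."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?")
        if not token or token in STOPWORDS:
            continue
        token = SYNONYMS.get(token, token)
        tokens.append(stem(token))
    return tokens


print(analyze("The trimmed TV and stemming"))
```

Because the same function runs on the query, a search for "trim" and a document containing "trimmed" end up with the same indexed term.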
21. Configuring Solr
• schema.xml - Contains all of the details about document structure, index-time and query-time processing
• solrconfig.xml - Contains most of the parameters for configuring Solr itself
22. QUERY SYNTAXES (RDBMS)
SELECT * FROM post WHERE
(topic LIKE '%apache%' OR author LIKE '%bambr%')
OR (topic LIKE '%solr%' OR author LIKE '%frank%')
ORDER BY id DESC
QUERY SYNTAXES (SOLR)
topic:"The Right Way" AND author:WrongGuy
23. Querying Solr 1
• Plain text search
q=text:"I love android"
• Expanding search to more fields:
q=title:android AND type:review AND price:[* TO 500]
• Add facets
facet=true&facet.field=product&facet.field=rating
• Ordering results
sort=score desc,price asc
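Parameters such as facet.field may repeat, so a client typically passes them as a list of pairs rather than a dictionary. A minimal sketch of encoding the parameters from this slide (the field names are the slide's examples; nothing is sent over the network):

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs keeps the repeated facet.field parameter.
params = [
    ("q", "title:android AND type:review AND price:[* TO 500]"),
    ("facet", "true"),
    ("facet.field", "product"),
    ("facet.field", "rating"),
    ("sort", "score desc,price asc"),
]

query_string = urlencode(params)
print(query_string)

# Decoding shows both facet fields survived the round trip.
decoded = parse_qs(query_string)
print(decoded["facet.field"])
```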
24. Querying Solr 2
• Add facets for range queries
facet.query=price:[* TO 100]
&facet.query=price:[100 TO 200]
&facet.query=price:[500 TO *]
• Limiting results
rows=15
• Paginating on results
start=25&rows=10
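Since start is a zero-based offset into the result list, a page number can be turned into start/rows as sketched below. The helper name page_params is hypothetical.

```python
def page_params(page, page_size=10):
    """Convert a 1-based page number into Solr-style start/rows values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return {"start": (page - 1) * page_size, "rows": page_size}


print(page_params(1))
print(page_params(3, page_size=15))
```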
25. Querying Solr 3
Advanced query parameters:
• fq : Filter query, e.g. fq=type:review&fq=price:[* TO 500]
• fl : Restrict the fields to be returned, e.g. fl=id,title,text
• hl : Highlight matches in generated snippets, e.g. hl=true&hl.fl=title,text
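With hl=true, the JSON response carries the snippets in a separate "highlighting" section keyed by document ID. The sketch below attaches them to the returned documents; the sample response is made-up illustration data, and the helper name attach_snippets is hypothetical.

```python
# Made-up sample in the general shape of a Solr JSON response.
sample_response = {
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "42", "title": "Android review"}],
    },
    "highlighting": {"42": {"title": ["<em>Android</em> review"]}},
}


def attach_snippets(result, id_field="id"):
    """Return copies of the documents with their highlight snippets attached."""
    snippets = result.get("highlighting", {})
    merged = []
    for doc in result["response"]["docs"]:
        entry = dict(doc)
        entry["snippets"] = snippets.get(str(doc[id_field]), {})
        merged.append(entry)
    return merged


for doc in attach_snippets(sample_response):
    print(doc["id"], doc["snippets"])
```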
26. Solr Caching
• External Caching : Memcached, etc.
• Internal Caching
Different types of cache:
1) FilterCache: Used by filter queries (fq), sometimes for faceting too
2) QueryResultCache : Used for results
returned by generic queries
3) DocumentCache: Used for the stored fields of documents
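Conceptually, each internal cache maps a key (for example a filter query string) to a previously computed result and evicts old entries when full. The least-recently-used cache below is only a sketch of that idea, with made-up keys and values; Solr's own cache implementations and their settings live in solrconfig.xml.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache, for illustration only."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the oldest entry


# Example: cache the document IDs matched by each filter query.
filter_cache = LRUCache(max_size=2)
filter_cache.put("type:review", {2, 3})
filter_cache.put("price:[* TO 500]", {1, 3})
filter_cache.get("type:review")           # touch, so it stays
filter_cache.put("rating:5", {3})         # evicts "price:[* TO 500]"
print(filter_cache.get("price:[* TO 500]"))
```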