Enterprise Search Solution: Apache SOLR. What's available and why it's so cool

Apache SOLR
Enterprise Search Solution
(overview)

Enterprise Search Server

The criteria ...
•Fast

•Flexible

•Powerful

•Scalable

•Relevant Results

•Production ready & Easy deployment

Why SOLR

• Greater control over your website search
• Caching, Replication, Distributed search
• Really fast Indexing/Searching, Indexes can
be merged/optimized (Index compaction)
• Great admin interface, accessible over
HTTP
• Awesome community support
• Support for integration with various other
products

SOLR Powered

http://wiki.apache.org/solr/PublicServers/

• whitehouse.gov • eBay
• Instagram • The Guardian
• Apple • Netflix
• NASA • Shopper
• CISCO • News.com
• Disney • digg
• Sears • AOL

What is SOLR?

• Very fast full text search engine
http://lucene.apache.org/solr/
• Based on Apache Lucene - high-performance, full-
featured text search engine library written entirely in
Java.

In brief, Apache Solr exposes Lucene's Java API as
REST-like APIs which can be called over HTTP
from any programming language/platform
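As a rough illustration of calling Solr over HTTP, the sketch below builds a select URL and reads a JSON response in Python. The host, port, core name `posts` and the field names are assumptions, and the sample response is hand-written to mirror the shape of Solr's JSON output rather than captured from a server.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for a real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/posts/select"


def build_select_url(query, rows=10):
    """Return a select URL asking Solr for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_docs(raw_json):
    """Pull the hit count and the documents out of a Solr JSON response."""
    body = json.loads(raw_json)
    return body["response"]["numFound"], body["response"]["docs"]


# Hand-written sample in the shape Solr returns (not real server output).
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "42", "title": "Intro to Solr"}]}}
"""

url = build_select_url("title:solr")
num_found, docs = extract_docs(SAMPLE_RESPONSE)
print(url)
print(num_found, docs[0]["title"])
```

Any HTTP client can then fetch the URL; only the URL construction and the response parsing are shown here so the sketch runs without a server.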

Features
• Full Text Search
• Faceted navigation
• More Like This (recommendations) /
related searches
• Spell Suggest/Auto-Complete
• Custom document ranking/ordering
• Snippet generation/highlighting
• Geospatial Search

More Features ...

• Database integration
• Rich document (Word, PDF) handling
• REST-like HTTP/XML, JSON APIs (so,
you can code virtually in any language)
• Flexible configuration
• Extensive Plugin architecture for advanced
customization
• Scalable distributed search, dynamic
clustering, index replication

App Server Support

• Apache Tomcat
• Jetty
• Resin
• WebLogic™
• WebSphere™
• GlassFish
• dmServer™
• JBoss™ ... and many more

SOLR History

• Developed at CNET Networks by
Yonik Seeley
• Donated to ASF (Apache Software
Foundation) in early 2006
• Incubation period ended in January
2007 (v1.2 released later that year)
• Solr is now maintained as a
subproject of Lucene

Solr

• Only one table (documents). No joins.
• Each row is a document
• A document can have multiple fields and
fields can have multiple values
– e.g. Tags, Categories, ...

• Fast for search (finding the documents)
• Slow when returning large sets of data
• Can scale to many millions of documents
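To make the "one table of documents" model concrete, here is a small sketch of a document with single-valued and multi-valued fields. The field names and the `field_values` helper are hypothetical, chosen only for illustration.

```python
# A hypothetical blog-post document: flat fields, some holding several values.
doc = {
    "id": "post-1",
    "title": "Getting started with Solr",
    "tags": ["search", "lucene", "solr"],  # multi-valued field
    "categories": ["tutorial"],            # multi-valued field with one value
}


def field_values(document, field):
    """Return a field's values as a list, whether single- or multi-valued."""
    value = document.get(field, [])
    return value if isinstance(value, list) else [value]


print(field_values(doc, "tags"))
print(field_values(doc, "title"))
```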

Solr Architecture

• Servlet container: Jetty, Tomcat ... any :)
– Handles HTTP

• Solr
– Connectivity between Servlet and Lucene

• Lucene
– Full Text Search Framework

How Lucene Works

• Regular indexes repeat index data for each row

Key     ID
banana  1
banana  2
banana  3
cat     2
cat     3
dog     1
dog     3

• Inverted indexes reference the term once and then the matching documents

Term    IDs
banana  1,2,3
cat     2,3
dog     1,3
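The inverted index in this example can be rebuilt with a few lines of Python. This is a sketch of the idea only, not Lucene's actual data structure; the `documents` mapping is an assumption that reproduces which terms appear in documents 1 to 3.

```python
from collections import defaultdict

# The three example documents, reduced to the terms they contain.
documents = {
    1: ["banana", "dog"],
    2: ["banana", "cat"],
    3: ["banana", "cat", "dog"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


inverted = build_inverted_index(documents)
for term in sorted(inverted):
    print(term, inverted[term])
```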

Inverted Index Matching
• Lucene uses bit vectors to quickly find all documents with the terms

Term    IDs
banana  1,2,3
cat     2,3
dog     1,3

Query: cat banana
Document  1  2  3
cat       0  1  1
banana    1  1  1
Match     0  1  1

Query: dog cat
Document  1  2  3
dog       1  0  1
cat       0  1  1
Match     0  0  1
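The matching step can be sketched with Python integers used as bit sets: one bit per document, combined with a bitwise AND. This illustrates the idea on the slide and is not Lucene's implementation; `to_bits` and `match_all` are hypothetical helper names.

```python
NUM_DOCS = 3
postings = {"banana": [1, 2, 3], "cat": [2, 3], "dog": [1, 3]}


def to_bits(doc_ids):
    """Encode document IDs as an int bit set: bit (id - 1) is set per document."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def match_all(terms):
    """AND the bit sets of all terms and return the matching document IDs."""
    result = (1 << NUM_DOCS) - 1  # start with every document selected
    for term in terms:
        result &= to_bits(postings.get(term, []))
    return [d for d in range(1, NUM_DOCS + 1) if result & (1 << (d - 1))]


print(match_all(["cat", "banana"]))  # [2, 3]
print(match_all(["dog", "cat"]))     # [3]
```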

Scoring
• Now that the documents are found, in what order should
they be shown?
• Lucene uses TF-IDF (Term Frequency-
Inverse Document Frequency) to score the
documents

Term IDs

banana {1.28} 1 {2}, 2 {5}, 3 {1}

cat {1.60} 2 {4}, 3 {2}

dog {1.60} 1 {1}, 3 {6}

Scoring Notes

The goal of scoring is:
• To boost the importance of documents where
the word is mentioned often
• To boost the importance of rare words (that
don’t appear in many documents)
• Solr also supports term boosts, which increase the
importance of one term over another
(for example, title:solr^2)
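A minimal sketch of TF-IDF ranking, using the per-document counts from the Scoring slide. Two assumptions are made: the number in braces after each document ID is read as a term frequency, and a textbook smoothed formula, 1 + ln(N / df), is used for the inverse document frequency. That formula is not Lucene's exact one, so the weights differ from the values shown on the slide; only the resulting order is of interest.

```python
import math

NUM_DOCS = 3
# term -> {doc_id: term frequency}, taken from the counts on the Scoring slide
term_freqs = {
    "banana": {1: 2, 2: 5, 3: 1},
    "cat": {2: 4, 3: 2},
    "dog": {1: 1, 3: 6},
}


def idf(term):
    """Textbook smoothed IDF: rarer terms get a larger weight."""
    df = len(term_freqs[term])
    return 1.0 + math.log(NUM_DOCS / df)


def score(query_terms):
    """Sum tf * idf over the query terms for every matching document."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in term_freqs.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf(term)
    return scores


def rank(query_terms):
    """Return document IDs ordered from best to worst score."""
    scores = score(query_terms)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


print(rank(["dog", "cat"]))
print(rank(["banana"]))
```

Document 3 ranks first for "dog cat" because it contains both terms, with "dog" appearing six times.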

Stemming, Stopwords, Synonyms

• Terms are trimmed of suffixes
trimmed -> trim
stemming -> stem
• Stopword removal drops common words
that are not important for search
the, and, for, it, ...
• This is done with both the words in the
document and the query terms
• Solr supports searching with a predefined
synonym list
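The analysis steps above can be sketched as a toy pipeline. The stemmer here is a deliberately naive suffix stripper that handles the two examples on this slide and will mis-stem many other words; real analyzers use proper stemming algorithms. The synonym entry is hypothetical.

```python
STOPWORDS = {"the", "and", "for", "it"}
SYNONYMS = {"tv": "television"}  # hypothetical synonym list


def stem(word):
    """Strip a common suffix, then undo consonant doubling ("trimm" -> "trim")."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def analyze(text):
    """Lowercase, drop stopwords, map synonyms, and stem the remaining words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if not word or word in STOPWORDS:
            continue
        word = SYNONYMS.get(word, word)
        tokens.append(stem(word))
    return tokens


# The same function is applied to document text and to query text.
print(analyze("The trimmed TV and it"))
print(analyze("Stemming the query"))
```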

Configuring Solr

• schema.xml – Contains all of the details
about document structure, index-time and
query-time processing
• solrconfig.xml – Contains most of the
parameters for configuring Solr itself

QUERY SYNTAXES (RDBMS)

SELECT * FROM post WHERE
(topic LIKE '%apache%' OR author LIKE '%bambr%')
OR (topic LIKE '%solr%' OR author LIKE '%frank%')
ORDER BY id DESC

QUERY SYNTAXES (SOLR)
Topic:"The Right Way" AND author:WrongGuy

Querying Solr 1

• Plain text search
q=text:"I love android"
• Expanding search to more fields:
q=title:android AND type:review AND price:[* TO 500]
• Add facets
facet=true&facet.field=product&facet.field=rating
• Ordering results
sort=score desc, price asc
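A sketch of how these parameters might be assembled into a URL-encoded query string from Python. The query is written with the uppercase AND and TO operators that Solr's standard query parser expects, and a list of pairs is used because `facet.field` repeats. The field names come from the slide; everything else is an assumption.

```python
from urllib.parse import urlencode

# Repeated keys (facet.field) require a list of pairs rather than a dict.
params = [
    ("q", "title:android AND type:review AND price:[* TO 500]"),
    ("facet", "true"),
    ("facet.field", "product"),
    ("facet.field", "rating"),
    ("sort", "score desc, price asc"),
]

query_string = urlencode(params)
print(query_string)
```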

Querying Solr 2

• Add facets for range queries
facet.query=price:[* TO 100]
&facet.query=price:[100 TO 200]
&facet.query=price:[500 TO *]
• Limiting results
rows=15
• Paginating on results
start=25&rows=10
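Pagination and range facets both reduce to simple arithmetic and string building. The helpers below are hypothetical. Note that `price_bucket_queries` produces contiguous buckets from a list of boundaries, which differs slightly from the three ranges on the slide, and that square brackets make both ends inclusive, so a boundary value is counted in two buckets.

```python
def page_params(page, page_size):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"start": (page - 1) * page_size, "rows": page_size}


def price_bucket_queries(boundaries, field="price"):
    """Build facet.query strings covering everything below, between and above."""
    edges = ["*"] + [str(b) for b in boundaries] + ["*"]
    return [f"{field}:[{low} TO {high}]" for low, high in zip(edges, edges[1:])]


print(page_params(3, 10))
print(price_bucket_queries([100, 200, 500]))
```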

Querying Solr 3

Advanced query parameters:
• fq: filter query, restricts results without affecting the score
fq=type:review&fq=price:[* TO 500]
• fl: restrict the fields to be returned
fl=id,title,text
• hl: highlight matches and generate snippets
hl=true&hl.fl=title,text
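One way to combine these parameters in client code is sketched below. The `build_params` helper is hypothetical; it emits one `fq` per filter so that each filter can be handled independently by Solr.

```python
def build_params(q, filters=(), fields=(), highlight_fields=()):
    """Return request parameters as a list of (name, value) pairs."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]          # one fq per filter
    if fields:
        params.append(("fl", ",".join(fields)))     # fields to return
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return params


params = build_params(
    "text:android",
    filters=["type:review", "price:[* TO 500]"],
    fields=["id", "title", "text"],
    highlight_fields=["title", "text"],
)
for name, value in params:
    print(name, "=", value)
```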

Solr Caching

• External Caching : Memcached, etc.
• Internal Caching
Different types of cache:
1) FilterCache: Used by filter queries (fq),
sometimes for faceting too
2) QueryResultCache: Used for results
returned by generic queries
3) DocumentCache: Used for the stored fields
of fetched documents
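The idea behind the filter cache can be sketched as a small LRU cache that maps a filter query string to the set of matching document IDs. This is a conceptual illustration only: Solr's real caches are configured in solrconfig.xml, and `FAKE_INDEX` and `slow_lookup` are stand-ins for an actual index lookup.

```python
from collections import OrderedDict

# Stand-in for the index: which document IDs match each filter query.
FAKE_INDEX = {
    "type:review": frozenset({1, 2, 3}),
    "price:[* TO 500]": frozenset({2, 3, 4}),
}


def slow_lookup(fq):
    """Pretend to run the filter against the index."""
    return FAKE_INDEX.get(fq, frozenset())


class FilterCache:
    """Tiny LRU cache from a filter query string to its matching doc IDs."""

    def __init__(self, max_size, compute):
        self.max_size = max_size
        self.compute = compute
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, fq):
        if fq in self.entries:
            self.entries.move_to_end(fq)      # mark as recently used
            return self.entries[fq]
        self.misses += 1
        doc_ids = self.compute(fq)
        self.entries[fq] = doc_ids
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict least recently used
        return doc_ids


cache = FilterCache(max_size=2, compute=slow_lookup)
cache.get("type:review")
cache.get("type:review")  # served from the cache
both = cache.get("type:review") & cache.get("price:[* TO 500]")
print(cache.misses, sorted(both))
```

Because each filter is cached on its own, two requests that share one `fq` but differ in `q` can reuse the same cached set.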

Skype: dgolovko
dimtkg@gmail.com
