Apache Solr is an enterprise search engine. It indexes large numbers of documents of any size and provides robust search techniques. This presentation gives a brief introduction to it.
2. Agenda
• What is Solr?
• Features of Solr
• High-level Architecture of Solr
• How do things work in Solr?
• What is Fuzzy Search?
• How does Solr perform?
3. Solr - Introduction
• An open-source enterprise search platform from Apache.
• A full-text search server that runs in servlet containers such as Tomcat or Jetty.
• Indexes input files and provides various search facilities over them.
• Uses the Lucene Java search library at its core.
Type of Tool: Search and Index API
Documentation: http://lucene.apache.org/solr/4_3_0/
License Type: Apache License 2.0
Last Release Date: 6 May 2013 (4.3.0)
Release Frequency: approximately 1 month
Mailing List/Community Support: http://lucene.apache.org/solr/discussion.html
Major Applications/Users: Instagram, AOL, the Guardian, Shopper.com, SourceForge, eBay
Stability: Stable version "4.3.0"
4. Solr - Features
• Faceted Search
• Accepts input as XML, CSV, or JSON files, and from databases.
• Using Apache Tika, supports more than 25 input formats such as PDF and MS Word.
• JSON, XML, PHP, Ruby, Python, and custom Java binary output formats.
• Scalable in the form of SolrCloud.
• Supports 32 major languages, including Chinese, Korean, Japanese, and Arabic.
• Boosting of results
• Extensible plugin architecture
• HTML Administration Interface
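As a rough illustration of how faceting, boosting, and the JSON output format combine in one request, the sketch below only builds a query URL; it does not contact a server. The host, port, core name (`collection1`), and the field names `title` and `category` are assumptions made for this example and will differ per installation. The parameters `q`, `wt`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_faceted_query(text, facet_field, boost_field=None, boost=2):
    """Return a select URL asking for JSON output and facet counts."""
    query = text
    if boost_field is not None:
        # "field:term^N" raises the weight of matches in that field.
        query = f"{boost_field}:{text}^{boost} OR {text}"
    params = [
        ("q", query),
        ("wt", "json"),          # response format
        ("facet", "true"),       # switch faceting on
        ("facet.field", facet_field),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Field names "title" and "category" are illustrative only.
url = build_faceted_query("laptop", "category", boost_field="title")
print(url)
```

The resulting URL could be opened in a browser or passed to any HTTP client; the facet counts would appear alongside the matching documents in the JSON response.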
8. Solr – Fuzzy Search
• It is the technique of finding strings that match a pattern approximately (rather than exactly).
• It is used to find documents that contain words spelled similarly to the search term. Example: if you search for "appple", the search engine also returns documents containing the term "apple".
• Used in spell checking, spam filtering, and OCR scanning.
• Solr's standard query parser supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm.
• Closeness of a match is based on the edit distance between words: the number of single-character edits (insertions, deletions, substitutions) required to convert one word into another.
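The edit distance described above can be sketched with the classic dynamic-programming formulation of Levenshtein distance. This is a minimal illustration of the idea, not Solr's or Lucene's actual implementation, which is far more optimized and may count edits slightly differently (for example, by treating a transposition as one edit). The helper `fuzzy_match` and the sample vocabulary are hypothetical. In Solr's query syntax the equivalent request is written with a trailing tilde, e.g. `appple~1`.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(term, vocabulary, max_edits=2):
    """Return the words within max_edits of term (hypothetical helper)."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]


vocabulary = ["apple", "apply", "maple", "banana"]
matches = fuzzy_match("appple", vocabulary, max_edits=1)
print(matches)  # ['apple']
```

With `max_edits=1` only "apple" qualifies, because removing one "p" from "appple" is a single edit; "apply" needs a second edit and is excluded.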
9. Solr - Performance
• Indexing – The time taken for indexing depends on:
    - Size and number of fields in each document
    - Number of fields to be indexed
    - Type of fields
    - Machine capabilities (CPU, memory)
  With documents of ~1 KB each, indexing 100 million documents should take roughly a few hours.
• Query – With 100 million documents spread over 10 Solr nodes (10 million documents each), the average search response time is ~1 second.
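To put these figures in perspective, the arithmetic below works out the raw data volume, the per-node document count, and the indexing throughput the estimate implies. The slide says only "a few hours"; the value of 3 hours used here is an assumption chosen for illustration, and 1 KB is taken as 1000 bytes.

```python
# Figures taken from the slide.
NUM_DOCS = 100_000_000
DOC_SIZE_BYTES = 1_000   # ~1 KB, taken as 1000 bytes
NUM_NODES = 10

# Assumption (not from the slide): "a few hours" is read as 3 hours.
ASSUMED_INDEXING_HOURS = 3

total_gb = NUM_DOCS * DOC_SIZE_BYTES / 1e9
docs_per_second = NUM_DOCS / (ASSUMED_INDEXING_HOURS * 3600)
mb_per_second = docs_per_second * DOC_SIZE_BYTES / 1e6
docs_per_node = NUM_DOCS // NUM_NODES

print(f"raw input volume:   {total_gb:.0f} GB")
print(f"implied throughput: {docs_per_second:.0f} docs/s "
      f"({mb_per_second:.1f} MB/s)")
print(f"documents per node: {docs_per_node:,}")
```

Under that assumption the estimate corresponds to roughly 9,000 documents per second across the whole cluster; actual rates depend on the factors listed above.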