2. AGENDA
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Analyzers
Analysis Example
Demo
3. WHAT IS APACHE LUCENE?
Apache Lucene is an open-source, Java-based
full-text search engine.
Lucene is not a Web application, but rather a code
library and API that can easily be used to add search
capabilities to applications.
It is also described as an information retrieval (IR) library.
Lucene is independent of the file format. Text from
PDF, HTML, and Word documents can be indexed as
long as their textual content can be extracted.
5. INDEXING DOCUMENTS
What is Indexing?
1. Conversion to plain text (for PDF, HTML files, etc.)
2. Analysis (convert the text into tokens)
3. Indexing (map each token to the documents that contain it)
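
The three steps above can be sketched with a tiny in-memory inverted index. This is a simplified stand-in written against the Java standard library only, not Lucene's API; the class name TinyIndexSketch and its methods are hypothetical, and the analysis step is assumed to be plain lowercasing plus splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified stand-in for the indexing steps; this is not Lucene's API.
class TinyIndexSketch {
    // token -> sorted ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Step 2 (analysis): lowercase the text and split it on non-letters.
    static List<String> analyze(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String t : plainText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3 (indexing): record, for every token, which document it came from.
    int addDocument(String plainText) {
        int docId = documents.size();
        documents.add(plainText);
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Look up the ids of the documents that contain the given token.
    Set<Integer> postingsFor(String token) {
        return postings.getOrDefault(token, new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        // Step 1 (conversion to plain text) is assumed to have happened already.
        index.addDocument("Lucene is a search library");
        index.addDocument("Search engines index documents");
        System.out.println(index.postingsFor("search")); // prints [0, 1]
    }
}
```

Mapping tokens to document ids, rather than documents to tokens, is what lets a lookup go straight from a search term to the matching documents.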
6. SEARCHING DOCUMENTS
What is Searching?
1. Take the User Input
2. Create a query
3. Query the index
4. Return the results
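
The search steps above can be illustrated in the same spirit. The sketch below is self-contained and uses only the Java standard library, not Lucene's API; the class TinySearchSketch is hypothetical. It assumes that every query term must match (AND semantics) and returns hits in insertion order, whereas Lucene also scores and ranks hits, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-in for the search steps; this is not Lucene's API.
class TinySearchSketch {
    private final List<String> documents = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // The same analysis must be applied to documents and to queries.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build the index that the search will run against.
    void add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    List<String> search(String userInput) {               // 1. take the user input
        List<String> queryTerms = tokenize(userInput);    // 2. create a query
        List<String> results = new ArrayList<>();
        if (queryTerms.isEmpty()) {
            return results;
        }
        Set<Integer> hits = null;
        for (String term : queryTerms) {                  // 3. query the index
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);                     // keep docs matching all terms
            }
        }
        for (int docId : hits) {                          // 4. return the results
            results.add(documents.get(docId));
        }
        return results;
    }

    public static void main(String[] args) {
        TinySearchSketch searcher = new TinySearchSketch();
        searcher.add("Lucene is a search library");
        searcher.add("Search engines index documents");
        System.out.println(searcher.search("search LIBRARY")); // prints [Lucene is a search library]
    }
}
```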
8. ANALYZER
Tokenizes the input text
Common Analyzers
1. WhitespaceAnalyzer
Splits tokens on whitespace
2. SimpleAnalyzer
Splits tokens on non-letters, and then lowercases
3. StopAnalyzer
Same as SimpleAnalyzer, but also removes stop words
4. StandardAnalyzer
Most sophisticated analyzer: recognizes certain token types,
lowercases, and removes stop words (by default only in older Lucene
versions; recent releases use an empty default stop set)
9. ANALYSIS EXAMPLES
“Boost is the Secret of our Energy”
Whitespace Analyzer
[Boost][is][the][Secret][of][our][Energy]
Simple Analyzer
[boost][is][the][secret][of][our][energy]
Stop Analyzer
[boost][secret][our][energy]
Standard Analyzer (with the English stop words)
[boost][secret][our][energy]
Note: “our” is kept because it is not in Lucene's default English stop word set.
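
The differences between the first three analyzers are easier to see on text that contains punctuation. The sketch below imitates them with the Java standard library only; it does not call Lucene, the class AnalyzerSketch is hypothetical, and its short stop list is an assumption made for illustration. StandardAnalyzer is left out because its Unicode-aware tokenization is beyond a short sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified imitations of three analyzers; this is not Lucene's API.
class AnalyzerSketch {
    // Assumed stop list for illustration; Lucene ships its own, longer set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: same as simple(), but drop stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : simple(text)) {
            if (!STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Lucene's API is easy to use";
        System.out.println(whitespace(text)); // [Lucene's, API, is, easy, to, use]
        System.out.println(simple(text));     // [lucene, s, api, is, easy, to, use]
        System.out.println(stop(text));       // [lucene, s, api, easy, use]
    }
}
```

Note how splitting on non-letters breaks "Lucene's" into two tokens, while splitting on whitespace keeps it whole.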
10. DEMO OF SIMPLE INDEXING AND SEARCHING USING APACHE LUCENE