Custom Analyzer Using Lucene
1. Custom Analyzer in Lucene
Lucene/Solr Meetup
Ganesh.M
http://www.linkedin.com/in/gmurugappan
2. • Introduction to Analyzers
• Why we need a custom analyzer
• Use case / scenario
• Writing a custom analyzer
• Know your analyzer
3. • Analyzer: analyzes the given text and returns tokens, using a Tokenizer and TokenFilters
• Tokenizer: understands the language and breaks the text into tokens.
– WhitespaceTokenizer divides text at whitespace
– LetterTokenizer divides text at non-letter characters
– CJKTokenizer – tokenizer for Chinese, Japanese, and Korean text
• TokenFilter: adds, stems, or deletes tokens
– StopFilter – removes stop words
– PorterStemFilter – stems tokens using the Porter algorithm
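The Tokenizer → TokenFilter pipeline above can be sketched without Lucene at all; the stop-word set and method names below are illustrative only, not Lucene's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // Illustrative stop-word list; Lucene's English set is larger.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "to");

    // Tokenizer step: break the text at whitespace (like WhitespaceTokenizer).
    static List<String> tokenize(String text) {
        return new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // TokenFilter steps: lowercase each token, then drop stop words
    // (like LowerCaseFilter followed by StopFilter).
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filter(tokenize("The quick brown fox")));
    }
}
```

In real Lucene the same chaining happens inside an Analyzer: the Tokenizer produces the raw token stream and each TokenFilter wraps the stream before it.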
4. • Let's take the text
"The quick brown fox jumps over lazy dog"
Using StandardAnalyzer, it generates the following tokens
("the" is dropped as a stop word and every token is lowercased):
quick, brown, fox, jumps, over, lazy, dog
5. Know your analyzer
• It is important to choose the best analyzer for your fields.
• If you choose the wrong one, searches may not return the expected results.
• Whenever you are not getting the results you expect, check your Analyzer and query parser.
6. Lucene 3.x: the code below prints the tokens generated by the given analyzer

Analyzer analyzer = new SimpleAnalyzer();
TokenStream ts = analyzer.tokenStream("Field", new StringReader("Hello world-2013 "));
TermAttribute term = ts.getAttribute(TermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println("token: " + term.term());
}
ts.close();

(In Lucene 4.x, use CharTermAttribute and its toString() instead of the removed TermAttribute.)
7. The purpose of a custom analyzer
• Existing analyzers do not always solve our problem; sometimes we need to analyze text in a different way.
• A custom analyzer can reuse the existing built-in filters.
• It can also be used when parsing queries.
8. Use case
• Synonym injection / abbreviation expansion
– Add synonyms at indexing time.
– When parsing a resume, add related content for a keyword: if you find the text "lucene/solr", you could add "information retrieval" and "search engine".
– If you are searching medical documents, chat messages, etc., you need to expand the abbreviations / codes at indexing time.
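Injecting synonyms at index time amounts to expanding the token stream as it passes through a filter. A dependency-free sketch of the idea (the synonym map is a made-up example, not the SynonymAnalyzer shown later in the deck):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymSketch {
    // Hypothetical synonym map for the resume-parsing example above.
    static final Map<String, List<String>> SYNONYMS = Map.of(
        "lucene/solr", List.of("information retrieval", "search engine"));

    // For each token, emit the token itself plus any synonyms registered
    // for it, so a search for "search engine" also matches documents
    // that only say "lucene/solr".
    static List<String> inject(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            out.addAll(SYNONYMS.getOrDefault(t.toLowerCase(), List.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(inject(List.of("expert", "in", "lucene/solr")));
    }
}
```

In Lucene this expansion would live inside a custom TokenFilter, which can also set position increments so the injected terms occupy the same position as the original.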
9. • Strip XML / HTML tags and index only the content
<Address>
  <Street>123, MG Road</Street>
  <City>Bangalore</City>
  <State>Karnataka</State>
</Address>
10. • Break email IDs / URLs into multiple tokens
– Sachin Tendulkar
<sachin.tendulkar123@gmail.com>
– should be analyzed as
• sachin
• tendulkar
• sachin
• tendulkar123
• gmail
• com
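The address part of the breakdown above is simply the text split at every non-alphanumeric character. A plain-Java sketch (not a Lucene Tokenizer; the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class EmailTokenSketch {
    // Split an email ID (or URL) at every character that is not a letter
    // or a digit, dropping the separators (".", "@", "/", ...).
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("sachin.tendulkar123@gmail.com"));
    }
}
```

In a real custom analyzer, the same behavior could come from a Tokenizer that treats non-alphanumeric characters as token boundaries.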
11. HTMLAnalyzer in Lucene 4.5

public class HTMLAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        HTMLStripCharFilter htmlFilter = new HTMLStripCharFilter(reader);
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, htmlFilter);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_45, tokenizer);
        return new TokenStreamComponents(tokenizer, result);
    }
}
13. SynonymAnalyzer
• SynonymAnalyzer injects synonyms into the indexed content (built with Lucene 3.3).
• Check out the code:
https://github.com/geekganesh/SynonymAnalyzer
14. PerFieldAnalyzerWrapper
• IndexWriter / IndexWriterConfig takes only one Analyzer and uses it for all fields.
• When we have multiple fields and each field should be indexed with a specific analyzer, we need PerFieldAnalyzerWrapper.
• PerFieldAnalyzerWrapper holds a different analyzer per field; the wrapper is what gets passed to the IndexWriter.
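Behind the scenes, PerFieldAnalyzerWrapper just looks up an analyzer by field name and falls back to a default when the field has no override. The dispatch idea can be sketched without Lucene (the field names and toy "analyzers" below are illustrative only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    // Default "analyzer": lowercase and split at whitespace.
    static final Function<String, List<String>> DEFAULT =
        text -> Arrays.asList(text.toLowerCase().split("\\s+"));

    // Per-field overrides; any field not listed falls back to the default,
    // mirroring PerFieldAnalyzerWrapper(defaultAnalyzer, fieldAnalyzers).
    static final Map<String, Function<String, List<String>>> PER_FIELD = Map.of(
        // Hypothetical "email" field: break at non-alphanumeric characters.
        "email", text -> Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));

    static List<String> analyze(String field, String text) {
        return PER_FIELD.getOrDefault(field, DEFAULT).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("body", "Hello World"));
        System.out.println(analyze("email", "a.b@c.com"));
    }
}
```

With real Lucene, the equivalent wiring is a Map<String, Analyzer> of overrides plus a default Analyzer, wrapped in a PerFieldAnalyzerWrapper that is handed to IndexWriterConfig.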