Custom Analyzer in Lucene
Lucene/Solr Meetup
Ganesh.M
http://www.linkedin.com/in/gmurugappan
• Introduction to Analyzer
• Why we require Custom Analyzer
• Use case / Scenario
• Writing custom analyzer
• Know your analyzer
• Analyzer: analyzes the given text and returns tokens, using a Tokenizer and TokenFilters
• Tokenizer: breaks the text into tokens, following rules suited to the language
– WhitespaceTokenizer divides text at whitespace
– LetterTokenizer divides text at non-letters
– CJKTokenizer tokenizes Chinese, Japanese, and Korean text
• TokenFilter: adds, stems, or deletes tokens
– StopFilter removes stop words
– PorterStemFilter reduces each token to its stem
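The tokenizer-then-filter flow described above can be sketched in plain Java. This is only an illustration of the idea, not Lucene's API: the class name `PipelineSketch`, its methods, and the stop-word list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PipelineSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // "Tokenizer" stage: break the text at runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "TokenFilter" stage: lowercase each token and drop stop words.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) out.add(lower);
        }
        return out;
    }

    // "Analyzer": the tokenizer followed by the filter chain.
    static List<String> analyze(String text) {
        return filter(tokenize(text));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumps over lazy dog"));
    }
}
```

In Lucene itself each stage is a separate class and the stages are chained by wrapping one in the next, but the division of labour is the same.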
• Consider the text
“The quick brown fox jumps over lazy dog”
Using StandardAnalyzer, it will generate the following tokens (lowercased, with the stop word “the” removed):
quick brown
fox jumps
over lazy
dog
Know your analyzer
• It is important to choose the best analyzer for each of your fields.
• If you choose the wrong one, searches may not return the expected results.
• If you are not getting the results you expect, check your Analyzer and query parser.
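Lucene 3.x: the code below prints the tokens generated by a given analyzer
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

Analyzer analyzer = new SimpleAnalyzer();
TokenStream ts = analyzer.tokenStream("Field", new StringReader("Hello world-2013 "));
// TermAttribute is the Lucene 3.0 API; from 3.1 onward CharTermAttribute replaces it.
TermAttribute term = ts.getAttribute(TermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println("token: " + term.term());
}
ts.end();   // finish the stream before releasing it
ts.close();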
The purpose of a custom analyzer
• The existing analyzers do not always serve our purpose; sometimes we need to analyze text in a different way.
• A custom analyzer can reuse the existing built-in filters.
• It can also be used for parsing queries.
Use case
• Synonym injection / abbreviation expansion
– Add synonyms at indexing time.
– When parsing a resume, add related content for a keyword. If you find the text “lucene/solr”, you could add “information retrieval” and “search engine”.
– If you are searching medical documents, chat messages, etc., you need to expand abbreviations and codes at indexing time.
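As a rough illustration of index-time synonym injection, the plain-Java sketch below emits each token followed by any related terms registered for it. The class `SynonymInjectionSketch` and the contents of its synonym table are hypothetical and only serve the example; this is not the SynonymAnalyzer code referenced later in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SynonymInjectionSketch {
    // Hypothetical synonym table; a real one would be loaded from a file or service.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("lucene", Arrays.asList("information retrieval", "search engine"));
        SYNONYMS.put("bp", Arrays.asList("blood pressure"));
    }

    // Emit every original token, followed by any synonyms registered for it.
    static List<String> inject(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            List<String> extra = SYNONYMS.get(token);
            if (extra != null) out.addAll(extra);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(inject(Arrays.asList("lucene", "developer")));
    }
}
```

In a Lucene TokenFilter the injected terms are conventionally emitted at the same position as the original token (position increment 0), so that phrase queries still line up; the flat list here ignores positions for brevity.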
• Strip XML / HTML tags and index only the content
<Address>
<Street>123, MG Road</Street>
<City>Bangalore</City>
<State>Karnataka</State>
</Address>
• Break an email ID / URL into multiple tokens
– Sachin Tendulkar
<sachin.tendulkar123@gmail.com>
– Should be analyzed as
• sachin
• tendulkar
• sachin
• tendulkar123
• gmail
• com
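One simple way to get the token list above is to lowercase the input and split it at every run of characters that is not a letter or digit. The sketch below does that in plain Java; the class `EmailTokenSketch` is an assumption for illustration, handles ASCII only, and is not a Lucene tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class EmailTokenSketch {
    // Lowercase the input, then break it at every run of characters
    // that is not an ASCII letter or digit (spaces, '<', '.', '@', '>').
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Sachin Tendulkar <sachin.tendulkar123@gmail.com>"));
    }
}
```

A production analyzer would probably also keep the full address as a single token so that exact-match searches on the email still work.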
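HTMLAnalyzer in Lucene 4.5
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class HTMLAnalyzer extends Analyzer {
    // Strip markup in initReader so it is also applied when the components are reused.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_45, tokenizer);
        return new TokenStreamComponents(tokenizer, result);
    }
}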
HTMLAnalyzer in Solr
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
SynonymAnalyzer
• SynonymAnalyzer injects synonyms as part of the indexed content, using Lucene 3.3
• Check out the code:
https://github.com/geekganesh/SynonymAnalyzer
PerFieldAnalyzerWrapper
• IndexWriter / IndexWriterConfig takes only one Analyzer and uses it for all fields.
• When a document has multiple fields and each field should be indexed with its own analyzer, we need PerFieldAnalyzerWrapper.
• PerFieldAnalyzerWrapper maps field names to analyzers, with a default for unmapped fields. The wrapper itself is what gets passed to IndexWriter.
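The per-field dispatch idea can be sketched in plain Java as a map from field name to analysis function with a default fallback. This only mirrors the concept behind PerFieldAnalyzerWrapper and is not Lucene's class; `PerFieldSketch`, the field names `id` and `body`, and the two analysis functions are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    // Use the analyzer registered for the field, or fall back to the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Hypothetical setup: "id" is kept as one token, everything else is
    // lowercased and split at whitespace.
    static PerFieldSketch example() {
        Map<String, Function<String, List<String>>> byField = new HashMap<>();
        byField.put("id", text -> Arrays.asList(text));
        return new PerFieldSketch(
                text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")),
                byField);
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = example();
        System.out.println(wrapper.analyze("id", "AB-123"));
        System.out.println(wrapper.analyze("body", "Hello World"));
    }
}
```

The caller only ever holds one object, which is why a single wrapper can be handed to a component that accepts just one analyzer.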
