Webinar: OpenNLP and Solr for Superior Relevance

2016
OCTOBER 11-14 
BOSTON, MA
http://lucenerevolution.com

OpenNLP and Solr for Superior Relevance
Steve Rowe
@steven_a_rowe
Sr. Software Engineer, Lucidworks

About me
• Previously worked at the Center for Natural Language Processing (CNLP) at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on the Apache Lucene/Solr project
• Committer on the JFlex scanner generator project

Agenda
• OpenNLP capabilities
• OpenNLP / Solr integration
• OpenNLP models: training and licensing
• Part-of-speech: what is it good for (absolutely/RB something/NN)
• Lemmatization versus stemming
• Solr configuration and demonstration of lemmatization and named entity extraction
• Future work

Apache OpenNLP capabilities
• Machine learning: maximum entropy and perceptron based
• Sentence segmentation
• Tokenization
• Part-of-speech (POS) tagging
• Lemmatization
• Named entity recognition (NER)
• Phrase chunking
• Parsing
• Co-reference resolution
• Document classification

OpenNLP Solr integration
• OpenNLP isn’t integrated with Solr in any release (yet)
• TDD (talk-driven development)
• LUCENE-2899: WIP patch, builds, works (demo later)
• Currently implemented:
  • Sentence segmentation and tokenization
  • Part-of-speech (POS) tagging
  • Phrase chunking
  • Dictionary-based lemmatization
  • Named entity recognition (NER)

OpenNLP models: training & licensing
• Most OpenNLP phases can be trained, but each phase depends on the previous ones.
• Publicly available models are based on data with non-free licenses.
• You can train your own models, and you very likely want to for production use.
• Example workflow:
  • Use an existing model to tag your training data
  • Modify the tagged data according to your needs
    • One way to do that: the brat rapid annotation tool (OpenNLP understands its output format)
  • Run OpenNLP command-line training tools to create a model
  • Run OpenNLP command-line evaluation tools to test model performance
  • Repeat until you get the quality you want
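The evaluate-and-repeat loop in the workflow above comes down to comparing the entity spans a model predicts against hand-corrected ones. The sketch below is only an illustration of that comparison, not the OpenNLP evaluator itself. It assumes training data in the `<START:type> ... <END>` sentence markup used for OpenNLP name finder training, and the sample sentences and function names are hypothetical.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def parse_annotated(line: str) -> Tuple[List[str], Set[Span]]:
    """Split one whitespace-tokenized, annotated sentence into tokens and spans.

    Only typed markers of the form <START:type> ... <END> are handled here.
    """
    tokens: List[str] = []
    spans: Set[Span] = set()
    start, entity_type = None, None
    for piece in line.split():
        if piece.startswith("<START:") and piece.endswith(">"):
            start, entity_type = len(tokens), piece[len("<START:"):-1]
        elif piece == "<END>" and start is not None:
            spans.add((start, len(tokens), entity_type))
            start, entity_type = None, None
        else:
            tokens.append(piece)
    return tokens, spans


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span scoring: a prediction counts only if span and type agree."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: a hand-corrected sentence and the same sentence as tagged by a model.
gold_line = "<START:person> Steve Rowe <END> works at <START:organization> Lucidworks <END> ."
pred_line = "<START:person> Steve Rowe <END> works at Lucidworks ."

gold_tokens, gold_spans = parse_annotated(gold_line)
pred_tokens, pred_spans = parse_annotated(pred_line)
precision, recall, f1 = precision_recall_f1(gold_spans, pred_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model finds the person but misses the organization, so precision stays high while recall drops. Tracking these numbers across retraining rounds is one way to decide when to stop repeating the loop.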

LUCENE-2899
• Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the implementation
• I modernized Lance’s patch and added support for dictionary-based lemmatization

Lemmatization vs. Stemming
• Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
  • e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
  • Can be produced by algorithm and/or known-item dictionary
  • OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
  • Caveat: part-of-speech tagging quality is poor over short query text
• Stems are not (necessarily) real words
  • e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
  • Produced via algorithm
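To make the contrast concrete, here is a minimal sketch of both approaches. It is not the OpenNLP or Lucene code. The dictionary rows are hypothetical and assume a tab-separated word, POS tag, lemma layout. The stemmer is a toy suffix stripper standing in for Porter, chosen only because it reproduces the three examples on the slide.

```python
from typing import Dict, Tuple

# Hypothetical dictionary rows; assumed layout: word <TAB> POS tag <TAB> lemma.
DICTIONARY_TEXT = (
    "speaking\tVBG\tspeak\n"
    "spoke\tVBD\tspeak\n"
    "stigmata\tNNS\tstigma\n"
)


def load_dictionary(text: str) -> Dict[Tuple[str, str], str]:
    """Build a lookup table keyed on (lowercased word, POS tag)."""
    table: Dict[Tuple[str, str], str] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip malformed rows
        word, tag, lemma = parts
        table[(word.lower(), tag)] = lemma
    return table


def lemmatize(word: str, tag: str, table: Dict[Tuple[str, str], str]) -> str:
    """Dictionary lookup; falling back to the lowercased word is a choice made for this sketch."""
    return table.get((word.lower(), tag), word.lower())


def toy_stem(word: str) -> str:
    """Toy suffix stripper (NOT the Porter algorithm): no dictionary, no POS tag."""
    lowered = word.lower()
    for suffix in ("ing", "ed", "s"):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


table = load_dictionary(DICTIONARY_TEXT)
for word, tag in [("Speaking", "VBG"), ("spoke", "VBD"), ("stigmata", "NNS")]:
    print(word, "lemma:", lemmatize(word, tag, table), "stem:", toy_stem(word))
```

Because the lookup key includes the tag, a mis-tagged query term misses the dictionary: `lemmatize("spoke", "NN", table)` returns `spoke` rather than `speak`. That is the practical consequence of the part-of-speech caveat above.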

Penn Treebank part of speech tags
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition/subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
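The slides write tagged text as word/TAG pairs (for example absolutely/RB something/NN). As a rough sketch of how such output can be consumed, the helper below splits that notation and groups tags into coarse classes by their two-letter prefix. The function names and the grouping are illustrative assumptions, not part of OpenNLP.

```python
from typing import List, Tuple

# Coarse word classes keyed by Penn Treebank tag prefix (an illustrative grouping).
COARSE_CLASSES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    The split is on the last slash, so a token such as 1/2/CD keeps its word intact.
    """
    pairs: List[Tuple[str, str]] = []
    for item in text.split():
        word, separator, tag = item.rpartition("/")
        if not separator:
            word, tag = item, ""  # untagged token
        pairs.append((word, tag))
    return pairs


def coarse_class(tag: str) -> str:
    """Map NN/NNS/NNP/NNPS to 'noun', VB* to 'verb', and so on; everything else is 'other'."""
    return COARSE_CLASSES.get(tag[:2], "other")


tagged = parse_tagged("absolutely/RB something/NN")
print([(word, tag, coarse_class(tag)) for word, tag in tagged])
```

A filter built on `coarse_class` could, for instance, keep only nouns and verbs before indexing.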

Solr configuration
• Put jars on classpath
• Add required resources to configset:
  • models
  • lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis components, then fields using these field types

Put jars on the classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp-.*\.jar"/>

Add required resources to configset
• Download models from http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/
  opennlp/
    en-ner-person.bin
    en-pos-maxent.bin
    en-sent.bin
    en-token.bin
    language-tool-en-lemmatizer.txt

Add field type and fields
curl -X POST http://localhost:8983/solr/opennlp/schema \
  -H 'Content-type: application/json' --data-binary '{
  "add-field-type": {
    "name": "text_lemma",
    "class": "solr.TextField",
    "positionIncrementGap": "100",
    "analyzer": {
      "tokenizer": {
        "class": "solr.OpenNLPTokenizerFactory",
        "sentenceModel": "opennlp/en-sent.bin",
        "tokenizerModel": "opennlp/en-token.bin"
      },
      "filters": [{
        "class": "solr.OpenNLPFilterFactory",
        "posTaggerModel": "opennlp/en-pos-maxent.bin"
      }, {
        "class": "solr.OpenNLPLemmatizerFilterFactory",
        "dictionary": "opennlp/language-tool-en-lemmatizer.txt"
      }]
    }
  },
  "add-field": {
    "name": "content_lemma",
    "type": "text_lemma",
    "stored": true
  }
}'
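For anyone scripting this step instead of using curl, the same Schema API request can be assembled with the Python standard library. This is a sketch under the slide's assumptions: a local Solr at port 8983 with a collection named opennlp, and the factory class names from the LUCENE-2899 work-in-progress patch. The function name is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def build_schema_request(solr_url: str = "http://localhost:8983/solr",
                         collection: str = "opennlp") -> urllib.request.Request:
    """Assemble (but do not send) the POST that adds text_lemma and content_lemma."""
    payload = {
        "add-field-type": {
            "name": "text_lemma",
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                # Sentence segmentation and tokenization share one tokenizer.
                "tokenizer": {
                    "class": "solr.OpenNLPTokenizerFactory",
                    "sentenceModel": "opennlp/en-sent.bin",
                    "tokenizerModel": "opennlp/en-token.bin",
                },
                # POS tagging must run before the dictionary lemmatizer.
                "filters": [
                    {
                        "class": "solr.OpenNLPFilterFactory",
                        "posTaggerModel": "opennlp/en-pos-maxent.bin",
                    },
                    {
                        "class": "solr.OpenNLPLemmatizerFilterFactory",
                        "dictionary": "opennlp/language-tool-en-lemmatizer.txt",
                    },
                ],
            },
        },
        "add-field": {"name": "content_lemma", "type": "text_lemma", "stored": True},
    }
    return urllib.request.Request(
        url=f"{solr_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_schema_request()
print(request.get_method(), request.full_url)

# To send it against a running Solr that has the patch applied:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the payload as a dictionary and serializing it with `json.dumps` avoids the quoting mistakes that are easy to make when hand-editing JSON inside a shell string.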

Demo
• (Switch to http://localhost:8983/solr here)

Future Work
• Make Solr update request processors for named entity recognition, maybe phrase chunker.
• Optimize memory usage to only process one sentence at a time.
• Commit/release LUCENE-2899!

Resources
Solr: http://lucene.apache.org/solr
OpenNLP: http://opennlp.apache.org
LUCENE-2899: https://issues.apache.org/jira/browse/LUCENE-2899
OpenNLP pre-trained models: http://opennlp.sourceforge.net/models-1.5/
brat rapid annotation tool: http://brat.nlplab.org/index.html
LanguageTool lemma dictionaries: http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
Company: http://www.lucidworks.com
Our blog: http://www.lucidworks.com/blog
Twitter: @steven_a_rowe
