Using OpenNLP with Solr to improve search relevance and to extract named entities

Steve Rowe
Lucidworks

About me
• Previously worked at the Center for Natural
Language Processing at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on Apache Lucene/Solr project
• Committer on JFlex scanner generator project

Apache OpenNLP
• sentence segmentation
• tokenization
• part-of-speech tagging
• lemmatization
• named entity extraction
• phrase chunking
• parsing
• coreference resolution
• machine learning: maximum entropy and perceptron based
• caveat: the pre-trained models are not Apache-licensed

Expectation Management
• OpenNLP isn’t integrated with Solr in any release
• LUCENE-2899: patches
• TDD (talk driven development)
• No Spanish - OpenNLP doesn’t publish Spanish
models for sentence splitting, tokenization, or
part-of-speech tagging.
• No precision/recall/F-measure/MAP testing

LUCENE-2899
• Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the
implementation
• I modernized Lance’s patch and added
lemmatization support

Lemmatization vs. stemming
• Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
  • e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
  • Can be produced by algorithm and/or known-item dictionary
  • OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
  • Caveat: part-of-speech tagging quality is poor on short query text
• Stems are not (necessarily) real words
  • e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
  • produced via algorithm
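
To make the role of the part-of-speech tag concrete, here is a minimal Python sketch of dictionary-based lemmatization. It is not OpenNLP's implementation. It assumes a tab-separated "word, POS tag, lemma" dictionary layout, and the four entries, the function names, and the fallback to the lowercased word are all made up for illustration.

```python
# Toy dictionary lemmatizer: looks up (lowercased word, POS tag) -> lemma.
# The entries below are illustrative and not taken from a real dictionary file.
SAMPLE_DICTIONARY = (
    "speaking\tVBG\tspeak\n"
    "spoke\tVBD\tspeak\n"
    "spoke\tNN\tspoke\n"
    "stigmata\tNNS\tstigma\n"
)


def load_lemma_dictionary(text):
    """Parse tab-separated 'word<TAB>tag<TAB>lemma' lines into a lookup table."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, tag, lemma = line.split("\t")
        table[(word.lower(), tag)] = lemma
    return table


def lemmatize(tagged_tokens, table):
    """Return one lemma per (word, tag) pair.

    Unknown pairs fall back to the lowercased word (an assumption of this sketch).
    """
    return [table.get((word.lower(), tag), word.lower())
            for word, tag in tagged_tokens]


LEMMAS = load_lemma_dictionary(SAMPLE_DICTIONARY)
```

With this table, "spoke" tagged VBD maps to "speak", while "spoke" tagged NN (the part of a wheel) stays "spoke". This is why a wrong tag on short query text leads to a wrong lemma.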

Penn Treebank part of speech tags 
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition/subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

Solr OpenNLP integration
• Put jars on classpath
• Add required resources to configset:
• models
• lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis
components, then fields using these field types

Put jars on classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
     regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib"
     regex="opennlp-.*\.jar" />

Add required resources to configset
• Download models from http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/
  opennlp/
    en-ner-person.bin
    en-pos-maxent.bin
    en-sent.bin
    en-token.bin
    language-tool-en-lemmatizer.txt
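
Before uploading the configset, it can help to confirm that every resource named on this slide is in place. The following Python sketch does only that. The helper name `missing_resources` is hypothetical, and the file list is copied from the layout above.

```python
from pathlib import Path

# File names as listed on this slide, expected under conf/opennlp/.
REQUIRED_RESOURCES = [
    "en-ner-person.bin",
    "en-pos-maxent.bin",
    "en-sent.bin",
    "en-token.bin",
    "language-tool-en-lemmatizer.txt",
]


def missing_resources(conf_dir):
    """Return the required file names not found in <conf_dir>/opennlp/."""
    opennlp_dir = Path(conf_dir) / "opennlp"
    return [name for name in REQUIRED_RESOURCES
            if not (opennlp_dir / name).is_file()]
```

Calling `missing_resources("conf")` from the configset directory returns an empty list when all five files are present.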

Add field types and fields
curl -X POST http://localhost:8983/solr/opennlp/schema \
  -H 'Content-type: application/json' --data-binary '{
  "add-field-type":{
    "name":"text_lemma",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "tokenizer":{
        "class":"solr.OpenNLPTokenizerFactory",
        "sentenceModel":"opennlp/en-sent.bin",
        "tokenizerModel":"opennlp/en-token.bin"
      },
      "filters":[{
        "class":"solr.OpenNLPFilterFactory",
        "posTaggerModel":"opennlp/en-pos-maxent.bin"
      },{
        "class":"solr.OpenNLPLemmatizerFilterFactory",
        "dictionary":"opennlp/language-tool-en-lemmatizer.txt"
      }]}},
  "add-field":{
    "name":"content_lemma",
    "type":"text_lemma",
    "stored":true }
}'
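
Once the field type exists, one way to see what the analysis chain emits is Solr's field analysis handler. The Python sketch below builds such a request with the standard library. It assumes the `opennlp` collection and `text_lemma` field type from this slide, and that the `/analysis/field` handler is available on the collection. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL: same host, port, and collection as the curl command on this slide.
SOLR_COLLECTION_URL = "http://localhost:8983/solr/opennlp"


def analysis_url(text, field_type="text_lemma", base=SOLR_COLLECTION_URL):
    """Build a field-analysis request URL for the given text and field type."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    }
    return base + "/analysis/field?" + urlencode(params)


def fetch_analysis(text):
    """Send the request and return the parsed JSON (requires a running Solr)."""
    with urlopen(analysis_url(text)) as response:
        return json.load(response)
```

`analysis_url("She spoke")` only builds the URL. `fetch_analysis` needs a live Solr instance with the OpenNLP patch and resources installed.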

Next steps
• Switch tags from payloads to token “type” attribute
• Make Solr update request processors for named
entity extraction, maybe phrase chunker
• Commit/release LUCENE-2899!
