3. OpenNLP and Solr
for Superior Relevance
Steve Rowe
@steven_a_rowe
Sr. Software Engineer, Lucidworks
4. • Previously worked at the Center for Natural Language Processing
(CNLP) at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on Apache Lucene/Solr project
• Committer on JFlex scanner generator project
About me
5. • OpenNLP capabilities
• OpenNLP / Solr integration
• OpenNLP models: training and licensing
• Part-of-speech: what is it good for (absolutely/RB something/NN)
• Lemmatization versus stemming
• Solr configuration and demonstration of lemmatization and
named entity extraction
• Future Work
Agenda
6. • Machine learning: maximum-entropy and perceptron-based models
• Sentence segmentation
• Tokenization
• Part-of-speech (POS) tagging
• Lemmatization
• Named entity recognition (NER)
• Phrase chunking
• Parsing
• Co-reference resolution
• Document classification
Apache OpenNLP capabilities
7. • OpenNLP isn’t integrated with Solr in any release (yet)
• TDD (talk-driven development)
• LUCENE-2899: WIP patch, builds, works (demo later)
• Currently implemented:
• Sentence segmentation and tokenization
• Part-of-speech (POS) tagging
• Phrase chunking
• Dictionary-based lemmatization
• Named entity recognition (NER)
OpenNLP Solr integration
8. • Most OpenNLP phases can be trained, but each phase depends on the previous
ones.
• Publicly available models are based on data with non-free licenses.
• You can train your own models, and you very likely want to for production use.
• Example workflow:
• Use an existing model to tag your training data
• Modify the tagged data according to your needs
• One way to do that: the brat rapid annotation tool (OpenNLP understands its
output format)
• Run OpenNLP command-line training tools to create a model
• Run OpenNLP command-line evaluation tools to test model performance
• Repeat until you get the quality you want
OpenNLP models: training & licensing
9. • Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the implementation
• I modernized Lance’s patch and added support for dictionary-based
lemmatization
LUCENE-2899
10. • Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
• e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
• Can be produced by algorithm and/or known-item dictionary
• OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
• Caveat: part-of-speech tagging quality is poor on short query text
• Stems are not (necessarily) real words
• e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata
(Porter stemmer)
• produced via algorithm
Lemmatization vs. Stemming
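The stemming side of this comparison needs nothing beyond stock Solr. As a minimal sketch, a field type like the following applies the Porter stemmer; the field type name `text_en_porter` is hypothetical and chosen only for illustration, while the tokenizer and filter factories are the standard ones that ship with Solr.

```xml
<!-- Hypothetical field type name; stemming via the stock Porter stemmer. -->
<fieldType name="text_en_porter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase before stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Algorithmic stemming: "speaking" becomes "speak", but "spoke" is left as-is -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the stemmer looks only at the surface form of each token, it needs no part-of-speech tags, so the short-query caveat that applies to lemmatization does not apply here.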
11. Penn Treebank part of speech tags
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition/subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
12. Solr configuration
• Put jars on classpath
• Add required resources to configset:
• models
• lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis components,
then fields using these field types
13. Put jars on the classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp-.*\.jar"/>
14. Add required resources to configset
• Download models from
http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from
http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/
opennlp/
en-ner-person.bin
en-pos-maxent.bin
en-sent.bin
en-token.bin
language-tool-en-lemmatizer.txt
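With the jars and resources in place, the remaining step is a field type that chains the OpenNLP analysis components. The following is a sketch rather than the configuration shown in the talk: the factory class names and attributes are those of the OpenNLP integration as it was later released in Solr 7.3, and the work-in-progress LUCENE-2899 patch may name them differently. The field type name `text_en_opennlp_lemma` and the field name `body_lemma` are hypothetical; the resource paths assume the `conf/opennlp/` layout listed above.

```xml
<!-- Sketch only: factory names follow the OpenNLP integration as released in
     Solr 7.3; the patch demonstrated in this talk may differ. -->
<fieldType name="text_en_opennlp_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Sentence segmentation and tokenization -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
    <!-- Part-of-speech tagging; the tag is stored in each token's type attribute -->
    <filter class="solr.OpenNLPPOSFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <!-- Dictionary-based lemmatization, keyed on the token and its POS tag -->
    <filter class="solr.OpenNLPLemmatizerFilterFactory"
            dictionary="opennlp/language-tool-en-lemmatizer.txt"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the field type above -->
<field name="body_lemma" type="text_en_opennlp_lemma" indexed="true" stored="true"/>
```

The filter order matters: the lemmatizer relies on the part-of-speech tags produced by the filter before it, which in turn relies on sentence-aware tokenization.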
16. • (Switch to http://localhost:8983/solr here)
Demo
17. • Make Solr update request processors for named entity recognition,
maybe phrase chunker.
• Optimize memory usage to only process one sentence at a time.
• Commit/release LUCENE-2899!
Future Work
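As an illustration of the first item, a named entity update request processor could be configured roughly as follows. This is a sketch under assumptions: the class and parameter names are those of the processor that later shipped in Solr 7.3, not anything that existed at the time of this talk, and the chain name, field names, and field type name are hypothetical.

```xml
<!-- Sketch only: class and parameter names follow the processor that later
     shipped in Solr 7.3; chain, field, and field type names are hypothetical. -->
<updateRequestProcessorChain name="extract-people">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <!-- NER model from the configset's conf/opennlp/ directory -->
    <str name="modelFile">opennlp/en-ner-person.bin</str>
    <!-- Must name a schema field type whose analyzer uses an OpenNLP tokenizer -->
    <str name="analyzerFieldType">text_en_opennlp_tokens</str>
    <!-- Field to read text from, and field to write extracted names to -->
    <str name="source">body</str>
    <str name="dest">people_ss</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Running extraction in an update processor rather than in the analysis chain means the extracted entities become stored field values that can be faceted on and returned with results.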