Using OpenNLP with Solr to improve search
relevance and to extract named entities
Steve Rowe
Lucidworks
About me
• Previously worked at the Center for Natural
Language Processing at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on Apache Lucene/Solr project
• Committer on JFlex scanner generator project
Apache OpenNLP
• sentence segmentation
• tokenization
• part-of-speech tagging
• lemmatization
• named entity extraction
• phrase chunking
• parsing
• coreference resolution
• machine learning: maximum entropy and perceptron based
• caveat: model licensing: not Apache
Expectation Management
• OpenNLP isn’t integrated with Solr in any release
• LUCENE-2899: patches
• TDD (talk driven development)
• No Spanish - OpenNLP doesn’t publish Spanish models for
sentence splitting, tokenization, or part-of-speech tagging.
• No precision/recall/F-measure/MAP testing
LUCENE-2899
• Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the
implementation
• I modernized Lance’s patch and added
lemmatization support
Lemmatization vs. stemming
• Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
• e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
• Can be produced by algorithm and/or known-item dictionary
• OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
• Caveat: poor quality part-of-speech over short query text
• Stems are not (necessarily) real words
• e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
• produced via algorithm
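The contrast above can be sketched in a few lines of Python. This is a toy illustration only, not OpenNLP's lemmatizer and not the Porter stemmer: the dictionary entries, the tab-separated word/tag/lemma layout, and the suffix list are all assumptions made for the example.

```python
# Toy illustration only: neither OpenNLP's lemmatizer nor the Porter stemmer.

# Hypothetical dictionary lines in an assumed "word<TAB>tag<TAB>lemma" layout.
SAMPLE_DICTIONARY = (
    "speaking\tVBG\tspeak\n"
    "spoke\tVBD\tspeak\n"
    "stigmata\tNNS\tstigma\n"
)


def load_lemma_dictionary(text):
    """Parse dictionary text into a {(word, tag): lemma} lookup table."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, tag, lemma = line.split("\t")
        table[(word.lower(), tag)] = lemma
    return table


def lemmatize(word, tag, table):
    """Known-item lookup keyed on word and POS tag; unknown pairs pass through."""
    return table.get((word.lower(), tag), word.lower())


def naive_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemming algorithm)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


table = load_lemma_dictionary(SAMPLE_DICTIONARY)
for word, tag in [("Speaking", "VBG"), ("spoke", "VBD"), ("stigmata", "NNS")]:
    print(word, "lemma:", lemmatize(word, tag, table), "stem:", naive_stem(word))
```

The lookup needs the part-of-speech tag as part of its key, which is why poor tagging over short query text hurts lemmatization, while the suffix stripper needs no tag but leaves irregular forms such as spoke and stigmata unchanged.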
Penn Treebank part-of-speech tags
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition/subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
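Many of the tags above share a prefix (NN*, VB*, JJ*, RB*), which makes coarse grouping easy. The following is a minimal sketch, assuming tagged tokens arrive as (word, tag) pairs; the category names and the sample sentence are made up for illustration and are not part of the Solr integration.

```python
# Coarse grouping by tag prefix; the category names are assumptions for this sketch.
COARSE_PREFIXES = (
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
)


def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category, or "other"."""
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return "other"


def content_words(tagged_tokens):
    """Keep only tokens tagged as nouns, verbs, adjectives, or adverbs."""
    return [word for word, tag in tagged_tokens if coarse_category(tag) != "other"]


tagged = [("the", "DT"), ("speakers", "NNS"), ("spoke", "VBD"), ("quickly", "RB")]
print(content_words(tagged))
```

Note that WRB (Wh-adverb) and RP (particle) do not start with RB, so they fall into "other" here.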
Solr OpenNLP integration
• Put jars on classpath
• Add required resources to configset:
• models
• lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis
components, then fields using these field types
Put jars on classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp-.*\.jar" />
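The regex attribute limits which files in the directory are added to the classpath. Here is a small Python sketch of that filtering, assuming the pattern has to match the whole file name; the jar names are hypothetical, and the dot before jar is escaped so it matches a literal period.

```python
import re


def select_jars(file_names, pattern):
    """Return the file names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical directory listing, for illustration only.
names = [
    "opennlp-tools-1.6.0.jar",
    "opennlp-maxent-3.0.3.jar",
    "icu4j-56.1.jar",
    "README.txt",
]
print(select_jars(names, r"opennlp-.*\.jar"))
print(select_jars(names, r".*\.jar"))
```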
Add required resources to configset
• Download models from http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/
  opennlp/
    en-ner-person.bin
    en-pos-maxent.bin
    en-sent.bin
    en-token.bin
    language-tool-en-lemmatizer.txt
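Before creating field types that point at these files, it can help to confirm they are all in place. This is a minimal sketch, assuming the layout listed above; the helper name missing_resources is hypothetical, and the demonstration runs against a throwaway directory rather than a real configset.

```python
import tempfile
from pathlib import Path

# File names as listed for conf/opennlp/ above.
EXPECTED_RESOURCES = (
    "en-ner-person.bin",
    "en-pos-maxent.bin",
    "en-sent.bin",
    "en-token.bin",
    "language-tool-en-lemmatizer.txt",
)


def missing_resources(conf_dir):
    """Return the expected file names that are absent from conf_dir/opennlp."""
    resource_dir = Path(conf_dir) / "opennlp"
    return [name for name in EXPECTED_RESOURCES if not (resource_dir / name).is_file()]


# Demonstration against a temporary directory standing in for a configset's conf/.
with tempfile.TemporaryDirectory() as conf_dir:
    resource_dir = Path(conf_dir) / "opennlp"
    resource_dir.mkdir()
    (resource_dir / "en-sent.bin").touch()
    (resource_dir / "en-token.bin").touch()
    print(missing_resources(conf_dir))
```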
Add field types and fields
curl -X POST http://localhost:8983/solr/opennlp/schema \
  -H 'Content-type: application/json' --data-binary '{
  "add-field-type":{
    "name":"text_lemma",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer":{
      "tokenizer":{
        "class":"solr.OpenNLPTokenizerFactory",
        "sentenceModel":"opennlp/en-sent.bin",
        "tokenizerModel":"opennlp/en-token.bin"
      },
      "filters":[{
        "class":"solr.OpenNLPFilterFactory",
        "posTaggerModel":"opennlp/en-pos-maxent.bin"
      },{
        "class":"solr.OpenNLPLemmatizerFilterFactory",
        "dictionary":"opennlp/language-tool-en-lemmatizer.txt"
      }]}},
  "add-field":{
    "name":"content_lemma",
    "type":"text_lemma",
    "stored":true }
}'
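The same request can be assembled in Python, which avoids hand-quoting JSON inside a shell string. This is a sketch under assumptions: the function names build_schema_request and post_schema are hypothetical, the class names and paths are copied from the command above, and post_schema is defined but not called because it needs a running Solr with the LUCENE-2899 patch applied.

```python
import json
import urllib.request


def build_schema_request(type_name="text_lemma", field_name="content_lemma"):
    """Build the add-field-type / add-field payload shown in the curl command."""
    return {
        "add-field-type": {
            "name": type_name,
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                "tokenizer": {
                    "class": "solr.OpenNLPTokenizerFactory",
                    "sentenceModel": "opennlp/en-sent.bin",
                    "tokenizerModel": "opennlp/en-token.bin",
                },
                "filters": [
                    {
                        "class": "solr.OpenNLPFilterFactory",
                        "posTaggerModel": "opennlp/en-pos-maxent.bin",
                    },
                    {
                        "class": "solr.OpenNLPLemmatizerFilterFactory",
                        "dictionary": "opennlp/language-tool-en-lemmatizer.txt",
                    },
                ],
            },
        },
        "add-field": {"name": field_name, "type": type_name, "stored": True},
    }


def post_schema(payload, url="http://localhost:8983/solr/opennlp/schema"):
    """POST the payload to the Schema API (requires a running Solr; not called here)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_schema_request()
print(json.dumps(payload, indent=2))
```

The part-of-speech filter is listed before the lemmatizer filter, matching the order in the curl command.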
Next steps
• Switch tags from payloads to token “type” attribute
• Make Solr update request processors for named
entity extraction, maybe phrase chunker
• Commit/release LUCENE-2899!
