SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
2016
OCTOBER 11-14

BOSTON, MA
http://lucenerevolution.com
OpenNLP and Solr
for Superior Relevance
Steve Rowe
@steven_a_rowe
Sr. Software Engineer, Lucidworks
• Previously worked at the Center for Natural Language Processing
(CNLP) at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on Apache Lucene/Solr project
• Committer on JFlex scanner generator project
About me
• OpenNLP capabilities
• OpenNLP / Solr integration
• OpenNLP models: training and licensing
• Part-of-speech: what is it good for (absolutely/RB something/NN)
• Lemmatization versus stemming
• Solr configuration and demonstration of lemmatization and
named entity extraction
• Future Work
Agenda
• Machine learning: maximum entropy and perceptron based
• Sentence segmentation
• Tokenization
• Part-of-speech (POS) tagging
• Lemmatization
• Named entity recognition (NER)
• Phrase chunking
• Parsing
• Co-reference resolution
• Document classification
Apache OpenNLP capabilities
• OpenNLP isn’t integrated with Solr in any release (yet)
• TDD (talk driven development)
• LUCENE-2899: WIP patch, builds, works (demo later)
• Currently implemented:
• Sentence segmentation and tokenization
• Part-of-speech (POS) tagging
• Phrase chunking
• Dictionary-based lemmatization
• Named entity recognition (NER)
OpenNLP Solr integration
• Most OpenNLP phases can be trained, but each phase depends on the previous
ones.
• Publicly available models are based on data with non-free licenses.
• You can train your own models, and you very likely want to for production use.
• Example workflow:
• Use an existing model to tag your training data
• Modify the tagged data according to your needs
• One way to do that: the brat rapid annotation tool (OpenNLP understands its
output format)
• Run OpenNLP command-line training tools to create a model
• Run OpenNLP command-line evaluation tools to test model performance
• Repeat until you get the quality you want
OpenNLP models: training & licensing
• Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the implementation
• I modernized Lance’s patch and added support for dictionary-based
lemmatization
LUCENE-2899
• Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
• e.g. Speaking/VBG, spoke/VB -> speak; stigmata/NNS -> stigma
• Can be produced by algorithm and/or known-item dictionary
• OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
• Caveat: poor quality part-of-speech over short query text
• Stems are not (necessarily) real words
• e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata

(Porter stemmer)
• produced via algorithm
Lemmatization vs. Stemming
Penn Treebank part of speech tags
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition/subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Solr configuration
• Put jars on classpath
• Add required resources to configset:
• models
• lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis components,
then fields using these field types
Put jars on the classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}

/contrib/analysis-extras/lucene-libs" regex=".*.jar"/>
<lib dir="${solr.install.dir:../../../..}

/contrib/analysis-extras/lib" regex="opennlp-.*.jar"/>
Add required resources to configset
• Download models from 

http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from 

http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/

opennlp/

en-ner-person.bin

en-pos-maxent.bin

en-sent.bin

en-token.bin

language-tool-en-lemmatizer.txt
Add field type and fields
curl -X POST http://localhost:8983/solr/opennlp/schema 

-H 'Content-type: application/json' --data-binary '{

"add-field-type":{

"name":"text_lemma",

"class":"solr.TextField",

"positionIncrementGap":"100",

"analyzer":{

"tokenizer":{

"class":"solr.OpenNLPTokenizerFactory",

"sentenceModel":"opennlp/en-sent.bin",

"tokenizerModel":"opennlp/en-token.bin"

},

"filters":[{

"class":"solr.OpenNLPFilterFactory",

"posTaggerModel":"opennlp/en-pos-maxent.bin"

},{

"class":"solr.OpenNLPLemmatizerFilterFactory",

"dictionary":"opennlp/language-tool-en-lemmatizer.txt"

}]}},

"add-field":{

"name":"content_lemma",

"type":"text_lemma",

“stored":true }

}'
• (Switch to http://localhost:8983/solr here)
Demo
• Make Solr update request processors for named entity recognition,
maybe phrase chunker.
• Optimize memory usage to only process one sentence at a time.
• Commit/release LUCENE-2899!
Future Work
Resources
Solr: http://lucene.apache.org/solr
OpenNLP: http://opennlp.apache.org
LUCENE-2899: https://issues.apache.org/jira/browse/LUCENE-2899
OpenNLP pre-trained models: http://opennlp.sourceforge.net/models-1.5/
brat rapid annotation tool: http://brat.nlplab.org/index.html
LanguageTool lemma dictionaries: http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
Company: http://www.lucidworks.com
Our blog: http://www.lucidworks.com/blog
Twitter: @steven_a_rowe

Contenu connexe

Tendances

"Hands Off! Best Practices for Code Hand Offs"
"Hands Off!  Best Practices for Code Hand Offs""Hands Off!  Best Practices for Code Hand Offs"
"Hands Off! Best Practices for Code Hand Offs"
Naomi Dushay
 
Perl%20Tutorial.!Picking%20Up%20Perl
Perl%20Tutorial.!Picking%20Up%20PerlPerl%20Tutorial.!Picking%20Up%20Perl
Perl%20Tutorial.!Picking%20Up%20Perl
tutorialsruby
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
Jeremy Coates
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDB
Mihnea Giurgea
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
gagravarr
 

Tendances (20)

Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Ld4 l triannon
Ld4 l triannonLd4 l triannon
Ld4 l triannon
 
"Hands Off! Best Practices for Code Hand Offs"
"Hands Off!  Best Practices for Code Hand Offs""Hands Off!  Best Practices for Code Hand Offs"
"Hands Off! Best Practices for Code Hand Offs"
 
Perl%20Tutorial.!Picking%20Up%20Perl
Perl%20Tutorial.!Picking%20Up%20PerlPerl%20Tutorial.!Picking%20Up%20Perl
Perl%20Tutorial.!Picking%20Up%20Perl
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
Intelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDBIntelligent Stream Filtering Using MongoDB
Intelligent Stream Filtering Using MongoDB
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape Meetup
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
Quepy
QuepyQuepy
Quepy
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 

En vedette

Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Lucidworks
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Lucidworks
 

En vedette (20)

Sentiment Analysis Using Solr
Sentiment Analysis Using SolrSentiment Analysis Using Solr
Sentiment Analysis Using Solr
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
Advanced Search with Solr & django-haystack
Advanced Search with Solr & django-haystackAdvanced Search with Solr & django-haystack
Advanced Search with Solr & django-haystack
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
 
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and Recommenders
 
Webinar: Fusion for Business Intelligence
Webinar: Fusion for Business IntelligenceWebinar: Fusion for Business Intelligence
Webinar: Fusion for Business Intelligence
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyond
 
Understanding the Solr security framework - Lucene Solr Revolution 2015
Understanding the Solr security framework - Lucene Solr Revolution 2015Understanding the Solr security framework - Lucene Solr Revolution 2015
Understanding the Solr security framework - Lucene Solr Revolution 2015
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6
 
What's New in Apache Solr 4.10
What's New in Apache Solr 4.10What's New in Apache Solr 4.10
What's New in Apache Solr 4.10
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...
Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...
Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...
 
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkWebinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
 
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon ConsultingSolr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of Collections
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Ease of use in Apache Solr
Ease of use in Apache SolrEase of use in Apache Solr
Ease of use in Apache Solr
 

Similaire à Webinar: OpenNLP and Solr for Superior Relevance

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
Sebastian Hellmann
 

Similaire à Webinar: OpenNLP and Solr for Superior Relevance (20)

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This Works
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
 
Bay Area NLP Reading Group - 7.12.16
Bay Area NLP Reading Group - 7.12.16 Bay Area NLP Reading Group - 7.12.16
Bay Area NLP Reading Group - 7.12.16
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Recent trends in natural language processing
Recent trends in natural language processingRecent trends in natural language processing
Recent trends in natural language processing
 
A Corpus-based Approach to Tracking L2 Development
A Corpus-based Approach to Tracking L2 DevelopmentA Corpus-based Approach to Tracking L2 Development
A Corpus-based Approach to Tracking L2 Development
 
CO620
CO620CO620
CO620
 
The Semantic Web meets the Code of Federal Regulations
The Semantic Web meets the Code of Federal RegulationsThe Semantic Web meets the Code of Federal Regulations
The Semantic Web meets the Code of Federal Regulations
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
introtonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdfintrotonlp-190218095523 (1).pdf
introtonlp-190218095523 (1).pdf
 
தமிழ்க்கணிமை கட்டமைப்பு
தமிழ்க்கணிமை கட்டமைப்புதமிழ்க்கணிமை கட்டமைப்பு
தமிழ்க்கணிமை கட்டமைப்பு
 
Building NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML GroupBuilding NLP solutions for Davidson ML Group
Building NLP solutions for Davidson ML Group
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learning
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 

Plus de Lucidworks

Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Lucidworks
 

Plus de Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Webinar: OpenNLP and Solr for Superior Relevance

  • 1.
  • 3. OpenNLP and Solr for Superior Relevance Steve Rowe @steven_a_rowe Sr. Software Engineer, Lucidworks
  • 4. • Previously worked at the Center for Natural Language Processing (CNLP) at Syracuse University • Sr. Software Engineer at Lucidworks • Committer on Apache Lucene/Solr project • Committer on JFlex scanner generator project About me
  • 5. • OpenNLP capabilities • OpenNLP / Solr integration • OpenNLP models: training and licensing • Part-of-speech: what is it good for (absolutely/RB something/NN) • Lemmatization versus stemming • Solr configuration and demonstration of lemmatization and named entity extraction • Future Work Agenda
  • 6. • Machine learning: maximum entropy and perceptron based • Sentence segmentation • Tokenization • Part-of-speech (POS) tagging • Lemmatization • Named entity recognition (NER) • Phrase chunking • Parsing • Co-reference resolution • Document classification Apache OpenNLP capabilities
  • 7. • OpenNLP isn’t integrated with Solr in any release (yet) • TDD (talk driven development) • LUCENE-2899: WIP patch, builds, works (demo later) • Currently implemented: • Sentence segmentation and tokenization • Part-of-speech (POS) tagging • Phrase chunking • Dictionary-based lemmatization • Named entity recognition (NER) OpenNLP Solr integration
  • 8. • Most OpenNLP phases can be trained, but each phase depends on the previous ones. • Publicly available models are based on data with non-free licenses. • You can train your own models, and you very likely want to for production use. • Example workflow: • Use an existing model to tag your training data • Modify the tagged data according to your needs • One way to do that: the brat rapid annotation tool (OpenNLP understands its output format) • Run OpenNLP command-line training tools to create a model • Run OpenNLP command-line evaluation tools to test model performance • Repeat until you get the quality you want OpenNLP models: training & licensing
  • 9. • Created: 30/Jan/11 10:44 <- over 5 years old • Lance Norskog wrote the bulk of the implementation • I modernized Lance’s patch and added support for dictionary-based lemmatization LUCENE-2899
  • 10. • Both can be used with search to increase recall • Lemmas are real words: infinitive verbs, singular nouns • e.g. Speaking/VBG, spoke/VB -> speak; stigmata/NNS -> stigma • Can be produced by algorithm and/or known-item dictionary • OpenNLP 1.6.1 will include a machine-learned lemmatization implementation • Caveat: poor quality part-of-speech over short query text • Stems are not (necessarily) real words • e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata
 (Porter stemmer) • produced via algorithm Lemmatization vs. Stemming
  • 11. Penn Treebank part of speech tags PRP$ Possessive pronoun RB Adverb RBR Adverb, comparative RBS Adverb, superlative RP Particle SYM Symbol TO to UH Interjection VB Verb, base form VBD Verb, past tense VBG Verb, gerund or present participle VBN Verb, past participle VBP Verb, non-3rd person singular present VBZ Verb, 3rd person singular present WDT Wh-determiner WP Wh-pronoun WP$ Possessive wh-pronoun WRB Wh-adverb CC Coordinating conjunction CD Cardinal number DT Determiner EX Existential there FW Foreign word IN Preposition/subordinating conjunction JJ Adjective JJR Adjective, comparative JJS Adjective, superlative LS List item marker MD Modal NN Noun, singular or mass NNS Noun, plural NNP Proper noun, singular NNPS Proper noun, plural PDT Predeterminer POS Possessive ending PRP Personal pronoun https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
  • 12. Solr configuration • Put jars on classpath • Add required resources to configset: • models • lemmatization dictionary • Add field type(s) using OpenNLP-based analysis components, then fields using these field types
  • 13. Put jars on the classpath • Add to configset’s solrconfig.xml: <lib dir="${solr.install.dir:../../../..}
 /contrib/analysis-extras/lucene-libs" regex=".*.jar"/> <lib dir="${solr.install.dir:../../../..}
 /contrib/analysis-extras/lib" regex="opennlp-.*.jar"/>
  • 14. Add required resources to configset • Download models from 
 http://opennlp.sourceforge.net/models-1.5/ • Download lemma dictionary from 
 http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz conf/
 opennlp/
 en-ner-person.bin
 en-pos-maxent.bin
 en-sent.bin
 en-token.bin
 language-tool-en-lemmatizer.txt
  • 15. Add field type and fields curl -X POST http://localhost:8983/solr/opennlp/schema 
 -H 'Content-type: application/json' --data-binary '{
 "add-field-type":{
 "name":"text_lemma",
 "class":"solr.TextField",
 "positionIncrementGap":"100",
 "analyzer":{
 "tokenizer":{
 "class":"solr.OpenNLPTokenizerFactory",
 "sentenceModel":"opennlp/en-sent.bin",
 "tokenizerModel":"opennlp/en-token.bin"
 },
 "filters":[{
 "class":"solr.OpenNLPFilterFactory",
 "posTaggerModel":"opennlp/en-pos-maxent.bin"
 },{
 "class":"solr.OpenNLPLemmatizerFilterFactory",
 "dictionary":"opennlp/language-tool-en-lemmatizer.txt"
 }]}},
 "add-field":{
 "name":"content_lemma",
 "type":"text_lemma",
 “stored":true }
 }'
  • 16. • (Switch to http://localhost:8983/solr here) Demo
  • 17. • Make Solr update request processors for named entity recognition, maybe phrase chunker. • Optimize memory usage to only process one sentence at a time. • Commit/release LUCENE-2899! Future Work
  • 18. Resources Solr: http://lucene.apache.org/solr OpenNLP: http://opennlp.apache.org LUCENE-2899: https://issues.apache.org/jira/browse/LUCENE-2899 OpenNLP pre-trained models: http://opennlp.sourceforge.net/models-1.5/ brat rapid annotation tool: http://brat.nlplab.org/index.html LanguageTool lemma dictionaries: http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz Company: http://www.lucidworks.com Our blog: http://www.lucidworks.com/blog Twitter: @steven_a_rowe