Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Building an enterprise Natural Language Search
Engine with ElasticSearch and Facebook’s DrQA
Louis Baligand, Debmalya Bisw...
PMI INFORMATION SERVICES 2016
About
2
https://github.com/philipmorrisintl
Debmalya Biswas
Louis Baligand
“Forrester defines cognitive search and knowledge discovery solutions as
A new generation of enterprise search solutions t...
““ The average interaction worker spends
[...] nearly 20 percent (of the workweek)
looking for internal information.”
-MGI...
PMI INFORMATION SERVICES 2016
Enterprise Search vs. Web Search
6
Enterprise Search Web Searchvs.
Multiple content types
Li...
PMI INFORMATION SERVICES 2016
Natural Language Search (NLS)
7
Knowledge Graph
PMI INFORMATION SERVICES 2016
Chatbots and Natural Language Search
Natural Language Search
(Neural Networks)
 Works on do...
PMI INFORMATION SERVICES 2016
Chatbots and Natural Language Search (2)
3- tier strategy:
 A Chatbot with its pre-
defined...
PMI INFORMATION SERVICES 2016
• End-user searching for
products (not answer)
• Filter-Oriented
• Rates, Review
10
Position...
PMI INFORMATION SERVICES 2016
Philip Morris’ Use
case: Operator
Trainings
• Hundreds to thousands of operators
• Long manu...
PMI INFORMATION SERVICES 2016
Example of fine-grained results
12
Q. How many knives are there on the drums?
PMI INFORMATION SERVICES 2016
Question Answering?
• Squad Dataset: a reference in
Question Answering
• 100,000+ Q&A on Wik...
PMI INFORMATION SERVICES 2016
DrQA Overview
• Facebook AI Research, ACL
2017, Reading Wikipedia to
answer Open-Domain
Ques...
PMI INFORMATION SERVICES 2016
DrQA Overview
16
Bigram
TFIDF
Bi-direct.
RNN
PMI INFORMATION SERVICES 2016
DrQA is easy to use on your own corpus!
17
$ pythonbuild_db.py /path/to/data /path/to/saved/...
PMI INFORMATION SERVICES 2016
DrQA to answer Operator’s questions?
18
• Java toolkit to extract text + metadata from DOCX,...
PMI INFORMATION SERVICES 2016
DrQA to answer Operator’s questions?
19
https://github.com/facebookresearch/DrQA
P@5: 76%
• ...
PMI INFORMATION SERVICES 2016
Introducing Elasticsearch
• Open source distributed
• Highly scalable
• RESTful API on top o...
PMI INFORMATION SERVICES 2016
Integrating Elasticsearch to DrQA’s pipeline
22
Index
PMI INFORMATION SERVICES 2016
Integrating Elasticsearch to DrQA’s pipeline
23
>>>fromdrqa.pipeline importDrQA
>>>fromdrqa....
PMI INFORMATION SERVICES 2016
The pipeline performance
24
P@5: 76% 84%
P@5 ref.: 78%
(DrQA)
F1 score: 42%
F1 score ref.: 7...
PMI INFORMATION SERVICES 2016
Take aways
 Address pain points by combining
Document Retrieval with Question
Answering
 I...
PMI INFORMATION SERVICES 2016
Future work – Extend pipeline with BERT*
 A general-purpose architecture to train models fo...
Thank you.
Prochain SlideShare
Chargement dans…5
×

Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA

185 vues

Publié le

Presented at Berlin Buzzwords 2019
https://berlinbuzzwords.de/19/session/building-enterprise-natural-language-search-engine-elasticsearch-and-facebooks-drqa

Publié dans : Logiciels
  • Soyez le premier à commenter

Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA

  1. 1. Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA Louis Baligand, Debmalya Biswas Berlin Buzzwords, 17 June 2019 Enterprise Architecture
  2. 2. PMI INFORMATION SERVICES 2016 About 2 https://github.com/philipmorrisintl Debmalya Biswas Louis Baligand
  3. 3. “Forrester defines cognitive search and knowledge discovery solutions as A new generation of enterprise search solutions that employ AI technologies such as natural language processing and machine learning to ingest, understand, organize, and query digital content from multiple data sources. 3
  4. 4. ““ The average interaction worker spends [...] nearly 20 percent (of the workweek) looking for internal information.” -MGI Report, 2012. 4 Half (54%) of global information workers said, "My work gets interrupted because I can't find or get access to information I need to complete my tasks" a few times a month or more often. -Forrester Data Global Business Technographics Devices And Security Workforce Survey, 2016.
  5. 5. PMI INFORMATION SERVICES 2016 Enterprise Search vs. Web Search 6 Enterprise Search Web Searchvs. Multiple content types Limited tagging/metadata management Role-based content trimming Small amount of content Single source (web pages) Large investments in SEO (*) (*): Search Engine Optimization No visibility restrictions (public pages) Enormous amount of content No team in charge of Search Experience Search xxperience as core business Employees are the end-users WWW users
  6. 6. PMI INFORMATION SERVICES 2016 Natural Language Search (NLS) 7 Knowledge Graph
  7. 7. PMI INFORMATION SERVICES 2016 Chatbots and Natural Language Search Natural Language Search (Neural Networks)  Works on documents.  Users can ask any question from the documents.  Both the documents and questions are passed through the same Neural Network, producing the matching answer. Intent based Chatbots (Statistical Methods)  Requires Q&A knowledge.  Able to scale with respect to question variants by applying Statististical Clustering Methods, e.g. tf- idf, Bag-of-Words - to cluster question variants into ‘intents’. (Rules based) FAQs .  Works only for specific hardcoded questions.  The only way to scale with respect to question variants, is to extend the knowledgebase by manually adding variants of a question. “How do I replace the heating component of my iQoS?” = “Tell me how to change the heating component of my iQoS” <Q> how replace heating component iQoS <Q> how change heating component iQoS Same Intent #repairIQOS Document data base Neural Network <Q> Neural Network (Offline) (Real-time) <A>
  8. 8. PMI INFORMATION SERVICES 2016 Chatbots and Natural Language Search (2) 3- tier strategy:  A Chatbot with its pre- defined Q&A set remains the entry point – think of it as the 1st line of defense.  If the bot encounters a user query which cannot be mapped to one of its pre- configured intents, it performs a NLS over its KB. This is the 2nd line of defense.  If the user is not satisfied even with search results, plan for a final handover to a live agent. Ref: “Chatbots & Natural Language Search: 2 sides of the same coin?” (link)
  9. 9. PMI INFORMATION SERVICES 2016 • End-user searching for products (not answer) • Filter-Oriented • Rates, Review 10 Positioning vs e-commerce search
  10. 10. PMI INFORMATION SERVICES 2016 Philip Morris’ Use case: Operator Trainings • Hundreds to thousands of operators • Long manuals with specific terminology • A 1min downtime of a machine would lead to 20,000 cigarettes unmade • Typical Full text Search (Boolean search, no relevancy score) • Document Management System Manually classified • On-boarding difficulty 11
  11. 11. PMI INFORMATION SERVICES 2016 Example of fine-grained results 12 Q. How many knives are there on the drums?
  12. 12. PMI INFORMATION SERVICES 2016 Question Answering? • Squad Dataset: a reference in Question Answering • 100,000+ Q&A on Wikipedia articles • State of the art is beating Human Performance 14
  13. 13. PMI INFORMATION SERVICES 2016 DrQA Overview • Facebook AI Research, ACL 2017, Reading Wikipedia to answer Open-Domain Questions. • Open source, BSD License https://github.com/facebookr esearch/DrQA • Pre-trained model available 15 https://github.com/facebookresearch/DrQA
  14. 14. PMI INFORMATION SERVICES 2016 DrQA Overview 16 Bigram TFIDF Bi-direct. RNN
  15. 15. PMI INFORMATION SERVICES 2016 DrQA is easy to use on your own corpus! 17 $ pythonbuild_db.py /path/to/data /path/to/saved/db.db $ pythonbuild_tfidf.py /path/to/doc/db /path/to/output/dir 0.06 0.02 0.03 0.08 Terms Docs $ pythoninteractive.py –reader-modelmultitask.mdl –retriever-modelpath/to/tfidf –doc-db path/to/saved/db.db >>>process('Whatis theanswertolife,the universe,andeverything?’) Top Predictions: +------+--------+---------------------------------------------------+--------------+-----------+ | Rank| Answer| Doc | AnswerScore|DocScore| +------+--------+---------------------------------------------------+--------------+-----------+ | 1 | 42 | PhrasesfromThe Hitchhiker'sGuide tothe Galaxy | 47242 | 141.26 | +------+--------+---------------------------------------------------+--------------+-----------+ Pre-trained model open sourced
  16. 16. PMI INFORMATION SERVICES 2016 DrQA to answer Operator’s questions? 18 • Java toolkit to extract text + metadata from DOCX, PPT, XLS, PDF, JPEG, etc… • Apache Software Foundation • OCR
  17. 17. PMI INFORMATION SERVICES 2016 DrQA to answer Operator’s questions? 19 https://github.com/facebookresearch/DrQA P@5: 76% • Not a voice assistant • End user needs at least ~95% • Full control on the retriever • First stage to prioritize
  18. 18. PMI INFORMATION SERVICES 2016 Introducing Elasticsearch • Open source distributed • Highly scalable • RESTful API on top of Lucene capabilities • Support for Full Text search (best of bread) • Easy to configure + extend • Seamlessly manage conflicts • Active community & popular 21
  19. 19. PMI INFORMATION SERVICES 2016 Integrating Elasticsearch to DrQA’s pipeline 22 Index
  20. 20. PMI INFORMATION SERVICES 2016 Integrating Elasticsearch to DrQA’s pipeline 23 >>>fromdrqa.pipeline importDrQA >>>fromdrqa.retrieverimportElasticDocRanker >>>model= DrQA(reader_model=‘reader_model.mdl’, ranker_config={'class':ElasticDocRanker, 'options':{'elastic_url':'127.0.0.1:9200’, 'elastic_index':'mini’, 'elastic_fields':'content’, 'elastic_field_doc_name':['file','filename’], 'elastic_field_content': 'content’}}) >>>model.process(’Howthe tensioningoftheV-belts shouldbe done?’) Directly point to your server hosting Elastic Enable to search in any fields, e.g. uni-grams, bi- grams, title, metadata, etc…
  21. 21. PMI INFORMATION SERVICES 2016 The pipeline performance 24 P@5: 76% 84% P@5 ref.: 78% (DrQA) F1 score: 42% F1 score ref.: 79% (DrQA) • DrQA span +- 10 tokens: 94% of 1st result contains true answer
  22. 22. PMI INFORMATION SERVICES 2016 Take aways  Address pain points by combining Document Retrieval with Question Answering  If not answered, it will provide much more granular insights of the data  User elicitation & user experience: a top down approach  End user does not know what to ask 25
  23. 23. PMI INFORMATION SERVICES 2016 Future work – Extend pipeline with BERT*  A general-purpose architecture to train models for multiple NLP tasks (sentiment analysis, etc…)  State of the art for SQuAD  Open source, published in Oct. 2018 by Google AI Research  High memory required: GPU with at least 12GB of RAM (Base model)  Enable to multi-language queries 26 *https://arxiv.org/abs/1810.04805, https://github.com/google-research/bert • Add one layer to compute Pstart(“token”) & Pend(“token”) for each tokens • Find the best pair by maximizing Pstart(“token1”) * Pend(“token2”)
  24. 24. Thank you.

×