Text Analytics in Enterprise Search
         Daniel Ling (Findwise)
What will I cover?
   Intro
   About Text Analytics
   Benefits and possibilities
   Examples
   Solution Techniques to Examples
   Conclusions




My Background
   Daniel Ling
   Findwise
   Enterprise Search and Findability Consultant
   Experience and expertise
       5+ years of enterprise search experience
       20+ enterprise search implementations across a range of industries
       Lucene, FAST ESP, Solr
       Apache Solr is my primary search platform
       Focus areas include findability, search architecture and
        implementation, text analytics, and document processing.




About Text Analytics




Text Analytics in the Enterprise
Challenges:
 80% of data in the Enterprise is unstructured.
 Reduce the time spent looking for information (currently 9.6 hours per week)
 Reduce the time spent reading documents and e-mails (currently 14.5 hours per
  week)

Benefits:
 More predictable scale and domain
 Well-understood domain
 Supporting content for analytics can be identified




Text Analytics
The definition


   A set of linguistic, statistical and machine learning techniques
   used to model and structure the information content of textual
   sources.

      - Wikipedia.org




Types of Applications


•   Entity Extraction
•   Document Categorization
•   Sentiment Analysis
•   Summarization




Frameworks and Techniques


Framework                          Techniques

Solr                               Statistics, Linguistics

Mallet, Classifier4j, etc.         Statistical natural language processing

Mahout (Hadoop)                    Machine Learning, Statistics

GATE                               General language processing framework

UIMA                               Content analytics, text mining, pipeline

OpenNLP                            Machine learning toolkit for NLP


Benefits and possibilities

 Text analytics can bring some structure to the unstructured content
 Enhance discovery and findability of content
   • Works well together with search
 Increase relevance and precision with extracted keywords and
  metadata
 Generate content for dynamic pages / topic pages
   • Selection of documents and extracts from documents
 Track and discover sentiments
 Reduce the time users need to analyze content




Examples




Entity Extraction

 Types of Entities for Extraction:
   • Dates
   • Places
   • Companies
   • Objects (Product names, etc)
   • People
   • Events




Example – Presenting the data




Example – Facets on the data




Example Solution: Entity Extraction
 Rule-based entity extraction
    Combination of lists and regular expressions
 Works within well-understood domains.
 Requires maintaining lists.
 List sources: countries from the World Factbook, public companies from
  Google Finance, customers from the CRM.
 Workflow: Document for indexing > Update Request Handler >
  Update Chain (lookup and match entities) > Writes to index



 Diagram: Update Chain (processor: lists | input fields | entity fields) >
  Lucene Index (entity fields)
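
The list-plus-regular-expression approach described above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python and not the Solr update processor from the talk: the list contents, the field names (`title`, `body`, `entity_*`) and the date pattern are all assumptions made for the example.

```python
import re

# Hypothetical lookup lists. In the talk these come from sources such as
# the World Factbook, Google Finance and a CRM export.
ENTITY_LISTS = {
    "places": ["Sweden", "United States", "Germany"],
    "companies": ["Findwise", "Acme Corp"],
}

# Deliberately simple date rule (ISO style only, e.g. 2012-05-09).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_list_pattern(names):
    """Compile one case-insensitive, whole-word regex for a list of names."""
    # Longest names first so a longer name wins over a shorter overlap.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


LIST_PATTERNS = {
    field: build_list_pattern(names) for field, names in ENTITY_LISTS.items()
}


def extract_entities(document, input_fields=("title", "body")):
    """Return a copy of the document with entity fields added."""
    text = " ".join(document.get(field, "") for field in input_fields)
    enriched = dict(document)
    for field, pattern in LIST_PATTERNS.items():
        # Map matches back to the spelling used in the list.
        canonical = {name.lower(): name for name in ENTITY_LISTS[field]}
        found = {canonical[m.group(0).lower()] for m in pattern.finditer(text)}
        enriched["entity_" + field] = sorted(found)
    enriched["entity_dates"] = sorted(set(DATE_PATTERN.findall(text)))
    return enriched


doc = {
    "id": "1",
    "title": "Findwise opens office",
    "body": "On 2012-05-09 findwise announced a new office in Sweden.",
}
result = extract_entities(doc)
print(result["entity_companies"], result["entity_places"], result["entity_dates"])
```

Writing the matches to separate fields is what makes them usable as facets later. The cost, as the slide notes, is that the lists have to be maintained.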




Example Solution: Entity Extraction
 Register a custom class that looks up resources and writes the entities
  it finds to specific Solr fields. The class is registered in solrconfig.xml.




Document Categorization

   Assigns a label to the document / content / data.
   Labels can represent a category or a sentiment.
   A threshold value must be met before a category label is assigned.
   Statistics and “knowledge” from previous examples can be used.
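
To make the threshold idea concrete, here is a minimal sketch of a trained categorizer that only assigns a label when it is confident enough. It uses a small multinomial naive Bayes model written from scratch as a stand-in for a toolkit such as Mallet. The class name, the training texts and the 0.6 threshold are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Tiny multinomial naive Bayes classifier with a labeling threshold."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.doc_counts = Counter()              # label -> training documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.term_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return a normalized probability per label for the given text."""
        # Terms never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, doc_count in self.doc_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for token in tokens:
                # Laplace smoothing avoids zero probabilities.
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def categorize(self, text):
        """Return the best label, or None if it does not reach the threshold."""
        if not self.doc_counts:
            return None
        probs = self.probabilities(text)
        label, probability = max(probs.items(), key=lambda item: item[1])
        return label if probability >= self.threshold else None


categorizer = NaiveBayesCategorizer(threshold=0.6)
categorizer.train("solr index query search relevance", "search")
categorizer.train("lucene search index facets", "search")
categorizer.train("invoice payment budget finance", "finance")
categorizer.train("quarterly budget revenue invoice", "finance")

print(categorizer.categorize("search index relevance tuning"))
print(categorizer.categorize("lunch menu for friday"))
```

In an indexing pipeline, the returned label would be written to a category field and a `None` result would leave the field empty, which keeps low-confidence guesses out of the facets.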




Example – Facets from Categories




Example Solution: Document Categorization


                                               *

 Training the component, Mallet (Machine Learning for Language
  Toolkit).
   • Alternative components include a Lucene (TF-IDF) index
      (MoreLikeThis), OpenNLP, Textcat, Classifier4j.
 Running new documents against the model/index of trained
  documents.
 Training can be done from an interface, ad hoc, or from pre-categorized
  documents in the index.

* Figure from the book Taming Text.


Example Solution: Document Categorization
 Mallet: the setup and training process.




Example Solution: Document Categorization
 Evaluation of new document:




 Setting the evaluated category tag on the document in the pipeline:


 Diagram: Update Chain (processor: input document) >
  Lucene Index (category field)




Document Summarization

 Summarize a document at index time or on demand.
 Leverages the knowledge and term statistics of the document
  and the index.
 Picks the “most important” sentences based on those statistics and
  displays them.
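
One way to read "most important sentences based on the statistics" is to weight each term by how often it occurs in the document and how rare it is across the index, then keep the highest-scoring sentences. The sketch below does that in plain Python. The scoring formula, the function names and the sample texts are assumptions, not the component shown in the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(document, index_documents, max_sentences=2):
    """Return the top-scoring sentences of the document in original order."""
    # Document frequency of each term across the (hypothetical) index.
    doc_freq = Counter()
    for other in index_documents:
        doc_freq.update(set(tokenize(other)))
    n_docs = len(index_documents)

    term_freq = Counter(tokenize(document))

    def weight(term):
        # Frequent in this document and rare in the index scores highest.
        return term_freq[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))

    scored = []
    for position, sentence in enumerate(split_sentences(document)):
        terms = set(tokenize(sentence))
        # Average over distinct terms so long sentences are not favored.
        score = sum(weight(t) for t in terms) / len(terms) if terms else 0.0
        scored.append((score, position, sentence))

    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


document = (
    "Solr can extract entities at index time. "
    "The weather was nice. "
    "Entities extracted by Solr become facets."
)
index_documents = [document, "The weather was nice today.", "The meeting was long."]

summary = summarize(document, index_documents, max_sentences=2)
print(summary)
```

Running this at index time would give a static summary stored with the document. Running it per request, as the on-demand option suggests, lets the number of sentences vary by caller.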




Example – Summarize content


Static Summaries




Dynamic Summaries




Example – Summarize content - 1




Example – Summarize content - 2




Example Solution: Document Summarization
 Custom RequestHandler that receives a document ID and the field to
  summarize.
 Custom SearchComponent that selects the top sentences.
 Selects a subset of sentences and sends them back in a field.




               RequestHandler                         Lucene Index
          (SearchComponent for summarization)




Wrap Up

• Examples: Entity Extraction, Document Categorization,
  Summarization.
• Technology: You can take small steps and gain a great
  deal, since you can leverage features and
  components of Solr and Lucene (as well as other open
  source NLP frameworks).
• Value: Benefits of text analytics include increased
  discovery, findability and productivity from the
  solution.




Questions?



daniel.ling@findwise.com
www.findabilityblog.com



