SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Text categorization with
   Lucene and Solr
     Tommaso Teofili
     tommaso [at] apache [dot] org




                                     1
About me
l      ASF member having fun with:
  l    Lucene / Solr
  l    Hama
  l    UIMA
  l    Stanbol
  l    … some others
l      SW engineer @ Adobe R&D




                                      2
Agenda
l    Classification
l    Lucene classification module
l    Solr text categorization services
l    Conclusions




                                          3
Classification
l    Let the algorithm assign one or more labels
      (classes) to some item given some
      previous knowledge
      l    Spam filter
      l    Tagging system
      l    Digit recognition system
      l    Text categorization
      l    etc.




                                           4
Classification?
why with Lucene?




                   5
The short story
l      Lucene already has a lot of features for
        common information retrieval needs
  l    Postings
  l    Term vectors
  l    Statistics
  l    Positions
  l    TF / IDF
  l    maybe Payloads
  l    etc.
l      We may avoid bringing in new components
        to do classification just leveraging what we
        get for free from Lucene
                                              6
The (slightly) longer story #1
l      While playing with NLP stuff
l      Need to implement a naïve bayes classifier
  l    Not possible to plug in stuff requiring touching
        the architecture
  l    Not really interested in (near) real time
        performance
l      Iteration 1
  l    Plain in memory Java stuff
l      Iteration 2
  l    Same stuff but using Lucene instead of loading
        things into memory
         l    Too much faster J
                                                    7
The (slightly) longer story #2
l      So I realized
  l    Lucene has so many features stored you can
        take advantage of for free
  l    Therefore writing the classification algorithm is
        relatively simple
  l    In many cases you’re just not adding anything to
        the architecture
         l    Your Lucene index was already there for searching
  l    Lucene index is, to some extent, already a model
        which we just need to “query” with the proper
        algorithm
  l    And it is fast enough
                                                           8
Lucene classification module
l      Work in progress on trunk
l      LUCENE-4345
  l    Establishing classification API
  l    With currently two implementations
         l    Naïve bayes
         l    K nearest neighbor




                                             9
Lucene classification module
l      Classifier API
  l    Training
  l    void train(atomicReader, contentField,
        classField, analyzer) throws IOException

         l    atomicReader : the reader on the Lucene index to
               use for classification
               l    still unsure if IR’d be better
         l    textFieldName : the name of the field which contains
               documents’ texts
         l    classFieldName : the name of the field which contains
               the class assigned to existing documents
         l    analyzer : the item used for analyzing the unseen
               texts
                                                            10
Lucene classification module
l      Classifier API
  l    Classifying
  l    ClassificationResult assignClass(String text)
        throws IOException
         l    text: the unseen text of the document to classify
         l    ClassificationResult : the object containing the
               assigned class along with the related score




                                                              11
K Nearest neighbor classifier
l    Fairly simple classification algorithm
l    Given some new unseen item
l    I search in my knowledge base the k items
      which are nearer to the new one
l    I get the k classes assigned to the k
      nearest items
l    I assign to the new item the class that is
      most frequent in the k returned items



                                           12
K Nearest neighbor classifier
l      How can we do this in Lucene?
  l    We have VSM for representing documents as
        vectors and eventually find distances
  l    Lucene MoreLikeThis module can do a lot for it
  l    Given a new document
         l    It’s represented as a MoreLikeThisQuery which filters
               out too frequent words and helps on keeping only the
               relevant tokens for finding the neighbors
         l    The query is executed returning only the first k results
         l    The result is then browsed in order to find the most
               frequent class and that is then assigned with a score
               of classFreq / k


                                                               13
Naïve Bayes classifier
l      Slightly more complicated
l      Based on probabilities
l      C = argmax( P(d|c) * P(c) )
  l    P(d|c) : likelihood
  l    P(c) : prior
  l    With some assumptions:
         l    bag of words assumption: positions don't matter
         l    conditional independence: the feature probabilities
               are independent given a class




                                                             14
Naïve Bayes classifier
l      Prior calculation is easy
  l    It’s the relative frequency of each class
        #docsWithClassC / #docs
l      Likelihood is easy too because of the bag
        of words assumption
  l    P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c)
  l    So we just need probabilities of single terms
         l    P(x|c) := (tf of x in documents with class c + 1)/
               (#terms in docs with class c + #docs)




                                                                15
Naïve Bayes classifier
l      Does the bag of words assumption affect
        the classifier’s precision?
  l    Yes in theory
         l    in text documents (nearby) words are strictly
               correlated
  l    Not always in practice
         l    depending on your index data it may or not have an
               impact




                                                               16
Using different indexes
l      The Classifier API makes usage of an
        AtomicReader to get the data for the
        training
l      It must not be the very same index used for
        every day index / search
  l    For performance reasons
  l    For enhancing classifier effectiveness
         l    Using more specific analyzers
         l    Indexing data in a different way
                l    e.g. one big document for each class and use kNN (with a
                      small k) or TF-IDF similarity



                                                                       17
Things to consider - bootstrapping
l      How are your first documents classified?
  l    Manually
         l    Categories are already there in the documents
         l    Someone is explicitly charged to do that (e.g. article
               authors) at some point in time
  l    (semi) automatically
         l    Using some existing service / library
                l    With or without human supervision

  l    In either case the classifier needs something to
        be fed with to be effective



                                                               18
Things to consider – tokenizing
l      How are your content field tokenized?
  l    Whitespace
         l    It doesn’t work for each language
  l    Standard
  l    Sentence
  l    What about using N-Grams?
  l    What about using Shingles?




                                                   19
Things to consider - filtering
l      Some words may / should be filtered while
  l    Training
  l    Classifying
l      Often
  l    Stopwords
  l    Punctuation
  l    Not relevant PoS tagged tokens



                                            20
Raw benchmarking
l      Tried both algorithms on ~1M docs index
  l    Naïve bayes is affected by the # of classes
  l    kNN is affected by k being large
l      None of them took more than 1-2m to train
        even with great number of classes or large
        k values




                                                  21
From Lucene to Solr
l      The Lucene classifiers can be easily used
        in Solr
  l    As specific search services
         l    A classification based more like this
  l    While indexing
         l    For automatic text categorization




                                                       22
Classification based MLT
l      Use case:
  l    “give me all the documents that belong to the
        same category of a new not indexed document”
  l    Slightly different from basic MLT since it does
        not return the nearest docs
  l    That is useful if the user doesn’t want / need to
        index the document and still want to find all the
        other documents of the same category, whatever
        this category means


                                                   23
Classification based MLT
l      ClassificationMLTHandler
  l    String d = req.getParams().get(DOC);
  l    ClassificationResult r = classifier.assignClass(d);
  l    String c = r.getAssignedClass();
  l    req.getSearcher().search(new TermQuery(new
        Term(classFieldName, c)), rows);




                                                    24
Automatic text categorization
l      Once a doc reaches Solr
l      We can use the Lucene classifiers to
        automate assigning document’s category
l      We can leverage existing Solr facilites for
        enhancing the indexing pipeline
  l    An UpdateChain can be decorated with one or
        more UpdateRequestProcessors




                                               25
Automatic text categorization
l      Configuration
  l    <updateRequestProcessorChain name=“ctgr">
  l    <processor
        class=”solr.CategorizationUpdateRequestProces
        sorFactory">
  l    <processor
        class="solr.RunUpdateProcessorFactory" />
  l    </updateRequestProcessorChain>



                                               26
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      void processAdd(AddUpdateCommand
        cmd) throws IOException
         l    String text = solrInputDocument.getFieldValue(“text”);
         l    String class = classifier.assignClass(text);
         l    solrInputDocument.addField(“cat”, class);
  l    Every now and then need to retrain to get latest
        stuff in the current index, but that can be done in
        the background without affecting performances


                                                             27
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      Finer grained control
  l    Use automatic text categorization only if a value
        does not exist for the “cat” field
  l    Add the classifier output class to the “cat” field
        only if it’s above a certain score




                                                      28
Wrap up
l      Simple classifiers with no or little effort
l      No architecture change
l      Both available to Lucene and Solr
l      Still reasonably fast
l      A lot more can be done
  l    Implement a MaxEnt Lucene based classifier
         l    which takes into account words correlation



                                                            29
Thanks!




          30

Contenu connexe

Tendances

Matrix Factorization Techniques For Recommender Systems
Matrix Factorization Techniques For Recommender SystemsMatrix Factorization Techniques For Recommender Systems
Matrix Factorization Techniques For Recommender Systems
Lei Guo
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
koolkampus
 

Tendances (20)

Data partitioning
Data partitioningData partitioning
Data partitioning
 
Matrix Factorization Techniques For Recommender Systems
Matrix Factorization Techniques For Recommender SystemsMatrix Factorization Techniques For Recommender Systems
Matrix Factorization Techniques For Recommender Systems
 
Abstract Data Types
Abstract Data TypesAbstract Data Types
Abstract Data Types
 
Python Programming | JNTUA | UNIT 2 | Fruitful Functions |
Python Programming | JNTUA | UNIT 2 | Fruitful Functions | Python Programming | JNTUA | UNIT 2 | Fruitful Functions |
Python Programming | JNTUA | UNIT 2 | Fruitful Functions |
 
Text classification-php-v4
Text classification-php-v4Text classification-php-v4
Text classification-php-v4
 
Modules and packages in python
Modules and packages in pythonModules and packages in python
Modules and packages in python
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heures
 
Packages In Python Tutorial
Packages In Python TutorialPackages In Python Tutorial
Packages In Python Tutorial
 
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
Artificial Intelligence (AI) | Prepositional logic (PL)and first order predic...
 
Functional dependencies in Database Management System
Functional dependencies in Database Management SystemFunctional dependencies in Database Management System
Functional dependencies in Database Management System
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
[Pgday.Seoul 2017] 6. GIN vs GiST 인덱스 이야기 - 박진우
[Pgday.Seoul 2017] 6. GIN vs GiST 인덱스 이야기 - 박진우[Pgday.Seoul 2017] 6. GIN vs GiST 인덱스 이야기 - 박진우
[Pgday.Seoul 2017] 6. GIN vs GiST 인덱스 이야기 - 박진우
 
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position EmbeddingRoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position Embedding
 
Binary Search
Binary SearchBinary Search
Binary Search
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
 
DBMS Unit - 8 - Database Security
DBMS Unit - 8 - Database SecurityDBMS Unit - 8 - Database Security
DBMS Unit - 8 - Database Security
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Indexing and Hashing
Indexing and HashingIndexing and Hashing
Indexing and Hashing
 
Core python programming tutorial
Core python programming tutorialCore python programming tutorial
Core python programming tutorial
 
What are Data structures in Python? | List, Dictionary, Tuple Explained | Edu...
What are Data structures in Python? | List, Dictionary, Tuple Explained | Edu...What are Data structures in Python? | List, Dictionary, Tuple Explained | Edu...
What are Data structures in Python? | List, Dictionary, Tuple Explained | Edu...
 

Similaire à Text categorization with Lucene and Solr

Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitioners
Hoffman Lab
 

Similaire à Text categorization with Lucene and Solr (20)

Oop basic concepts
Oop basic conceptsOop basic concepts
Oop basic concepts
 
Java Notes
Java NotesJava Notes
Java Notes
 
Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptx
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture One
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in python
 
Proyecto
ProyectoProyecto
Proyecto
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python Programming
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave Implicit
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
Object Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxObject Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptx
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and Reasonable
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptx
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitioners
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learning
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP
 

Plus de Tommaso Teofili

Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
Tommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
Tommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
Tommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 

Plus de Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Text categorization with Lucene and Solr

  • 1. Text categorization with Lucene and Solr Tommaso Teofili tommaso [at] apache [dot] org 1
  • 2. About me l  ASF member having fun with: l  Lucene / Solr l  Hama l  UIMA l  Stanbol l  … some others l  SW engineer @ Adobe R&D 2
  • 3. Agenda l  Classification l  Lucene classification module l  Solr text categorization services l  Conclusions 3
  • 4. Classification l  Let the algorithm assign one or more labels (classes) to some item given some previous knowledge l  Spam filter l  Tagging system l  Digit recognition system l  Text categorization l  etc. 4
  • 6. The short story l  Lucene already has a lot of features for common information retrieval needs l  Postings l  Term vectors l  Statistics l  Positions l  TF / IDF l  maybe Payloads l  etc. l  We may avoid bringing in new components to do classification just leveraging what we get for free from Lucene 6
  • 7. The (slightly) longer story #1 l  While playing with NLP stuff l  Need to implement a naïve bayes classifier l  Not possible to plug in stuff requiring touching the architecture l  Not really interested in (near) real time performance l  Iteration 1 l  Plain in memory Java stuff l  Iteration 2 l  Same stuff but using Lucene instead of loading things into memory l  Too much faster J 7
  • 8. The (slightly) longer story #2 l  So I realized l  Lucene has so many features stored you can take advantage of for free l  Therefore writing the classification algorithm is relatively simple l  In many cases you’re just not adding anything to the architecture l  Your Lucene index was already there for searching l  Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm l  And it is fast enough 8
  • 9. Lucene classification module l  Work in progress on trunk l  LUCENE-4345 l  Establishing classification API l  With currently two implementations l  Naïve bayes l  K nearest neighbor 9
  • 10. Lucene classification module l  Classifier API l  Training l  void train(atomicReader, contentField, classField, analyzer) throws IOException l  atomicReader : the reader on the Lucene index to use for classification l  still unsure if IR’d be better l  textFieldName : the name of the field which contains documents’ texts l  classFieldName : the name of the field which contains the class assigned to existing documents l  analyzer : the item used for analyzing the unseen texts 10
  • 11. Lucene classification module l  Classifier API l  Classifying l  ClassificationResult assignClass(String text) throws IOException l  text: the unseen text of the document to classify l  ClassificationResult : the object containing the assigned class along with the related score 11
  • 12. K Nearest neighbor classifier l  Fairly simple classification algorithm l  Given some new unseen item l  I search in my knowledge base the k items which are nearer to the new one l  I get the k classes assigned to the k nearest items l  I assign to the new item the class that is most frequent in the k returned items 12
  • 13. K Nearest neighbor classifier l  How can we do this in Lucene? l  We have VSM for representing documents as vectors and eventually find distances l  Lucene MoreLikeThis module can do a lot for it l  Given a new document l  It’s represented as a MoreLikeThisQuery which filters out too frequent words and helps on keeping only the relevant tokens for finding the neighbors l  The query is executed returning only the first k results l  The result is then browsed in order to find the most frequent class and that is then assigned with a score of classFreq / k 13
  • 14. Naïve Bayes classifier l  Slightly more complicated l  Based on probabilities l  C = argmax( P(d|c) * P(c) ) l  P(d|c) : likelihood l  P(c) : prior l  With some assumptions: l  bag of words assumption: positions don't matter l  conditional independence: the feature probabilities are independent given a class 14
  • 15. Naïve Bayes classifier l  Prior calculation is easy l  It’s the relative frequency of each class #docsWithClassC / #docs l  Likelihood is easy too because of the bag of words assumption l  P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c) l  So we just need probabilities of single terms l  P(x|c) := (tf of x in documents with class c + 1)/ (#terms in docs with class c + #docs) 15
  • 16. Naïve Bayes classifier l  Does the bag of words assumption affect the classifier’s precision? l  Yes in theory l  in text documents (nearby) words are strictly correlated l  Not always in practice l  depending on your index data it may or not have an impact 16
  • 17. Using different indexes l  The Classifier API makes usage of an AtomicReader to get the data for the training l  It must not be the very same index used for every day index / search l  For performance reasons l  For enhancing classifier effectiveness l  Using more specific analyzers l  Indexing data in a different way l  e.g. one big document for each class and use kNN (with a small k) or TF-IDF similarity 17
  • 18. Things to consider - bootstrapping l  How are your first documents classified? l  Manually l  Categories are already there in the documents l  Someone is explicitly charged to do that (e.g. article authors) at some point in time l  (semi) automatically l  Using some existing service / library l  With or without human supervision l  In either case the classifier needs something to be fed with to be effective 18
  • 19. Things to consider – tokenizing l  How are your content field tokenized? l  Whitespace l  It doesn’t work for each language l  Standard l  Sentence l  What about using N-Grams? l  What about using Shingles? 19
  • 20. Things to consider - filtering l  Some words may / should be filtered while l  Training l  Classifying l  Often l  Stopwords l  Punctuation l  Not relevant PoS tagged tokens 20
  • 21. Raw benchmarking l  Tried both algorithms on ~1M docs index l  Naïve bayes is affected by the # of classes l  kNN is affected by k being large l  None of them took more than 1-2m to train even with great number of classes or large k values 21
  • 22. From Lucene to Solr l  The Lucene classifiers can be easily used in Solr l  As specific search services l  A classification based more like this l  While indexing l  For automatic text categorization 22
  • 23. Classification based MLT l  Use case: l  “give me all the documents that belong to the same category of a new not indexed document” l  Slightly different from basic MLT since it does not return the nearest docs l  That is useful if the user doesn’t want / need to index the document and still want to find all the other documents of the same category, whatever this category means 23
  • 24. Classification based MLT l  ClassificationMLTHandler l  String d = req.getParams().get(DOC); l  ClassificationResult r = classifier.assignClass(d); l  String c = r.getAssignedClass(); l  req.getSearcher().search(new TermQuery(new Term(classFieldName, c)), rows); 24
  • 25. Automatic text categorization l  Once a doc reaches Solr l  We can use the Lucene classifiers to automate assigning document’s category l  We can leverage existing Solr facilites for enhancing the indexing pipeline l  An UpdateChain can be decorated with one or more UpdateRequestProcessors 25
  • 26. Automatic text categorization l  Configuration l  <updateRequestProcessorChain name=“ctgr"> l  <processor class=”solr.CategorizationUpdateRequestProces sorFactory"> l  <processor class="solr.RunUpdateProcessorFactory" /> l  </updateRequestProcessorChain> 26
  • 27. Automatic text categorization l  CategorizationUpdateRequestProcessor l  void processAdd(AddUpdateCommand cmd) throws IOException l  String text = solrInputDocument.getFieldValue(“text”); l  String class = classifier.assignClass(text); l  solrInputDocument.addField(“cat”, class); l  Every now and then need to retrain to get latest stuff in the current index, but that can be done in the background without affecting performances 27
  • 28. Automatic text categorization l  CategorizationUpdateRequestProcessor l  Finer grained control l  Use automatic text categorization only if a value does not exist for the “cat” field l  Add the classifier output class to the “cat” field only if it’s above a certain score 28
  • 29. Wrap up l  Simple classifiers with no or little effort l  No architecture change l  Both available to Lucene and Solr l  Still reasonably fast l  A lot more can be done l  Implement a MaxEnt Lucene based classifier l  which takes into account words correlation 29
  • 30. Thanks! 30