SlideShare a Scribd company logo
1 of 14
Probabilistic Methods for Structured Document Classification at INEX´07




                     Probabilistic Methods for Structured
                     Document Classification at INEX´07

                                               ´
               Luis M. de Campos, Juan M. Fernandez-Luna, Juan F.
                            Huete, Alfonso E. Romero

                                                           ´
                   Departamento de Ciencias de la Computacion e Inteligencia Artificial
                                       Universidad de Granada
                                   {lci,jmfluna,jhg,aeromero}@decsai.ugr.es


                INEX 2007, Dagstuhl (Germany) December 18, 2007
Probabilistic Methods for Structured Document Classification at INEX´07




                                                       Abstract

               We present here the result of our participation in the
               Document Mining track at INEX´07. We submitted several
               runs for this track (only classification). This is the first year
               we apply for, with relative good results.
Probabilistic Methods for Structured Document Classification at INEX´07




      Our aim




               Finding good XML to flat-document transformations...
               ...in order to apply probabilistic flat-text and,
               to improve flat classifiers result.
Probabilistic Methods for Structured Document Classification at INEX´07




      Our aim




               Finding good XML to flat-document transformations...
               ...in order to apply probabilistic flat-text and,
               to improve flat classifiers result.
Probabilistic Methods for Structured Document Classification at INEX´07




      Our aim




               Finding good XML to flat-document transformations...
               ...in order to apply probabilistic flat-text and,
               to improve flat classifiers result.
Probabilistic Methods for Structured Document Classification at INEX´07
  The models
    Multinomial Naive Bayes


      The models: Multinomial Naive Bayes

               Extensively used in text classification (see paper by
               MacCallum).
               Posterior probability of a class is computed using the
               bayes rule:

                                                    p(ci )p(dj |ci )
                                 p(ci |dj ) =                        ∝ p(ci )p(dj |ci )   (1)
                                                        p(dj )

               Probabilities p(dj |ci ) are supossed to follow a multinomial
               distribution:
                                                  p(tk |ci )njk
                                   p(dj |ci ) ∝                            (2)
                                                                tk ∈dj

               The estimation of the needed values is carried out in the
                                                              N
               following way: p(tk |ci ) = Nik+M and p(ci ) = Ni,doc .
                                              +1
                                           Ni                   doc
Probabilistic Methods for Structured Document Classification at INEX´07
  The models
    OR Gate Bayesian Network Classifier


      The OR Gate Bayesian Network Classifier


               Tries to model rules of the following kind:
               IF (ti ∈ d) ∨ (tj ∈ d) ∨ ... THEN classify by c.
               The canonical model used is a noisy OR gate, wich is
               related with the notion of causality.
               The appearance of some terms is the cause for the
               assignation of a certain class.
               The general expression of the probability distributions is
               this:
                       p(ci |pa(Ci )) = 1 −      (1 − w(Tk , Ci ))
                                                             Tk ∈R(pa(Ci ))

                                       p(c i |pa(Ci )) = 1 − p(ci |pa(Ci ))
Probabilistic Methods for Structured Document Classification at INEX´07
  The models
    OR Gate Bayesian Network Classifier


      The OR Gate Bayesian Network Classifier (2)

               We instantiate all terms of a document: if
               tk ∈ dj , p(tk |dj ) = 1, 0 otherwise.
               We replicate each term node k with its frequency in the
               document njk .
               The computation of the posterior probability is very easy
               (only depending on the terms appearing on the document):

                                                                         (1 − w(Tk , Ci ))njk .
                               p(ci |dj ) = 1 −                                                           (3)
                                                     Tk ∈Pa(Ci )∩dj

               Two estimation schemes for the weights:
                                                                             Nik
                       Maximum likelihood: w(Tk , Ci ) =                     Nk .
                                                                                            (Ni −Nih )N
                                                                            = Nik   ×
                       Better approximation: w(Tk , Ci )                                h=k (N−Nh )Ni
                                                                              Nk
               See paper for details!
Probabilistic Methods for Structured Document Classification at INEX´07
  Document representation
    Example XML file


      Document representation: example XML file


       <book>
        <title>El ingenioso hidalgo Don Quijote
         de la Mancha</title>
        <author>Miguel de Cervantes Saavedra</author>
        <contents>
          <chapter>Uno</chapter>
           <text>En un lugar de La Mancha de cuyo
            nombre no quiero
           acordarme...</text>
        </contents>
       </book>
Probabilistic Methods for Structured Document Classification at INEX´07
  Document representation
    Method: “only text”


      Document representation




           1 Only text (removing all tags).

       Quijote
       El ingenioso hidalgo Don Quijote de la Mancha Miguel de
       Cervantes Saavedra Uno En un lugar de La Mancha de cuyo
       nombre no quiero acordarme...
Probabilistic Methods for Structured Document Classification at INEX´07
  Document representation
    Method: “text replication”


      Document representation


           5 Text replication (term frequencies are replicated several
             times, depending on the tag containing the terms).

       Replication values
       title 1 author 0 chapter 0 text 2

       Quijote
       El ingenioso hidalgo Don Quijote de la Mancha En En un un
       lugar lugar de de La La Mancha Mancha de de cuyo cuyo
       nombre nombre no no quiero quiero acordarme acordarme...
Probabilistic Methods for Structured Document Classification at INEX´07
  Results



      Results

               Five runs were submitted to the track:
        (1) Naive Bayes, only text, no term selection. Microaverage:
            0.77630. Macroaverage: 0.58536.
        (2) Naive Bayes, replication (id=2), no term selection.
            Microaverage: 0.78107 (+0.6%). Macroaverage: 0.6373
            (+8.9%).
        (3) Or gate, maximum likelihood, replication (id=8), selection
            by MI. Microaverage: 0.75097. Macroaverage: 0.61973.
        (4) Or gate, maximum likelihood, replication (id=5), selection
            by MI. Microaverage: 0.75354. Macroaverage: 0.61298.
        (5) Or gate, better approximation, only text, ≥ 2.
            Microaverage: 0.78998. Macroaverage: 0.76054.
               See paper for text replication values (id=2,5,8).
Probabilistic Methods for Structured Document Classification at INEX´07
  Conclusion



      Conclusions

       Also reached by our previous experiments!!
               Naive Bayes works bad with few populated categories (low
               macroaverage).
               Tagging and adding seems not to work well without feature
               selection.
               Text replication improves macroaverage (good for Naive
               Bayes! ).
               OR gates by ML needs of a per-class feature selection
               method (mutual information).
               The better approximation for the OR gate is our best
               classifier in previous experiments over flat text.
Probabilistic Methods for Structured Document Classification at INEX´07
  Conclusion



      Thank you very much!




                     Questions or comments?

More Related Content

What's hot

Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...Anirbit Mukherjee
 
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Talk icml
Talk icmlTalk icml
Talk icmlBo Li
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse LearningDatabricks
 
Lecture11
Lecture11Lecture11
Lecture11Bo Li
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksGiuseppe Broccolo
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningCharles Deledalle
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Interactive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationInteractive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationQuentin Pleplé
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machinesMostafa G. M. Mostafa
 

What's hot (15)

Nominal Schema DL 2011
Nominal Schema DL 2011Nominal Schema DL 2011
Nominal Schema DL 2011
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
My invited talk at the 2018 Annual Meeting of SIAM (Society of Industrial and...
 
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
Generative Adversarial Networks GAN - Santiago Pascual - UPC Barcelona 2018
 
Talk icml
Talk icmlTalk icml
Talk icml
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse Learning
 
Causal Bayesian Networks
Causal Bayesian NetworksCausal Bayesian Networks
Causal Bayesian Networks
 
Lecture11
Lecture11Lecture11
Lecture11
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural Networks
 
IVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learningIVR - Chapter 7 - Patch models and dictionary learning
IVR - Chapter 7 - Patch models and dictionary learning
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Interactive Latent Dirichlet Allocation
Interactive Latent Dirichlet AllocationInteractive Latent Dirichlet Allocation
Interactive Latent Dirichlet Allocation
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machines
 

Viewers also liked

Aprendizaje Supervisado con DauroLab
Aprendizaje Supervisado con DauroLabAprendizaje Supervisado con DauroLab
Aprendizaje Supervisado con DauroLabAlfonso E. Romero
 
Link-based document classification using Bayesian Networks
Link-based document classification using Bayesian NetworksLink-based document classification using Bayesian Networks
Link-based document classification using Bayesian NetworksAlfonso E. Romero
 
Recuperación de Información y el modelo de Espacio Vectorial
Recuperación de Información y el modelo de Espacio VectorialRecuperación de Información y el modelo de Espacio Vectorial
Recuperación de Información y el modelo de Espacio VectorialAlfonso E. Romero
 
Document classification models based on Bayesian networks
Document classification models based on Bayesian networksDocument classification models based on Bayesian networks
Document classification models based on Bayesian networksAlfonso E. Romero
 
Resolviendo problemas con algoritmos geneticos
Resolviendo problemas con algoritmos geneticosResolviendo problemas con algoritmos geneticos
Resolviendo problemas con algoritmos geneticosMauro Parra-Miranda
 
REDES NEURONALES Algoritmos de Aprendizaje
REDES NEURONALES Algoritmos  de AprendizajeREDES NEURONALES Algoritmos  de Aprendizaje
REDES NEURONALES Algoritmos de AprendizajeESCOM
 

Viewers also liked (9)

Aprendizaje Supervisado con DauroLab
Aprendizaje Supervisado con DauroLabAprendizaje Supervisado con DauroLab
Aprendizaje Supervisado con DauroLab
 
Link-based document classification using Bayesian Networks
Link-based document classification using Bayesian NetworksLink-based document classification using Bayesian Networks
Link-based document classification using Bayesian Networks
 
Presentation alfonso romero
Presentation alfonso romeroPresentation alfonso romero
Presentation alfonso romero
 
Recuperación de Información y el modelo de Espacio Vectorial
Recuperación de Información y el modelo de Espacio VectorialRecuperación de Información y el modelo de Espacio Vectorial
Recuperación de Información y el modelo de Espacio Vectorial
 
Document classification models based on Bayesian networks
Document classification models based on Bayesian networksDocument classification models based on Bayesian networks
Document classification models based on Bayesian networks
 
Metodos Predictivos
Metodos PredictivosMetodos Predictivos
Metodos Predictivos
 
Resolviendo problemas con algoritmos geneticos
Resolviendo problemas con algoritmos geneticosResolviendo problemas con algoritmos geneticos
Resolviendo problemas con algoritmos geneticos
 
REDES NEURONALES Algoritmos de Aprendizaje
REDES NEURONALES Algoritmos  de AprendizajeREDES NEURONALES Algoritmos  de Aprendizaje
REDES NEURONALES Algoritmos de Aprendizaje
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 

Similar to Inex07

lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.pptIndra Hermawan
 
Naive Bayes for the Superbowl
Naive Bayes for the SuperbowlNaive Bayes for the Superbowl
Naive Bayes for the SuperbowlJohn Liu
 
A probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptA probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptirisshicat
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare EventsTaegyun Jeon
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networksm.a.kirn
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Mostafa G. M. Mostafa
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text ClassificationSai Srinivas Kotni
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Kira
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrievalNisha Arankandath
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream MiningAlbert Bifet
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)SSA KPI
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Kernel estimation(ref)
Kernel estimation(ref)Kernel estimation(ref)
Kernel estimation(ref)Zahra Amini
 

Similar to Inex07 (20)

lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.ppt
 
Bayesian Core: Chapter 8
Bayesian Core: Chapter 8Bayesian Core: Chapter 8
Bayesian Core: Chapter 8
 
Naive Bayes for the Superbowl
Naive Bayes for the SuperbowlNaive Bayes for the Superbowl
Naive Bayes for the Superbowl
 
ML.pptx
ML.pptxML.pptx
ML.pptx
 
A probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features pptA probabilistic model for recursive factorized image features ppt
A probabilistic model for recursive factorized image features ppt
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events
 
kcde
kcdekcde
kcde
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
 
Encoding survey
Encoding surveyEncoding survey
Encoding survey
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
DL for molecules
DL for moleculesDL for molecules
DL for molecules
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrieval
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream Mining
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Kernel estimation(ref)
Kernel estimation(ref)Kernel estimation(ref)
Kernel estimation(ref)
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 

Recently uploaded

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Inex07

  • 1. Probabilistic Methods for Structured Document Classification at INEX´07 Probabilistic Methods for Structured Document Classification at INEX´07 ´ Luis M. de Campos, Juan M. Fernandez-Luna, Juan F. Huete, Alfonso E. Romero ´ Departamento de Ciencias de la Computacion e Inteligencia Artificial Universidad de Granada {lci,jmfluna,jhg,aeromero}@decsai.ugr.es INEX 2007, Dagstuhl (Germany) December 18, 2007
  • 2. Probabilistic Methods for Structured Document Classification at INEX´07 Abstract We present here the result of our participation in the Document Mining track at INEX´07. We submitted several runs for this track (only classification). This is the first year we apply for, with relative good results.
  • 3. Probabilistic Methods for Structured Document Classification at INEX´07 Our aim Finding good XML to flat-document transformations... ...in order to apply probabilistic flat-text and, to improve flat classifiers result.
  • 4. Probabilistic Methods for Structured Document Classification at INEX´07 Our aim Finding good XML to flat-document transformations... ...in order to apply probabilistic flat-text and, to improve flat classifiers result.
  • 5. Probabilistic Methods for Structured Document Classification at INEX´07 Our aim Finding good XML to flat-document transformations... ...in order to apply probabilistic flat-text and, to improve flat classifiers result.
  • 6. Probabilistic Methods for Structured Document Classification at INEX´07 The models Multinomial Naive Bayes The models: Multinomial Naive Bayes Extensively used in text classification (see paper by MacCallum). Posterior probability of a class is computed using the bayes rule: p(ci )p(dj |ci ) p(ci |dj ) = ∝ p(ci )p(dj |ci ) (1) p(dj ) Probabilities p(dj |ci ) are supossed to follow a multinomial distribution: p(tk |ci )njk p(dj |ci ) ∝ (2) tk ∈dj The estimation of the needed values is carried out in the N following way: p(tk |ci ) = Nik+M and p(ci ) = Ni,doc . +1 Ni doc
  • 7. Probabilistic Methods for Structured Document Classification at INEX´07 The models OR Gate Bayesian Network Classifier The OR Gate Bayesian Network Classifier Tries to model rules of the following kind: IF (ti ∈ d) ∨ (tj ∈ d) ∨ ... THEN classify by c. The canonical model used is a noisy OR gate, wich is related with the notion of causality. The appearance of some terms is the cause for the assignation of a certain class. The general expression of the probability distributions is this: p(ci |pa(Ci )) = 1 − (1 − w(Tk , Ci )) Tk ∈R(pa(Ci )) p(c i |pa(Ci )) = 1 − p(ci |pa(Ci ))
  • 8. Probabilistic Methods for Structured Document Classification at INEX´07 The models OR Gate Bayesian Network Classifier The OR Gate Bayesian Network Classifier (2) We instantiate all terms of a document: if tk ∈ dj , p(tk |dj ) = 1, 0 otherwise. We replicate each term node k with its frequency in the document njk . The computation of the posterior probability is very easy (only depending on the terms appearing on the document): (1 − w(Tk , Ci ))njk . p(ci |dj ) = 1 − (3) Tk ∈Pa(Ci )∩dj Two estimation schemes for the weights: Nik Maximum likelihood: w(Tk , Ci ) = Nk . (Ni −Nih )N = Nik × Better approximation: w(Tk , Ci ) h=k (N−Nh )Ni Nk See paper for details!
  • 9. Probabilistic Methods for Structured Document Classification at INEX´07 Document representation Example XML file Document representation: example XML file <book> <title>El ingenioso hidalgo Don Quijote de la Mancha</title> <author>Miguel de Cervantes Saavedra</author> <contents> <chapter>Uno</chapter> <text>En un lugar de La Mancha de cuyo nombre no quiero acordarme...</text> </contents> </book>
  • 10. Probabilistic Methods for Structured Document Classification at INEX´07 Document representation Method: “only text” Document representation 1 Only text (removing all tags). Quijote El ingenioso hidalgo Don Quijote de la Mancha Miguel de Cervantes Saavedra Uno En un lugar de La Mancha de cuyo nombre no quiero acordarme...
  • 11. Probabilistic Methods for Structured Document Classification at INEX´07 Document representation Method: “text replication” Document representation 5 Text replication (term frequencies are replicated several times, depending on the tag containing the terms). Replication values title 1 author 0 chapter 0 text 2 Quijote El ingenioso hidalgo Don Quijote de la Mancha En En un un lugar lugar de de La La Mancha Mancha de de cuyo cuyo nombre nombre no no quiero quiero acordarme acordarme...
  • 12. Probabilistic Methods for Structured Document Classification at INEX´07 Results Results Five runs were submitted to the track: (1) Naive Bayes, only text, no term selection. Microaverage: 0.77630. Macroaverage: 0.58536. (2) Naive Bayes, replication (id=2), no term selection. Microaverage: 0.78107 (+0.6%). Macroaverage: 0.6373 (+8.9%). (3) Or gate, maximum likelihood, replication (id=8), selection by MI. Microaverage: 0.75097. Macroaverage: 0.61973. (4) Or gate, maximum likelihood, replication (id=5), selection by MI. Microaverage: 0.75354. Macroaverage: 0.61298. (5) Or gate, better approximation, only text, ≥ 2. Microaverage: 0.78998. Macroaverage: 0.76054. See paper for text replication values (id=2,5,8).
  • 13. Probabilistic Methods for Structured Document Classification at INEX´07 Conclusion Conclusions Also reached by our previous experiments!! Naive Bayes works bad with few populated categories (low macroaverage). Tagging and adding seems not to work well without feature selection. Text replication improves macroaverage (good for Naive Bayes! ). OR gates by ML needs of a per-class feature selection method (mutual information). The better approximation for the OR gate is our best classifier in previous experiments over flat text.
  • 14. Probabilistic Methods for Structured Document Classification at INEX´07 Conclusion Thank you very much! Questions or comments?