Creating a taxonomy
for Wikipedia
Patrick Nicolas
Feb 11, 2012
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas
Introduction
The goal of this study is to build a taxonomy graph for the 3+
million Wikipedia entries by leveraging WordNet
hyponyms as a training set.
The model can be used in a wide variety of commercial
applications, from context extraction and
automated wiki classification to text summarization.

Notes:
• Definitions and notations are defined in the appendices
• The presentation assumes the reader has basic knowledge of
information retrieval, natural language processing and machine
learning.

Copyright Patrick Nicolas 2012 - All rights reserved http://patricknicolas.blogspot.com

Process

The computation flow for generating a taxonomy for
Wikipedia is summarized in the following 5 steps.
1. Extract abstracts & categories from the Wikipedia datasets
2. Generate the hypernym lineages for Wikipedia entries
that overlap with WordNet synsets
3. Extract, reduce and order N-Grams and their tags
(NNP, NN, …) from each Wikipedia abstract
4. Create a training set of weighted graphs for each Wikipedia
abstract that has a corresponding hypernym hierarchy
5. Optimize and apply the model to generate taxonomy
lineages for each Wikipedia entry

Semantic Data Sources
Terms Frequency Corpora
The Reuters corpus and Google N-Grams frequencies are used to
compute the inverse document frequency values.

WordNet Hypernyms
The WordNet database of synsets is used to generate a hierarchy of
hypernyms, e.g.
entity/physical entity/object/location/region/district/country/European country/Italy
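A hypernym lineage like the one above can be generated by walking parent links up to the root. Below is a minimal sketch over a hand-coded toy hypernym map; a real implementation would query the WordNet database instead (e.g. NLTK's `Synset.hypernym_paths()`).

```python
# Toy hypernym map (child -> parent); stands in for WordNet lookups.
HYPERNYM = {
    "Italy": "European country",
    "European country": "country",
    "country": "district",
    "district": "region",
    "region": "location",
    "location": "object",
    "object": "physical entity",
    "physical entity": "entity",
}

def lineage(term: str) -> list:
    """Walk the hypernym chain from a term up to the root 'entity'."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path[::-1]  # root first, as in entity/.../Italy

print("/".join(lineage("Italy")))
# entity/physical entity/object/location/region/district/country/European country/Italy
```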

Wikipedia Datasets
Entry (label), long abstract and categories are to be extracted
from the Wikipedia reference database.

N-Grams Extraction Model
The relevancy (or weight ω) of an N-Gram to the context of a
document depends on syntactic, semantic and probabilistic features.
Fig. 1 Illustration of the features of the N-Gram extraction model: frequency of the N-Gram in the document (fD), frequency of its terms, frequency of the N-Gram in the universe/corpus (idf), frequency of the N-Gram in the category abstracts, similarity of the N-Gram with categories, the N-Gram tag, whether it has a semantic definition, and whether it is contained in the 1st sentence, with coefficients α, β, ρ, φ.

Computation Flow
The computation flow is broken down into ‘plug & play’ processing units to
enable design of experiments and auditing.
Fig. 2 Typical computation flow for the generation of a taxonomy: the N-Grams corpus (idf, N-Gram frequencies), the Wikipedia datasets (abstracts, labels, categories) and the WordNet synsets feed the weighted N-Grams, N-Gram tags and semantic matches, which produce the normalized N-Gram weights, labeled lineages and hypernym taxonomy graph used to train the model.

N-Grams Frequency Analysis
Let’s define an N-Gram w(n) (e.g. w(3) for a 3-Gram). The frequency of
the N-Gram within the corpus C is expressed as

The inverse document frequency (idf) is computed as

Let w(n) be an N-Gram with frequency count(w(n)), composed of terms
wj, j = 1…n, each with frequency count(wj) within a document D. The
frequency of the N-Gram within the document is computed as
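The formulas themselves did not survive extraction. Under the assumption that the slide used conventional tf-idf definitions, the three quantities would read as follows (assumed reconstructions, not the author's exact forms):

```latex
% Assumed standard forms, consistent with the surrounding definitions.
% Relative frequency of the N-Gram w^{(n)} within the corpus C:
f_C\bigl(w^{(n)}\bigr) = \frac{count_C\bigl(w^{(n)}\bigr)}
                              {\sum_{v^{(n)} \in C} count_C\bigl(v^{(n)}\bigr)}

% Inverse document frequency over the documents D of the corpus:
idf\bigl(w^{(n)}\bigr) = \log\frac{|C|}
                                  {\bigl|\{\,D \in C : w^{(n)} \in D\,\}\bigr|}

% Frequency of the N-Gram within a document D, relative to its terms w_j:
f_D\bigl(w^{(n)}\bigr) = \frac{count_D\bigl(w^{(n)}\bigr)}
                              {\sum_{j=1}^{n} count_D(w_j)}
```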

Weighting N-Grams

Most Wikipedia concepts are well described in the first sentence
of the abstract, so we can assign a greater weight to N-Grams
contained in the first sentence. The frequency f1D of
an N-Gram in the 1st sentence of a document is defined as

A simple regression analysis showed that a square root function
provides a more accurate contribution (weight) of an N-Gram in a
document D.
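The square-root damping with a first-sentence boost can be sketched as below; the boost factor and the exact combination are assumptions for illustration, not the slide's model.

```python
import math

def ngram_weight(count_in_doc: int, total_ngrams: int,
                 in_first_sentence: bool, boost: float = 2.0) -> float:
    """Hypothetical sketch: square-root damped relative frequency,
    boosted when the N-Gram appears in the abstract's first sentence."""
    f = count_in_doc / total_ngrams  # raw relative frequency
    w = math.sqrt(f)                 # sqrt damping, per the regression note
    return boost * w if in_first_sentence else w

print(ngram_weight(4, 100, in_first_sentence=False))  # ~0.2
print(ngram_weight(4, 100, in_first_sentence=True))   # ~0.4
```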

Tagging N-Grams

Although Conditional Random Fields are the predominant discriminative
classifiers for predicting sentence boundaries and token tags, we found
that Maximum Entropy with binary features was more appropriate for
classifying the first term of a sentence (NNP or NN).
The model’s feature functions ft(w) → {0,1} are extracted by
maximizing the entropy H(p) of the probability that a word w has a
specific tag t.

Subject to the constraints:
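Binary feature functions of the kind a maximum-entropy tagger consumes map a word to {0, 1}. The specific features below are illustrative guesses, not the study's feature set.

```python
# Sketch of binary feature functions f_t(w) -> {0, 1} that could help
# decide NNP vs NN for the first term of a sentence.
def f_capitalized(word: str) -> int:
    return int(word[:1].isupper())

def f_all_caps(word: str) -> int:
    return int(word.isupper() and len(word) > 1)

def f_has_digit(word: str) -> int:
    return int(any(ch.isdigit() for ch in word))

def features(word: str) -> list:
    return [f_capitalized(word), f_all_caps(word), f_has_digit(word)]

print(features("Wikipedia"))  # [1, 0, 0]
```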

Wikipedia Tags Distribution
We extract the tags of Wikipedia
entries (1- to 4-Grams) in the
context of their abstracts. The
distribution of tag frequencies
shows that proper nouns
(NNP tags) are the
predominant tags.

The frequency distribution is used as
the prior probability of finding a
Wikipedia entry with a specific tag.
Tag Predictive Model

We use a multinomial Naïve Bayes classifier to predict the tag of
any given Wikipedia entry.
Let’s define a set of classes Ck = { w(n) | tg(w(n)) = k } of
Wikipedia entries with specific tags (CNNP, CNN, …) and p(t|Ck)
the prior probability that a tag t belongs to a class.
The likelihood that a given Wikipedia entry has tag k is
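A toy multinomial Naïve Bayes over per-term tags gives the flavor of the prediction; the training counts below are made up for illustration and are not the study's data.

```python
from collections import Counter
from math import log

# Made-up training data: class label -> list of entries (tag sequences).
training = {
    "NNP": [["NNP", "NNP"], ["NNP"], ["NNP", "NN"]],
    "NN":  [["NN"], ["NN", "NN"], ["JJ", "NN"]],
}

def train(data):
    """Estimate class priors p(Ck) and tag likelihoods p(t|Ck)."""
    priors, likelihoods = {}, {}
    total = sum(len(entries) for entries in data.values())
    for k, entries in data.items():
        priors[k] = len(entries) / total
        tags = Counter(t for entry in entries for t in entry)
        n = sum(tags.values())
        likelihoods[k] = {t: c / n for t, c in tags.items()}
    return priors, likelihoods

def predict(entry_tags, priors, likelihoods, eps=1e-6):
    """Pick the class maximizing log p(Ck) + sum log p(t|Ck)."""
    def score(k):
        return log(priors[k]) + sum(
            log(likelihoods[k].get(t, eps)) for t in entry_tags)
    return max(priors, key=score)

priors, lik = train(training)
print(predict(["NNP", "NNP"], priors, lik))  # NNP
```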

Taxonomy Weighted Graph

Let’s define:
• taxonomy class (or taxon) as a
graph node representing a
hypernym (e.g. class = ‘person’)
• taxonomy instance as an entity
name (e.g. instance = ‘Peter’, or
Peter IS-A person)
• taxonomy lineage as the list
of ancestors (hypernyms) of
an instance
Fig. Example of taxonomy lineage

Document taxonomy

Any document can be represented as a weighted
graph of taxonomy classes and instances.

Fig. Example of taxonomy graph

Propagation Rule for Taxonomy Weights

The flow model is applied to the taxonomy weighted graph to compute
the weight of each taxonomy class from the normalized weights of
semantic N-Grams. The weights of taxonomy classes are normalized
relative to the root ‘entity’ (ω = 1). The taxonomy instances (N-Grams) are
ordered and normalized by their respective weights ω( wk(n) )

Fig. Weight propagation in Taxonomy Graph
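The propagation rule can be sketched as pushing each instance's normalized weight up its lineage and rescaling so the root 'entity' gets ω = 1. Summation as the combine rule is an assumption; the slide does not spell out the exact operator, and the instance weights below are made up.

```python
from collections import defaultdict

# Each instance: (normalized N-Gram weight, lineage from root to class).
instances = {
    "Italy": (0.7, ["entity", "object", "location", "country"]),
    "Peter": (0.3, ["entity", "object", "person"]),
}

def propagate(instances):
    """Accumulate instance weights on every ancestor class, then
    normalize so the root 'entity' has weight 1."""
    acc = defaultdict(float)
    for weight, lineage in instances.values():
        for cls in lineage:
            acc[cls] += weight
    root = acc["entity"]
    return {cls: w / root for cls, w in acc.items()}

weights = propagate(instances)
print(weights["entity"], weights["country"], weights["person"])
```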

Normalized Taxonomy Weight in Wikipedia

We analyze the distribution of weights along the taxonomy lineage for all Wikipedia entries.

Lineage Weights Estimator

Training with the initial set of WordNet hypernyms shows
that the distribution of normalized weights ωk along the taxonomy
lineage for a specific similarity class C can be approximated with a
polynomial (spline) function.

This estimator is used in the classification of the taxonomy
lineages of a Wikipedia abstract.

Similarity Metrics

In order to train a model using labeled WordNet hypernyms, a
similarity (or distance) metric needs to be defined. Let’s consider two
taxonomy lineages Vj and Vk of respective lengths n(j) and n(k).

Cosine Distance
Shortest Path Distance
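The two measures can be sketched as follows; the slide's exact formulas were lost, so these are standard definitions (cosine over bag-of-classes vectors, path length through the deepest shared ancestor), not necessarily the author's.

```python
import math
from collections import Counter

def cosine_similarity(v1, v2):
    """Cosine over bag-of-classes vectors of the two lineages."""
    a, b = Counter(v1), Counter(v2)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

def shortest_path_distance(v1, v2):
    """Edges from each leaf up to the deepest shared ancestor,
    assuming lineages are listed root-first."""
    common = 0
    for x, y in zip(v1, v2):
        if x != y:
            break
        common += 1
    return (len(v1) - common) + (len(v2) - common)

italy = ["entity", "object", "location", "country", "Italy"]
france = ["entity", "object", "location", "country", "France"]
print(shortest_path_distance(italy, france))  # 2
```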

Taxonomy Generation Model
Let’s consider m classes of taxonomy lineage similarity and a labeled
lineage VH. A class Ci is defined by

A taxonomy lineage Vj is classified using Naïve Bayes.

Appendix: notation

Appendix: References

• “Introduction to Information Retrieval”, C. Manning, P. Raghavan,
H. Schütze, Cambridge University Press
• “The Elements of Statistical Learning”, T. Hastie, R. Tibshirani,
J. Friedman, Springer
• “Semantic Taxonomy Induction from Heterogeneous Evidence”,
R. Snow, D. Jurafsky, A. Ng
• “A Study on Linking Wikipedia Categories to WordNet Synsets
using Text Similarity”, A. Toral, O. Fernandez, E. Agirre, R. Muñoz
• “Regularization Predicts While Discovering Taxonomy”, Y. Mroueh,
T. Poggio, L. Rosasco
• “Natural Language Semantics Term Project”, M. Tao
• “A Maximum Entropy Approach to Natural Language Processing”,
A. Berger, V. Della Pietra, S. Della Pietra

