TRank: Ranking Entity Types Using the Web of Data
Alberto Tonon (1), Michele Catasta (2), Gianluca Demartini (1), Philippe Cudré-Mauroux (1), and Karl Aberer (2)
(1) eXascale Infolab, University of Fribourg, Switzerland
    {alberto, demartini, phil}@exascale.info
(2) Distributed Information Systems Laboratory, EPFL, Switzerland
    {firstname.lastname}@epfl.ch
ISWC, 25 October 2013
Why Entities?
• The Web is getting entity-centric!
• Entity-centric services
Google
…and Why Types?
• “Summarization” of texts
• Contextual entity summaries in Web pages
• Disambiguation of other entities
• Diversification of search results

Example article: “Bin Laden Relative Pleads Not Guilty in Terrorism Case”
Entity: type
– Osama Bin Laden: Al-Qaeda Propagandists
– Abu Ghaith: Kuwaiti Al-Qaeda members
– Lewis Kaplan: Judge
– Manhattan: Borough (New York City)

Example snippet: “Sulaiman Abu Ghaith, a son-in-law of Osama bin Laden who once served as a spokesman for Al Qaeda”
Types shown for the entities in the snippet: Al-Qaeda Propagandist, Kuwaiti Al-Qaeda members, Jihadist Organizations
Entities May Have Many Types
Example types attached to a single entity:
• Thing
• Agent
• Person
• Living People
• American Billionaires
• American Computer Programmers
• American Philanthropists
• American People of Scottish Descent
• Harvard University People
• People from King County
• People from Seattle
• Windows People
Our Task: Ranking Types Given a Context
• Input: a knowledge base G, an entity e, and a context c in which e appears.
• Output: e’s types ranked by relevance with respect to the context c.
• Evaluation: crowdsourcing + MAP, NDCG

Example:
G: DBpedia 3.8
e: Bill Gates
c: «Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.»
Ranked types for Bill Gates:
1. American Chief executive
2. American Computer Programmer
3. American Billionaires
4. …
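The two evaluation metrics named above can be sketched as follows. This is a minimal illustration, not the evaluation code used for TRank: it assumes a linear gain with a log2 discount for NDCG, and the relevance values in the example are invented. MAP is the mean of `average_precision` over all entity/context pairs.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_types, relevance):
    # relevance: type -> graded relevance (assumed to come from crowd judgments).
    gains = [relevance.get(t, 0) for t in ranked_types]
    ideal = sorted(relevance.values(), reverse=True)[:len(gains)]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

def average_precision(ranked_types, relevant):
    # relevant: set of types considered relevant (binary judgments).
    hits, total = 0, 0.0
    for i, t in enumerate(ranked_types, start=1):
        if t in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical judgments for the Bill Gates example.
relevance = {"American Computer Programmers": 3, "American Billionaires": 1, "Thing": 0}
ranked = ["American Billionaires", "American Computer Programmers", "Thing"]
print(ndcg(ranked, relevance))
print(average_precision(ranked, {"American Computer Programmers"}))
```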
TRank Pipeline
1. Text extraction (BoilerPipe)
2. Named Entity Recognition (Stanford NER), producing a list of entity labels
3. Entity linking (inverted index: DBpedia labels ⟹ resource URIs), producing a list of entity URIs
4. For each entity URI: type retrieval (inverted index: resource URIs ⟹ type URIs), producing a list of type URIs
5. For each entity URI: type ranking, producing a ranked list of types
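The data flow of the last three pipeline steps can be sketched with plain dictionaries standing in for the inverted indices. Everything here is an illustrative assumption: the function names, the toy index contents and the placeholder ranker are not taken from the TRank code base, and text extraction and NER are assumed to have already produced the list of labels.

```python
def link_entities(labels, label_index):
    # Entity linking: look each label up in the label -> resource URI index.
    return [label_index[label] for label in labels if label in label_index]

def retrieve_types(entity_uri, type_index):
    # Type retrieval: look the resource URI up in the URI -> type URIs index.
    return type_index.get(entity_uri, [])

def trank_pipeline(labels, label_index, type_index, rank_types):
    # For each linked entity, fetch its types and hand them to a ranking function.
    results = {}
    for uri in link_entities(labels, label_index):
        results[uri] = rank_types(uri, retrieve_types(uri, type_index))
    return results

# Toy indices (hypothetical content).
LABEL_INDEX = {"Bill Gates": "dbr:Bill_Gates", "Microsoft": "dbr:Microsoft"}
TYPE_INDEX = {
    "dbr:Bill_Gates": ["Person", "AmericanBillionaires"],
    "dbr:Microsoft": ["Company"],
}

# Placeholder ranker: alphabetical order, standing in for a real ranking approach.
ranked = trank_pipeline(
    ["Bill Gates", "Paul Allen", "Microsoft"],
    LABEL_INDEX,
    TYPE_INDEX,
    lambda uri, types: sorted(types),
)
print(ranked)
```

Labels with no entry in the label index ("Paul Allen" in this toy data) are simply skipped.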
Type Hierarchy
[Figure: a single type hierarchy rooted at <owl:Thing> that combines DBpedia, schema.org and YAGO types. subClassOf relationships are either explicit, inferred from <owl:equivalentClass>, manually added, or taken from the PARIS ontology mapping (YAGO/DBpedia mappings).]
Ranking Algorithms
• Entity-centric
• Hierarchy-based
• Context-aware (featuring type-hierarchy)
• Learning to rank
Entity-Centric Ranking Approaches (An Example)
• SAMEAS
Score(e, t) = number of URIs representing e with type t.
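A minimal sketch of the SAMEAS score, assuming the sameAs links and the types of each URI are already available as dictionaries. The URIs and type sets are invented, and counting the entity's own URI alongside its sameAs aliases is an assumption of this sketch.

```python
def sameas_score(entity, type_uri, same_as, types_of):
    # URIs considered: the entity itself plus every URI linked to it via sameAs.
    uris = {entity} | set(same_as.get(entity, ()))
    # Count how many of those URIs carry the type being scored.
    return sum(1 for uri in uris if type_uri in types_of.get(uri, ()))

# Hypothetical data.
SAME_AS = {"dbr:Bill_Gates": ["yago:Bill_Gates", "fb:Bill_Gates"]}
TYPES_OF = {
    "dbr:Bill_Gates": {"Person"},
    "yago:Bill_Gates": {"Person", "Billionaire"},
    "fb:Bill_Gates": {"Person", "Billionaire"},
}

print(sameas_score("dbr:Bill_Gates", "Person", SAME_AS, TYPES_OF))
print(sameas_score("dbr:Bill_Gates", "Billionaire", SAME_AS, TYPES_OF))
```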
Hierarchy-Based Approaches (An Example)
• ANCESTORS
Score(e, t) = number of t’s ancestors in the type hierarchy that are contained in Te, the set of types associated with e.
Note: Te often doesn’t contain all supertypes of a specific type.
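The ANCESTORS score can be sketched as below, assuming a tree-shaped hierarchy stored as a child -> parent dictionary (a hypothetical representation). In the toy data, Te deliberately omits "Agent" to mirror the note that Te often lacks some supertypes.

```python
def ancestors(t, parent):
    # Walk subClassOf links upwards, collecting every proper ancestor of t.
    result, seen = [], {t}
    while t in parent and parent[t] not in seen:
        t = parent[t]
        seen.add(t)
        result.append(t)
    return result

def ancestors_score(t, entity_types, parent):
    # Number of t's ancestors that also appear in the entity's type set Te.
    return sum(1 for a in ancestors(t, parent) if a in entity_types)

# Hypothetical hierarchy and type set.
PARENT = {"AmericanActor": "Actor", "Actor": "Person", "Person": "Agent", "Agent": "Thing"}
T_E = {"AmericanActor", "Actor", "Person", "Thing"}

for t in ("AmericanActor", "Person", "Thing"):
    print(t, ancestors_score(t, T_E, PARENT))
```

More specific types have more ancestors in Te, so they tend to score higher.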
Context-Aware Ranking Approaches (An Example)
• SAMETYPE
Score(e, t, cT) = number of times t appears among the types of the other entities in cT.
[Figure: a context containing the entities e, e' and e'', annotated with the types Person, Actor, AmericanActor, Organization and Thing.]
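A minimal sketch of SAMETYPE, assuming the context is given as a dictionary from each entity to its set of types. The assignment of types to e, e' and e'' below is illustrative and not read off the original figure.

```python
def sametype_score(entity, t, context_types):
    # Count the other entities in the context that also have type t.
    return sum(
        1
        for other, types in context_types.items()
        if other != entity and t in types
    )

# Hypothetical context: entity -> set of types.
CONTEXT = {
    "e": {"Actor", "AmericanActor"},
    "e'": {"Person", "Actor"},
    "e''": {"Organization", "Thing"},
}

print(sametype_score("e", "Actor", CONTEXT))
print(sametype_score("e", "AmericanActor", CONTEXT))
```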
Learning to Rank Entity Types
Determine an optimal combination of all our approaches:
• Decision trees
• Linear regression models
• 10-fold cross-validation
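The 10-fold cross-validation protocol can be sketched independently of the learner. This sketch only produces the train/test splits; the round-robin fold assignment is an assumption, and the model fitted on each training split (decision trees or linear regression over the scores of the individual approaches) is left out.

```python
def k_fold_splits(items, k=10):
    # Assign items to k folds round-robin, then hold each fold out once.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Toy example: 25 hypothetical entity/context pairs identified by number.
examples = list(range(25))
for train, test in k_fold_splits(examples, k=10):
    print(len(train), len(test))
```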
Avoiding SPARQL Queries with Inverted Indices and Map/Reduce
• TRank is implemented with Hadoop and Map/Reduce.
• All computations are done using inverted indices:
– Entity linking
– Path index
– Depth index
• The inverted indices are publicly available at exascale.info/TRank
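One plausible reading of the path and depth indices is a precomputed lookup from each type to its root path and to its depth, so that no hierarchy traversal (or SPARQL query) is needed at ranking time. The sketch below builds both from a child -> parent dictionary. The representation, the function names and the assumption that every type reaches the root through exactly one parent are all hypothetical.

```python
def build_path_index(parent, root="owl:Thing"):
    # Map every type to the list of types on the path from the root down to it.
    # Assumes every type reaches the root by following parent links.
    index = {}
    for t in parent.keys() | {root}:
        path = [t]
        while path[-1] != root:
            path.append(parent[path[-1]])
        index[t] = list(reversed(path))
    return index

def build_depth_index(path_index):
    # Depth of a type = number of edges between the root and the type.
    return {t: len(path) - 1 for t, path in path_index.items()}

# Hypothetical hierarchy.
PARENT = {"Agent": "owl:Thing", "Person": "Agent", "Actor": "Person"}
PATH_INDEX = build_path_index(PARENT)
DEPTH_INDEX = build_depth_index(PATH_INDEX)

print(PATH_INDEX["Actor"])
print(DEPTH_INDEX["Actor"])
```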
EXPERIMENTAL EVALUATION
Datasets
• 128 recent NYTimes articles split to create:
– Entity Collection
– Sentence Collection
– Paragraph Collection
– 3-Paragraphs Collection
• Ground truth obtained through crowdsourcing
– 3 workers per entity/context
– 4 levels of relevance for each type
– Overall cost: $190
Effectiveness Evaluation
Check our paper or contact us for a complete description of all the approaches we evaluated.
Efficiency Evaluation
• Tested efficiency on a 1 TB CommonCrawl sample
– 1,310,459 HTML pages
– 23 GB compressed
• Map/Reduce on a cluster of 8 machines with 12 cores, 32 GB of RAM and 3 SATA disks
• On average, 25 min. processing time (more than 100 documents per node per second)

Share of processing time per pipeline step:
Text Extraction    18.9%
NER                35.6%
Entity Linking     29.5%
Type Retrieval      9.8%
Type Ranking        6.2%
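The throughput claim can be checked from the figures on the slide. The variable names below are only for illustration.

```python
pages = 1_310_459   # HTML pages in the CommonCrawl sample
minutes = 25        # average processing time
nodes = 8           # machines in the cluster

seconds = minutes * 60
docs_per_second = pages / seconds
docs_per_node_per_second = docs_per_second / nodes

print(round(docs_per_second, 1))
print(round(docs_per_node_per_second, 1))
```

This gives roughly 874 documents per second for the whole cluster, or about 109 documents per node per second, consistent with the stated figure of more than 100.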
Conclusions
• New task: ranking entity types.
– Useful for: “summarization” of Web documents, entity summaries, disambiguation.
• Various approaches: entity-centric, context-aware, hierarchy-based, learning to rank.
– Hierarchy-based and learning to rank are the most effective.
• Hadoop, Map/Reduce, and inverted indices to achieve scalability.
Thank you for your attention!
• Datasets (with relevance judgments!), inverted indices, evaluation tools and more material are available at exascale.info/Trank.
• TRank is open-source! https://github.com/MEM0R1ES/TRank
• Check out B-hist at the SW Challenge!
• Thanks to [logo] for the Travel Award!
Entity-Centric Ranking Approaches
• FREQ
Rank(e, t, ck) = number of triples <e> <rdf:type> <t> in the knowledge base.
• WIKILINK
Rank(e, t, ck) = number of e’s “neighbor entities” with type t.
• SAMEAS
Rank(e, t, ck) = number of URIs representing e with type t.
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label (thank you, Lucene).
Create Inverted Index
[Figure: the labels of the knowledge-base entities e1 … eN (e.g. "Tom Cruise", "Tom Hanks", "Bill Gates", "Osama Bin Laden") are tokenized into an inverted index that maps each token to the entities whose label contains it: "Tom" ⟹ e1, e3; "Cruise" ⟹ e1; "Hanks" ⟹ e3; "Bill" ⟹ e2.]
Entity-Centric Ranking Approaches
• LABEL
Rank(e, t, ck) = frequency of t among the top-10 most similar entities in terms of label. Exploits an inverted index.
[Figure: Label(e) is issued as a query against the inverted index; TF-IDF ranking returns the top-10 most similar entities (e.g. e2, e3, …).]
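The LABEL approach can be approximated in a few lines. This sketch replaces Lucene's TF-IDF scoring with a much simpler IDF-weighted token overlap, so it only illustrates the idea; the entity ids, labels and types are invented.

```python
import math
from collections import Counter, defaultdict

def build_label_index(labels):
    # labels: entity id -> label. Index: lowercase token -> set of entity ids.
    index = defaultdict(set)
    for entity, label in labels.items():
        for token in label.lower().split():
            index[token].add(entity)
    return index

def similar_entities(entity, labels, index, k=10):
    # Score other entities by the summed IDF of the label tokens they share.
    n = len(labels)
    scores = Counter()
    for token in set(labels[entity].lower().split()):
        idf = math.log(n / len(index[token]))
        for other in index[token]:
            if other != entity:
                scores[other] += idf
    return [e for e, _ in scores.most_common(k)]

def label_score(entity, t, labels, index, types_of, k=10):
    # Frequency of type t among the top-k entities with the most similar label.
    top = similar_entities(entity, labels, index, k)
    return sum(1 for other in top if t in types_of.get(other, ()))

# Hypothetical knowledge base.
LABELS = {"e1": "Tom Cruise", "e2": "Bill Gates", "e3": "Tom Hanks"}
TYPES_OF = {"e1": {"Actor"}, "e2": {"Businessperson"}, "e3": {"Actor"}}
INDEX = build_label_index(LABELS)

print(similar_entities("e3", LABELS, INDEX))
print(label_score("e3", "Actor", LABELS, INDEX, TYPES_OF))
```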
Hierarchy-Based Ranking Approaches
• DEPTH
Rank(e, t, cH) = depth of t in the type hierarchy.
• ANCESTORS
Rank(e, t, cH) = number of t’s ancestors in the type hierarchy contained in Te.
• ANC_DEPTH
Rank(e, t, cH) =
Note: Te often doesn’t contain all supertypes of a specific type.
Context-Aware Ranking Approaches
• The context can help to obtain a better ranking of types.
Example context: “Italy’s rebellious voters, who opted for a flamboyant billionaire and a clown, reminded us last week how deeply in crisis the Continent is. Meanwhile, France is going it virtually alone in Mali, and Britain talks openly of jumping the European ship altogether.”
• Which is the right type for Mali?
– Landlocked Countries
– Least Developed Countries
– States And Territories Established In 1960
– French-speaking Countries
– World Trade Organization Member Economies
– Country
– African Union Member States
– African Countries
– Member States Of La Francophonie
– African Union Member Economies
– Populated Place
– Place
Context-Aware Ranking Approaches: PATH
• Suppose we have to compute Rank(e, t, cT).
• Consider each type t’ of each other entity e’ in c.
• P(t) = path from the root of the type hierarchy to t.
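The path P(t) is easy to compute from a child -> parent dictionary. The slide does not preserve how the paths are combined into a score, so the shared-node count in `path_overlap_score` is purely an assumption made for illustration, as are the function names and the toy hierarchy.

```python
def path_to(t, parent):
    # P(t): the types on the path from the root of the hierarchy down to t.
    path = [t]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))

def path_overlap_score(t, context_types, parent):
    # ASSUMPTION: score t by the number of nodes P(t) shares with P(t')
    # for every type t' of the other entities in the context.
    p_t = set(path_to(t, parent))
    return sum(len(p_t & set(path_to(t2, parent))) for t2 in context_types)

# Hypothetical hierarchy and context types.
PARENT = {"Agent": "Thing", "Person": "Agent", "Actor": "Person", "Organization": "Agent"}
OTHER_ENTITY_TYPES = ["Actor", "Person"]

print(path_to("Actor", PARENT))
print(path_overlap_score("Actor", OTHER_ENTITY_TYPES, PARENT))
print(path_overlap_score("Organization", OTHER_ENTITY_TYPES, PARENT))
```

Under this assumed scoring, a type that lies on the same branch as the types of the co-occurring entities scores higher than a type on a different branch.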
Context-Aware Ranking Approaches
Ranking Tom Hanks’s types when co-occurring with Tom Cruise in some text.
[Figure: diagram whose nodes carry the values 1, 2, 3, 4, 4, 1, 1, 1.]
Relevance Judgments
• Crowdsourced relevance judgments.
• Anonymous Web users act as TRank users.
• 3 workers per entity/context.
• Overall cost: $190
• Pilot study on task design… mega-bubbles!
• Number of votes as relevance score for a type.
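Turning worker votes into relevance scores can be sketched as a simple count. The sketch assumes each worker selects one type per entity/context pair, so with 3 workers a type receives between 0 and 3 votes; the votes shown are invented.

```python
from collections import Counter

def relevance_from_votes(votes):
    # votes: the type chosen by each worker for one entity/context pair.
    # The relevance score of a type is the number of workers who chose it.
    return Counter(votes)

# Hypothetical votes from 3 workers for one entity in one context.
scores = relevance_from_votes(["Person", "Person", "Actor"])
print(scores["Person"], scores["Actor"], scores["Thing"])
```

A `Counter` returns 0 for types that no worker selected, so every candidate type gets a score.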

Editor's notes

  1. An entity is something that exists by itself, although it need not be of material existence.
  2. Read the types.
  3. NER and linking are state of the art; the focus is ranking types.
  4. PARIS: ontology alignment (VLDB 2012). YAGO has very specific types.
  5. Entity-centric: use only the information connected to the entity. Context-aware: exploit the types of entities that co-occur in the context (e.g. Bill Gates + Microsoft vs. Bill Gates + Scotland). Hierarchy-based: exploit the type hierarchy. Learning to rank: combine evidence coming from all previous approaches in an optimal way.
  6. We start from the node representing an entity, follow same-as links (we get other nodes representing the same entity) and count how many “new” nodes feature the type we’re giving a score to.
  7. The set of types associated with an entity in a knowledge base often doesn’t contain all supertypes.
  8. cT is the context given by the text.
  9. 10-fold cross-validation, decision tree, regression… Preliminary experiments showed that it is the best-performing model.
  10. Use inverted indices to avoid SPARQL queries!
  11. Increasing granularities of context: from no context (here is the entity, here are its types, rank them) to one sentence/paragraph (rank the types of all entities in this sentence/paragraph). 3 workers were asked to select the best type of each entity appearing in a given context.
  12. ANCESTORS is the real winner, since it uses inverted indices, which are faster, and needs no machine learning.
  13. Only pages with schema.org.
  14. Hadoop: scalable, not efficient.
  15. Semantic Web Science Association!
  16. ck is the context given by the knowledge base.