From Expert Finding to Entity
Search on the Web
Full-day Tutorial at ECIR 2012
1st April 2012
Gianluca Demartini, Peter Mika,
Thanh Tran, Arjen P. de Vries
http://diuf.unifr.ch/main/xi/EntitySearchTutorial
Presenters
• Dr. Gianluca Demartini
– eXascale Infolab, University of Fribourg, Switzerland
– Research Interests:
• Entity Search
• IR Evaluation
• Semantic Web
01 Apr 2012 2
ECIR 2012 Tutorial - From Expert Finding to
Entity Search on the Web
Presenters
• Dr. Peter Mika
– Senior Researcher, Yahoo! Research, Barcelona
– Semantic Search group at Yahoo! Barcelona
– Semantic Search, Web Object Retrieval, Natural
Language Processing
Presenters
• Dr. Thanh Tran
– (Institut AIFB, Universität
Karlsruhe, Germany)
– Semantic Search group at AIFB
– Semantic Search, Semantic Data
Management, Linked Data
Query Processing
Presenters
• Prof.dr.ir. Arjen P. de Vries
– Interactive Information Access research group,
Centrum Wiskunde & Informatica (CWI); Delft
University of Technology; Spinque
– Research interest: the intersection between
information retrieval and databases
Van Rijsbergen, 1979
Entity
• An entity is a “proper noun”, “something that
is referred to”
Outcome of “definition” discussion reported in SIGIR Workshop Report
on The First International Workshop on Entity-Oriented Search (EOS),
Krisztian Balog, Arjen P. de Vries, Pavel Serdyukov, Ji-Rong Wen, ACM
SIGIR Forum, Vol. 45, No. 2, Dec. 2011, pp 43-50
Entity Search
• All search tasks that aim to retrieve an entity, rather than a
document, in answer to a user query
– People, Countries, Movies, Restaurants, etc.
Motivation
• Information is entity-centric
• Search for information is often conducted
around entities (Query log analysis)
– Many queries (50%) search for specific entities
instead of documents [Kumar&Tomkins09]
• Traditional search retrieves a list of blue links
• Novel web experiences may be designed
around entities instead
Here, for one search query “Nicole Kidman”,
various entities make up the answer:
bio
photos
movies
trivia
quotes
(...)
Entity-centric Applications
• Enterprise applications
• News portals
• Movie portals
• Product reviews
• Search Engines
Entities in SERP
Entity Search: The Pipeline
• Entity Representation (DB/SW)
• Entity Extraction (NLP)
• Entity Linking and De-duplication (DB/SW)
• Entity Storage and Indexing (DB/SW)
• Entity Search and Ranking (IR)
• Result presentation (HCI)
• Evaluation (HCI/IR)
Outline
• FULL DAY Tutorial (sorry, this is not a joke :)
• Morning
– Data (Peter)
– Data Management (Thanh)
• Afternoon
– Search and Ranking (Gianluca & Thanh)
– Evaluation (Arjen)
Morning
• Data
– Structured vs. Unstructured
– Entity Profiles: data models, entity identifiers,
standards
– Datasets (Desktop, Enterprise, Wikipedia, Web, RDF)
• Data Management
– Entity Extraction
– Entity de-duplication / data fusion
– Entity storage & indexing
Afternoon
• Search and Ranking
– Expert Finding models
– Entity Ranking in Wikipedia
– Web Entity Retrieval
– Entity Search over Structured Data
– Relational Entity Search over Structured Data
• Evaluation
– TREC Enterprise
– INEX Entity Ranking
– TREC Entity
– SemSearch, Ad-hoc Object Retrieval
Data for Semantic Search
Data
• Web data
– Information Extraction
– Semantic Web
• Non-web data
– Enterprise data
– Desktop data
– Email
– ...
Data on the Web
• Most web pages on the Web are generated from
structured data
– Data is stored in relational databases (typically)
– Queried through web forms
– Presented as tables or simply as unstructured text
• The structure and semantics (meaning) of the
data is not directly accessible to search engines
• Two solutions
– Information Extraction [see Part 2]
– Relying on publishers to use Semantic Web formats
• Linked Data vs. metadata in HTML
Semantic Web
• Sharing structured data across the Web
– Standard graph-based data model
• RDF
– A number of syntaxes (file formats)
• RDF/XML, RDFa
– Powerful, logic-based schema languages
• OWL, RIF
– Query languages and protocols
• HTTP, SPARQL
Resource Description Framework
(RDF)
• Each resource (thing, entity) is identified by a URI; otherwise it is a
blank node
– URIs are globally unique
• Data is broken down into individual facts
– Triples of (subject, predicate, object)
• A set of triples is published together in an RDF document
[Figure: an RDF document with the triples (example:roi, name, “Roi Blanco”) and (example:roi, type, foaf:Person)]
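A minimal sketch of the triple model, using plain Python tuples rather than an RDF library. The two triples follow our reading of the figure labels; the `match` helper and the abbreviated names are illustrative assumptions, not part of any RDF API.

```python
# Triples as (subject, predicate, object) tuples. Names are abbreviated
# for illustration and are not full URIs.
triples = {
    ("example:roi", "name", "Roi Blanco"),
    ("example:roi", "type", "foaf:Person"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# All subjects that are typed as foaf:Person.
people = [s for s, _, _ in match(triples, p="type", o="foaf:Person")]
```

Pattern matching with wildcards is also the basic idea behind triple patterns in query languages such as SPARQL.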
An RDF graph
[Figure: RDF graph. Nodes: peter#123, peter#456, roi#234, foaf:Person, “Peter Mika”, “roi@yahoo-inc.com”. Edge labels: name, type (twice), sameAs, worksWith, email.]
OWL, the Web Ontology Language
• The schema language for the Semantic Web
– Classes, properties and restrictions on their usage
– Allows validation and inference
• Schema is also data
– Published just like any other RDF document
– Queries can refer to both schema and data
• e.g. taxonomy expansion: retrieve instances of a class
and instances of all subclasses
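Taxonomy expansion can be sketched in a few lines. The class hierarchy and instances below are hypothetical toy data, and the dictionaries stand in for schema and data triples; a real system would query a triple store or reasoner instead.

```python
# Hypothetical toy schema (subclass -> superclass) and data (entity -> class).
subclass_of = {"Actor": "Person", "Director": "Person", "Person": "Agent"}
instance_of = {"nicole": "Actor", "peter": "Person", "acme": "Agent"}

def subclasses(cls):
    """Return cls plus all of its direct and indirect subclasses."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

def instances(cls):
    """Instances of cls and of every subclass of cls."""
    wanted = subclasses(cls)
    return sorted(e for e, c in instance_of.items() if c in wanted)
```

Here `instances("Person")` also returns entities typed only as `Actor`, because the query consults the schema as well as the data.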
Publishing RDF and OWL
• Linked Data
– Data published as RDF documents linked to other RDF
documents
• Typically RDF/XML or Turtle
• Keep an eye on JSON-LD
– Community effort to (re-)publish open datasets
• Embedded metadata
– RDFa, microdata, microformats annotations inside webpages
– Recommended for site owners by Yahoo, Google, Facebook
• SPARQL endpoints
– Triple stores (RDF databases) that can be queried through the
web
Linked Data
• Interlinked datasets on the Web
– Often data from existing databases or APIs
• The four rules of Linked Data:
– Use URIs to identify things.
– Use HTTP URIs so that these things can be referred to
and accessed by people and crawlers.
– Use standard formats such as RDF to provide useful
information about the thing when its URI is accessed
– Include links to other datasets
• Most importantly: links to entities in other datasets that
describe the same entity
Linked Data
[Figure: the RDF graph of peter#123, peter#456 and roi#234 (edge labels name, type, sameAs, worksWith, email), with its parts attributed to three sources: Peter’s homepage, Yahoo!, and the Friend-of-a-Friend ontology.]
Linked (Open) Data = LOD
• Advantages:
– No change to the publishing of the HTML documents
– Data can be published by a third party (e.g. DBpedia)
• Disadvantages:
– Web servers need to be configured to properly handle URIs that
identify concepts instead of documents
– Not favored by search engines
• Lack of use cases
• Crawling needs to be changed
• Authority is difficult to determine
• Tools
– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
– RDB-to-RDF mappers (e.g. D2RQ, Triplify)
– Validators (Vapour)
– Linked Data browsers (many)
Linked Data community
• Community effort to (re)publish open datasets
as Linked Data
– In particular, scientific and government data
– see linkeddata.org and ckan.org for developer
information and datasets
Linked Data in practice
• Fetching data dumps
– See catalogs such as thedatahub.org, linkeddata.org
• Crawling Linked Data
– Similar to HTML crawling, but the crawler needs to
parse RDF/XML (and other formats) to extract the URIs to be crawled
– Semantic Sitemap/VoID descriptions
– Existing crawls
• Billion Triples Challenge (2009-2011) datasets
• LOD cache
• Querying SPARQL endpoints
– See catalogs such as thedatahub.org, linkeddata.org
– Semantic Sitemap/VoID descriptions
Datasets
• Broad coverage datasets are linking hubs
– DBpedia
– Freebase
– Starting in 2012: Wikidata
• Domain-specific datasets form clusters
– Biology
– Government
– Library
– Entertainment
– …
Wikipedia
DBpedia
[Figure panels: “Using the DBpedia ontology” and “Raw data”]
Metadata in HTML
• 1995: HTML meta tags
• 1998: RDF/XML
• 2003: Web 2.0
– Tagging
– Microformats
– Metadata in Wikipedia
– Machine tags in Flickr
• 2005: eRDF
• 2008: RDFa 1.0
• 2011: RDFa 1.1
• 2012: Microdata
HTML meta tags
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
Microformats (μf)
• Agreements on the way to encode certain kinds of metadata in HTML
– Reuse of semantic-bearing HTML elements
– Based on existing standards
– Minimality
• Microformats exist for a limited set of objects
– hCard (persons and organizations)
– hCalendar (events)
– hResume
– hProduct
– hRecipe
• Varying degrees of support and stability
– hCard and rel-tag are widely supported
• Community centered around microformats.org
– Specifications and discussions are hosted there
Microformats: limitations
• No shared syntax
– Each microformat has a separate syntax tailored to the
vocabulary
• No formal schemas
– Limited reuse, extensibility of schemas
– Unclear which combinations are allowed
• No datatypes
• No namespaces, unique identifiers (URIs)
– no interlinking
– mapping between instances is required
• Always appears in the HTML <body>
Example: the hCard microformat
<cite class="vcard">
<a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">
Eric Meyer</a> </cite> wrote a post (<cite>
<a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">
Tax Relief</a></cite>) about an unintentionally humorous letter he received from
the <span class="vcard"> <a class="fn org url" href="http://irs.gov/">
Internal Revenue Service</a>
</span>.
<div class="vcard">
<a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>
RDFa
• W3C standard for embedding RDF data in HTML documents
– A set of new HTML attributes to be used in head or body
– A specification of how to extract the data from these attributes
• RDFa is just a syntax, you have to choose a vocabulary separately
• RDFa 1.0 has been a W3C Recommendation since October 2008
– RDFa Primer
• RDFa 1.1 currently under standardization
– RDFa Core & RDFa Lite Working Draft as of January 31, 2012
– Updated version of the RDFa Primer
• RDFa API for accessing RDFa data in a webpage in the browser from
JavaScript
– Currently Working Draft (April 19, 2011)
RDFa 1.1
• Changes
– New vocab attribute to define the default
namespace for the document or subtree
– Syntax changes for ease of use
– RDFa Lite profile
• RDFa 1.1 is backward compatible with RDFa
1.0
– RDFa 1.1 is recommended if you want to use
HTML5
Microdata
• Currently under standardization at the W3C
– Working Draft (May 25, 2011)
• Microdata vs. RDFa
– Microdata is simpler to author
– Lacking some extension features such as co-typing
• HTML5 also has a number of “semantic”
elements such as <time>, <video>, <article>,
<section>…
Microdata example
<div itemscope itemid="http://www.yahoo.com/resource/person">
<p>My name is <span itemprop="name">Neil</span>.</p>
<p>My band is called
<span itemprop="band">Four Parts Water</span>.
I was born on
<time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
<img itemprop="image" src="me.png" alt="me">
</p>
</div>
Example: Facebook’s Like and the
Open Graph Protocol
• The ‘Like’ button provides publishers with a way to
promote their content on Facebook and build communities
– Shows up in profiles and news feed
– Site owners can later reach users who have liked an object
– Facebook Graph API allows 3rd party developers to access the
data
• Open Graph Protocol is an RDFa-based format for describing
the object that the user ‘Likes’
Example: Facebook’s Open Graph
Protocol
• RDF vocabulary to be used in conjunction with RDFa
– Simplify the work of developers by restricting the freedom in RDFa
• Activities, Businesses, Groups, Organizations, People, Places, Products and
Entertainment
• Only HTML <head> accepted
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
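Because Open Graph properties live in `<meta>` elements of the HTML head, they are easy to read back out. The sketch below uses Python's standard `html.parser`; the class name `OpenGraphParser` is our own, and the input is a shortened copy of the example above, not a fetched page.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]

html = """<html><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
</head></html>"""

parser = OpenGraphParser()
parser.feed(html)
og = parser.properties
```

A full RDFa processor would also resolve the `og:` prefix against the declared namespace; this sketch only matches the prefix as a string.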
Fragmentation of web markup
• Multiple schemas
– Yahoo!’s SearchMonkey – June, 2008
– Google announces Rich Snippets – June, 2009
• Faceted search for recipes – Feb, 2011
– Facebook’s Open Graph Protocol – April, 2010
• ‘Verbs’ added to OGP – September, 2010
– Bing tiles – Feb, 2011
• Different syntax
– Microformats, RDFa, microdata
Schema.org
• Agreement between Bing, Google, and Yahoo on
what markup webmasters should use
– Help adoption by reducing fragmentation
– Pre-competitive: each party will continue to build
competing products independently
• Schema.org covers areas of interest to all three
parties
– Business listings (local), creative works (video),
recipes, reviews
– Expected to open up also to external contributions for
non-core areas
Example: schema.org
Embedded metadata in practice
• 5-10% of webpages contain some explicit
metadata
– Statistics computed from commoncrawl.org give
different results
• Schema.org helped resolve fragmentation
– Except Facebook
• RDFa and microdata likely to co-exist
Non-web Data
Enterprise Data
• Unstructured
– Technical reports, Product Specification, etc.
• Semi-structured
– E-mail, Spreadsheets
• Structured
– Databases, Repositories
Enterprise Search
• Challenges
– Deal with data and format diversity
– Index/search diverse datasets
• Vertical vs Centralized systems
– Deal with security and access control
– Specific informational needs
• Expert Finding
• Writing an overview
Desktop Data
• Textual
– Unstructured
• Txt documents
– Semi-Structured
• E-mails, PDFs, Word files, etc. contain much metadata
• Multi-media
– Pictures, Videos, Audio
– Metadata
Desktop Search
• IR techniques over unstructured data
• Exploit
– the structure and metadata available
– user activity logs (browsing history, file access
patterns, etc.)
• Beagle++
– Hybrid search over inverted index and RDF store
– E-mail context and attachments
– Folder structure
– Browser cache
Enrico Minack, Raluca Paiu, Stefania Costache, Gianluca Demartini, Julien Gaugaz, Ekaterini Ioannou, Paul-Alexandru Chirita, Wolfgang Nejdl: Leveraging personal metadata for Desktop search: The Beagle++ system. J. Web Sem. 8(1): 37-54 (2010)
Tutorial Outline
• Morning
– Data (Peter)
– Data Management (Thanh)
• Afternoon
– Search and Ranking (Gianluca & Thanh)
– Evaluation (Arjen)
Data Management
Agenda
• Knowledge/Entity Extraction
• Entity Linking
• Entity De-duplication
• Entity Storage & Indexing
… very high-level overview of problems and
solutions!
… see tutorials on the specific problems!
Knowledge/Entity Extraction
Source: Tadej Stajner, Jozef
Stefan Institute, Ljubljana, Slovenia
Problem definition
• Knowledge extraction:
– Extracting information from data and
– Adding it to a knowledge base
Problem definition
• Information extraction + knowledge
acquisition
(textual) data → information extraction → extracted information → knowledge acquisition → knowledge base
Information extraction
• Since the advent of the WWW, there have been huge
quantities of unstructured textual data, for which
manual information extraction is infeasible
• How can information be extracted from text
automatically with human-comparable quality?
Information extraction: early solutions
• Match manually defined patterns against text
• Example:
– Patterns like “Pay ? from ? in favor of ?”
– ATRANS (1986) inter-banking message exchange
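A hand-written pattern of this kind maps naturally onto a regular expression. The sketch below is only an illustration of the idea, not a reconstruction of ATRANS; the slot names (`amount`, `payer`, `payee`) and the sample message are assumptions.

```python
import re

# The pattern "Pay ? from ? in favor of ?" as a regular expression.
# Slot names are invented for illustration.
PATTERN = re.compile(
    r"Pay (?P<amount>.+?) from (?P<payer>.+?) in favor of (?P<payee>.+)",
    re.IGNORECASE,
)

def extract(message):
    """Return the filled slots as a dict, or None if the pattern fails."""
    m = PATTERN.search(message)
    return m.groupdict() if m else None

result = extract("Pay 500 USD from Acme Corp in favor of John Smith")
```

The weakness is visible immediately: any message that paraphrases the template is missed, which motivates the learned approaches discussed later.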
Knowledge acquisition
• How to transform a world (or domain) model
from existing forms into a computer-friendly
form
– Conceptual knowledge (classes, rules, T-Box) vs.
– Instance information (instance data, resource
descriptions, data records, A-Box)
• Use cases for knowledge bases:
– Answering complex entity search queries /
questions in general:
• “which scientists are also politicians?”
Knowledge acquisition
• Constructing a knowledge base is expensive
– The Cyc KB was mostly manually constructed over
the last 20 years
• Coupling information extraction and
knowledge acquisition lets us construct a
knowledge base with little or no human effort
Challenges
• Human effort:
– Defining (domain-specific and domain-
independent) extraction patterns
– Especially, in case of bootstrapping approaches:
• Specifying relations
• Construction of training examples
– Maintaining knowledge base consistency
Related research areas
• Natural language processing
• Information extraction
• Machine learning
• Knowledge management
→ Knowledge extraction tools can be compared
from these perspectives
General knowledge extraction tools
• WebKB
• TextRunner
• Cyc
• SOFIE with the corresponding YAGO
knowledge base
• Read The Web
• EntityCube
Natural language processing
• Employed by most modern approaches
• Part-of-speech tagging
• Noun phrase chunking, used for entity
extraction
• Abstraction of text
– From: “Slovenia borders Italy”
– To: “noun – verb – noun”
Information extraction: entities
• Entity extraction / Named Entity Recognition
– “Slovenia borders Italy”
• Entity resolution
– “Apple released a new Mac”.
– From “Apple”, “Mac”
– To Apple_Inc., Macintosh_(computer)
• Entity classification
– Into a set of predefined categories of interest
– Person, location, organization, date/time, e-mail
address, phone number, etc.
– E.g. <“Slovenia”, type, Country>
Some NER tools
• Java
– Stanford Named Entity Recognizer
• http://nlp.stanford.edu/software/CRF-NER.shtml
– GATE
• http://gate.ac.uk/
– LingPipe
• http://alias-i.com/lingpipe/
• C
– SuperSense Tagger
• http://sourceforge.net/projects/supersensetag/
• Python
– NLTK
• http://www.nltk.org
NER – list lookup
• Entities stored in lists (gazetteers)
– E.g., Countries and cities
• Plus: Simple, fast, cross-language
• Minus: list update, name variants (UPF,
Universitat Pompeu Fabra), ambiguity
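List lookup can be sketched as a greedy longest-match over tokens. The gazetteer entries below are a toy list made up for illustration, and the whitespace tokenizer with punctuation stripping is a simplifying assumption.

```python
# Toy gazetteer keyed by lowercased token tuples; entries are illustrative.
GAZETTEER = {
    ("barcelona",): "City",
    ("spain",): "Country",
    ("universitat", "pompeu", "fabra"): "Organization",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup(text):
    """Greedy longest-match lookup over whitespace tokens."""
    tokens = [t.strip(".,") for t in text.split()]
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found

hits = lookup("UPF is the Universitat Pompeu Fabra in Barcelona, Spain.")
```

Note that the variant “UPF” is not found because it is not in the list, which is exactly the name-variant weakness mentioned above.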
List lookup – ambiguities
• Term level
– E.g. capitalized words: [All American Bank] vs. All
[State Police]
• Structure level
– “[Dolce and Gabbana]” vs “[Microsoft] and [Yahoo!]”
• Type level
– John Smith (organization vs. person)
– May (person vs. date vs. verb)
– Washington (person vs. location)
– 2015 (date vs. time)
→ Gazetteers are not a complete solution, but a source of
background knowledge
NER methods
• Rule Based
– Regular expressions, e.g. capitalized word + {street, boulevard, avenue} indicates
location
– Engineered vs. learned rules
• NER can be formulated as classification tasks
– NE extraction: assign word mentions to tags (B beginning of an entity, I continues
the entity, O word outside the entity)
– NE classification: assign entity mentions to categories (Person, Organization, etc.)
– Use ML methods for classification: Decision trees, SVM, AdaBoost
– Standard classification assumes cases are independent (i.i.d.)
• Probabilistic sequence models: HMM, CRF
– Each token in a sequence is assigned a label
– Labels of tokens are dependent on the labels of other tokens in the sequence,
particularly their neighbors (not i.i.d.).
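Whichever model assigns the B/I/O tags, the final step is to turn the tag sequence back into entity mentions. A minimal sketch of that decoding step follows; the tags in the example are assigned by hand, not predicted by a model.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B/I into entity mentions; O tokens are skipped."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Peter", "Mika", "works", "at", "Yahoo!", "Research", "Barcelona"]
tags = ["B", "I", "O", "O", "B", "I", "I"]
mentions = bio_to_spans(tokens, tags)
```

Sequence models such as HMMs and CRFs matter here because a tag like I is only meaningful relative to its left neighbor.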
Naïve Bayes Classification
• Determine the category of xk by computing, for each yi:
$$P(Y=y_i \mid X=x_k) = \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)}$$
• Priors P(Y=yi) and conditionals P(X=xk | Y=yi) are
estimated from data (via MLE)
– E.g. if ni of the examples in D are in yi, then P(Y=yi) = ni / |D|
• When categories are complete and disjoint, P(X=xk) follows from normalization:
$$\sum_{i=1}^{m} P(Y=y_i \mid X=x_k) = \sum_{i=1}^{m} \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)} = 1$$
$$P(X=x_k) = \sum_{i=1}^{m} P(Y=y_i)\,P(X=x_k \mid Y=y_i)$$
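The estimates can be checked on a toy example. The sketch below uses a single discrete feature and two categories; the data set `D` and the label names are invented for illustration. It estimates the priors P(Y=y) and conditionals P(X=x | Y=y) by counting, then applies Bayes' rule with P(X=x) obtained by summing over the categories.

```python
from collections import Counter

# Toy labelled data (x, y): one discrete feature and a category.
D = [("cap", "ENT"), ("cap", "ENT"), ("low", "ENT"),
     ("low", "O"), ("low", "O"), ("low", "O"), ("cap", "O"), ("low", "O")]

n_y = Counter(y for _, y in D)   # examples per category
n_xy = Counter(D)                # examples per (feature value, category)

def prior(y):
    """P(Y=y) = n_y / |D|."""
    return n_y[y] / len(D)

def cond(x, y):
    """P(X=x | Y=y), maximum-likelihood estimate."""
    return n_xy[(x, y)] / n_y[y]

def evidence(x):
    """P(X=x) = sum over categories of P(Y=y) P(X=x | Y=y)."""
    return sum(prior(y) * cond(x, y) for y in n_y)

def posterior(y, x):
    """Bayes' rule."""
    return prior(y) * cond(x, y) / evidence(x)

p_ent = posterior("ENT", "cap")
```

With these counts, P(ENT) = 3/8 and P(cap | ENT) = 2/3, so the posterior for a capitalized word is (3/8 · 2/3) / (3/8) = 2/3.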
Classification via Logistic Regression
• Instead of generative models, a discriminative model can be
used to specifically focus on the conditional distribution P(Y | X)
• Assumes a parametric form for directly estimating P(Y | X)
• Basically a linear model
$$P(Y=1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$
$$P(Y=0 \mid X) = 1 - P(Y=1 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}$$
Assign label Y = 0 iff
$$1 < \frac{P(Y=0 \mid X)}{P(Y=1 \mid X)}$$
$$1 < \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)$$
or equivalently
$$0 < w_0 + \sum_{i=1}^{n} w_i X_i$$
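A small sketch of the resulting decision rule, assuming the parameterization P(Y=1 | X) = 1 / (1 + exp(w0 + Σ wi Xi)), under which a positive linear score favors Y = 0. The weights below are chosen by hand for illustration and are not learned.

```python
import math

def score(x, w0, w):
    """Linear score w0 + sum_i w_i x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def p_y1(x, w0, w):
    """P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i x_i))."""
    return 1.0 / (1.0 + math.exp(score(x, w0, w)))

def predict(x, w0, w):
    """Assign Y=0 iff the linear score is positive, else Y=1."""
    return 0 if score(x, w0, w) > 0 else 1

# Hypothetical weights and input.
w0, w = -1.0, [2.0, -0.5]
x = [1.0, 1.0]            # score = -1 + 2 - 0.5 = 0.5 > 0
label = predict(x, w0, w)
```

Note that this sign convention is the mirror image of the more common sigmoid form, where a positive score favors Y = 1; the decision boundary is the same hyperplane either way.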
Classification
[Figure: graphical models over the label Y and features X1, X2, …, Xn: Naïve Bayes (generative) and, as its conditional counterpart, Logistic Regression (discriminative).]
Sequence Labeling
[Figure: graphical models over labels Y1, Y2, …, YT and observations X1, X2, …, XT: HMM (generative) and, as its conditional counterpart, the linear-chain CRF (discriminative).]
Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In NIPS, 2005.
NER features
• Gazetteers (background knowledge)
– location names, first names, surnames, company names
• Word
– Orthographic
• initial-caps, all-caps, all-digits, contains-hyphen, contains-dots,
roman-number, punctuation-mark, URL, acronym
– Word type
• Capitalized, quote, lowercased
– Part-of-speech tag
• NP, noun, nominal, VP, verb, adjective
• Context
– Text window: words, tags, predictions
– Trigger words
• Mr, Miss, Dr, PhD for person; city, street for location
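A few of these features can be computed with simple string tests. The function below is a sketch with our own feature names; it covers only some orthographic features and one trigger-word context feature, and its Roman-numeral test is deliberately permissive.

```python
import re

# Person trigger words taken from the list above, lowercased.
TRIGGERS = {"mr", "miss", "dr", "phd"}

def word_features(tokens, i):
    """Orthographic and context features for tokens[i]."""
    w = tokens[i]
    prev = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
    return {
        "initial_caps": w[:1].isupper(),
        "all_caps": w.isupper(),
        "all_digits": w.isdigit(),
        "contains_hyphen": "-" in w,
        "contains_dots": "." in w,
        "roman_number": bool(re.fullmatch(r"[IVXLCDM]+", w)),
        "prev_is_person_trigger": prev in TRIGGERS,
    }

tokens = ["Dr.", "Mika", "visited", "Fribourg", "in", "2012"]
features = word_features(tokens, 1)
```

Feature dictionaries of this shape are what a classifier or sequence model would consume, one per token.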
Exploiting Query Logs / Click-Through Data
• Weakly-supervised entity extraction from
queries / click-through data
– A small set of seed instances for each entity type
• Learn
– Patterns captured by LDA-based topic model:
Probabilities of query contexts and click websites of
named entities for each class
– Template: common query prefix and postfix, e.g.
“how did country gain independence”
• Apply patterns / templates to click-through data /
query logs to mine new named entities
Marius Pasca: Weakly-supervised discovery of named entities using web search queries. CIKM 2007:683-690
Gu Xu, Shuang-Hong Yang, Hang Li: Named entity mining from click-through data using weakly supervised
latent Dirichlet allocation. KDD 2009:1365-1374
Information extraction: relations
• Relation extraction
– <“Slovenia”, “borders”, “Italy”>
• Relation resolution
– <“Slovenia”, borders, “Italy”>
– <“Slovenia”, next_to, “Italy”>
• We distinguish between open and
bootstrapped approaches
Relation Extraction
• Extracting relations
– Typical paraphrase problem: identify all the ways a relation may be
expressed
• Formulated as classification task, e.g. uses SVM
– Training data: parse tree, with nodes associated with a type as well as a
role (e.g. role=member, role=affiliation to capture a person-affiliation
relation)
– Tree-based kernel: two trees are similar if roots have same type and role,
and each has a subsequence of children (not necessarily consecutive)
with the same types and roles
– Examples are converted into such parse trees with role labels, and used to
train the system
– SVM can then classify new examples of possible relations
• Formulate as sequence labeling (semantic role labeling)
• Joint inference: considers different types of features (syntactic,
semantic) and problems (extraction, resolution)
Bootstrapped information extraction
• Provide examples for relationships which we
want to extract
• Compromise: lower coverage, higher quality
• Example: SOFIE, ReadTheWeb
Open information extraction
• We do not want to put constraints on the
types of relationships we want to extract
• Very interesting for open-domain WWW
datasets
• Example: TextRunner
• Compromise: higher coverage, lower quality
• Hybrid approaches:
– EntityCube combines both bootstrapped and open
extraction
Knowledge management
• Organization
• Consistency management
• Strictness
Knowledge organization
• Lexicon: A set of entities and statements
• Ontology: A complex graph of formal concepts
– Not only concrete entities, but also abstract
classes
– SOFIE/YAGO, WebKB, ReadTheWeb, TextRunner
• Full world model: a context-sensitive complex
graph
– Cyc
Knowledge consistency
• Consistency management
– Not all extracted information is accurate
– Inaccurate information leads to inconsistencies in
the knowledge base
– Example:
• Having pattern “?x is mayor of ?y” and knowledge that
<x,mayorOf,y> requires <x,type,Person> and
<y,type,City>, we can filter out inconsistent
information
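The mayor example amounts to checking each extracted triple against a type signature for its predicate. A minimal sketch follows, with invented entities and a single hand-written constraint; systems such as SOFIE reason over weighted constraints rather than applying a hard filter like this.

```python
# Known entity types and one predicate signature; all data is illustrative.
types = {"alice": "Person", "springfield": "City", "acme": "Organization"}
signatures = {"mayorOf": ("Person", "City")}

def consistent(triple):
    """True if the triple respects the signature of its predicate."""
    s, p, o = triple
    if p not in signatures:
        return True  # no constraint known for this predicate
    domain, range_ = signatures[p]
    return types.get(s) == domain and types.get(o) == range_

extracted = [
    ("alice", "mayorOf", "springfield"),
    ("acme", "mayorOf", "springfield"),   # subject is not a Person
    ("alice", "mayorOf", "acme"),         # object is not a City
]
kept = [t for t in extracted if consistent(t)]
```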
Knowledge consistency
• Examples:
– SOFIE:
• Select the subset of statements which have the
maximum satisfiability with regard to constraints
– ReadTheWeb:
• Learns new constraints via semi-supervised bootstrap
learning
• Accuracy grows with ontology complexity
Knowledge management
• Bootstrapping
– Using existing manually prepared knowledge to
generate new knowledge
– While the knowledge base grows, the rules for
extraction also change
Knowledge management
• Strictness:
– When do we consider entity and relationship
resolution important?
• Depends on use case:
– Reasoning and data integration require well-formed
and unambiguous entities and relations
– Information retrieval can cope with relations
and entities that are not well-formed
Machine learning
• Used in NLP, IE as well as knowledge
acquisition
• Various approaches
– Self-supervised
– Semi-supervised
– Supervised
Machine learning
• Natural language processing
– Part-of-speech learning
• Information extraction
– Pattern learning
• ReadTheWeb, TextRunner, WebKB
• Knowledge acquisition
– Rule learning (WebKB)
– Constraint learning (ReadTheWeb)
Summary
• Cyc
– Full world model knowledge base
• WebKB
– First attempt at automatically constructing a
knowledge base
• TextRunner
– Open information extraction
Summary
• EntityCube
– Hybrid bootstrapped and open IE
• SOFIE/YAGO
– Tight integration of natural language processing,
disambiguation and acquisition
• Read The Web
– Constraint learning
Entity Linking
Source: Tadej Stajner, Jozef
Stefan Institute, Ljubljana, Slovenia
Basic situation
Pipeline
1. Identify named entity mentions in source
text using a named entity recognizer
2. Given the mentions, gather candidate KB
entities that have that mention as a label
3. Rank the KB entities
4. Select the best KB entity for each mention
Linking approaches - pair-wise linking
• Pair-wise linking: for each in-text entity,
choose the candidate entity which is the best
w.r.t. description similarity and textual
features
• Is each disambiguation choice independent?
– Pair-wise vs. collective disambiguation
Important ranking features
• Mention popularity – P(entity|mention)
– P(dbpedia:Kashmir_(song)|”Kashmir”) = 0.54
– P(dbpedia:Kashmir_(region)|”Kashmir”) = 0.91
– Distribution of links and anchors in Wikipedia
• Context similarity – sim(ctx(mention), ctx(entity))
– Context of a mention is the surrounding sentences
– Context of an entity is the description of the entity (Wiki article)
• Coherence / Collective
– Entities that appear together tend to be related to one another
– Usually solved by a greedy graph pruning algorithm
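The first two ranking features can be combined in a pair-wise linker. The sketch below is a minimal illustration, not the method of any system cited in the tutorial: the linear mixing weight `alpha`, the bag-of-words cosine, and the toy priors and descriptions are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def link(mention_context, candidates, alpha=0.3):
    """Pick the candidate with the best mix of popularity prior and context similarity.

    candidates maps entity id -> (prior P(entity|mention), entity description).
    alpha is a hypothetical mixing weight, not a value from the tutorial.
    """
    def score(entity):
        prior, description = candidates[entity]
        return alpha * prior + (1 - alpha) * cosine(mention_context, description)
    return max(candidates, key=score)


# Toy priors and descriptions, invented for illustration only.
candidates = {
    "dbpedia:Kashmir_(region)": (0.6, "region in the northern Indian subcontinent"),
    "dbpedia:Kashmir_(song)": (0.4, "song by Led Zeppelin from the album Physical Graffiti"),
}
context = "Led Zeppelin played Kashmir live on the tour"
best = link(context, candidates)                    # context outweighs the prior
prior_only = link(context, candidates, alpha=1.0)   # popularity alone
```

With `alpha=1.0` the linker falls back to mention popularity, which shows why context similarity is needed for less popular senses.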
Collective linking
• For each in-text entity, choose the candidate entity
which is the most similar to the in-text entity and
related to other entities that are already chosen.
Tadej Stajner, Dunja Mladenic: Entity Resolution in Texts Using Statistical Learning and
Ontologies. ASWC 2009:91-104
Xianpei Han, Le Sun, Jun Zhao: Collective entity linking in web text: a graph-based
method. SIGIR 2011:765-774
Relatedness
• Intuition: entities that co-occur in the same
context tend to be more related
• How can we express relatedness of two
entities in a numerical way?
– Statistical co-occurrence
– Similarity of entities’ descriptions
– Relationships in the ontology
Semantic relatedness
• If entities have an explicit assertion connecting
them (or have common neighbours), they tend
to be related
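[Figure: graph relating the mentions “Elvis” and “Memphis” to the candidate entities Elvis Presley, St. Elvis, Memphis (TN) and Memphis (Egypt), with “type” edges to Person and Location and an “origin” edge between candidates]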
Co-occurrence as relatedness
• If distinct entities occur together more often
than by chance, they tend to be related
[Figure: a document mentioning “FC Barcelona” and “Bayern”, with candidate entities FC Barcelona, Bavaria and Bayern München compared by mutual information]
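One common way to quantify "more often than by chance" is pointwise mutual information (PMI) over document-level co-occurrence counts. The sketch below assumes that reading of the slide; the document collection is invented, and the function name `pmi_scores` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return {(a, b): PMI} for every entity pair that co-occurs at least once.

    documents is a list of sets of entity ids; probabilities are estimated
    as document frequencies divided by the number of documents.
    """
    n = len(documents)
    single, joint = Counter(), Counter()
    for entities in documents:
        single.update(entities)
        joint.update(combinations(sorted(entities), 2))
    return {
        (a, b): math.log((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in joint.items()
    }


# Toy collection, invented for illustration only.
docs = [
    {"FC_Barcelona", "Bayern_München"},
    {"FC_Barcelona", "Bayern_München"},
    {"Bavaria"},
    {"FC_Barcelona", "Bavaria"},
]
scores = pmi_scores(docs)
```

A positive score means the pair co-occurs more often than independence would predict; here the football club pair scores above zero and the club/region pair below it.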
Content similarity as relatedness
• If distinct entities have higher similarity of
their descriptions, they tend to be related
[Figure: a document with in-text entities a and b and candidate entities x, y, z; edges are labelled with description similarities between 0.1 and 0.7]
Architecture
[Figure: architecture. Input text; preprocessing (entity extraction and consolidation); text with in-text entities; match retrieval against background knowledge (ontology); relatedness computed from entity description vectors, assertion type informativeness and entity co-occurrences; entity linking; text with resolved entities]
Crowdsourcing for Entity Linking
[Figure: ZenCrowd architecture. Input: HTML pages; output: HTML+RDFa pages. Components: entity extractors, LOD index over the LOD Open Data Cloud, algorithmic matchers, micro-task manager posting micro matching tasks to a crowdsourcing platform, and a decision engine with a probabilistic network that takes the workers' decisions]
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux.
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale
Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012), Lyon,
France, April 2012.
Crowdsourcing for Entity Linking
• Matching micro-task
– Unclear (i.e., low confidence) matches are
crowdsourced
– Top algorithmic results are presented to the
workers
– Answers from the crowd are input to a
probabilistic network
Crowdsourcing for Entity Linking
• Probabilistic Graph
– Worker prior probability (from previous tasks)
– Link prior probability (from algo matchers)
– Link factors connect worker clicks and links
– SameAs constraints
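– Dataset unicity constraints
[Figure: factor graph with worker nodes w1, w2 (priors pw1, pw2), link nodes l1, l2, l3 (priors pl1, pl2, pl3), click nodes c11 to c23 attached to the link factors lf1, lf2, lf3, a SameAs factor sa1-2 and a unicity factor u2-3]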
Entity De-duplication
“Entity Consolidation”
“Entity Resolution”
“Record Linkage”
“Instance Matching”
Sources: Yongtao Ma from Karlsruhe Institute of Technology, Samur Araujo from
The Delft Bioinformatics Lab and Aidan Hogan from Digital Enterprise Research Institute
Structure
• Motivation
• Problem and task overview
• Consider only explicit owl:sameAs
• Consider some lightweight reasoning
• Inductive / instance matching methods
– Effectiveness
– Efficiency
Motivation
[Figure: screenshot reporting 340,000 results]
Motivation
• 2% of customer records become obsolete within one month, e.g. due to deaths and name changes
• $611B/year loss in the US due to poor customer data
• An average company has 49 different databases and spends 35% of its IT dollars on integration efforts
Heterogeneity in naming…
Tim Berners-Lee: URIs
…
timbl:i
dblp:100007
identica:45563
adv:timbl
fb:en.tim_berners-lee
db:Tim-Berners_Lee
= owl:sameAs
De-duplication for Web data
Entity De-duplication
Problem and Task Overview
Data integration – big picture
• Ontology matching
– Widely studied in Semantic Web research, see e.g. list of publications
at ontologymatching.org
• Entity de-duplication
– Logic-based approaches in the Semantic Web
– Studied as record linkage in the database literature
– Machine learning based approaches, focusing on attributes
– Graph-based approaches (see e.g. the work of Lise Getoor) are applicable to RDF data
• Improvements over attribute-only matching
• Blending / data fusion
– Merging objects that represent the same real world entity and
reconciling information from multiple sources
– Information quality / redundancy
De-duplication
• The problem of determining whether two instances refer to the same real-world entity.
[Figure: source instances linked to target instances by owl:sameAs]
De-duplication – task overview
1. Find equivalences in the data
• Consider only explicit owl:sameAs (baseline)
• Consider some lightweight reasoning (extended)
• Inductive / instance matching methods
2. Rewrite data according to equivalences (data fusion)
Entity De-duplication
Consider only explicit owl:sameAs
De-duplication
• Use provided owl:sameAs mappings in the data
timbl:i owl:sameAs identica:45563 .
dbpedia:Berners-Lee owl:sameAs identica:45563 .
• Store “equivalences” found
timbl:i -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
identica:45563 -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
dbpedia:Berners-Lee -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
De-duplication
• For each set of equivalent identifiers, choose a canonical term
{timbl:i, identica:45563, dbpedia:Berners-Lee}
De-duplication
• Afterwards, rewrite identifiers to their canonical version:
Before:
timbl:i rdf:type foaf:Person .
identica:48404 foaf:knows identica:45563 .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .
After:
dbpedia:Berners-Lee rdf:type foaf:Person .
identica:48404 foaf:knows dbpedia:Berners-Lee .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .
Equivalence set: {timbl:i, identica:45563, dbpedia:Berners-Lee}
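The baseline (collect explicit owl:sameAs, pick a canonical term, rewrite) can be sketched with a union-find structure. This is a minimal illustration under assumptions: triples are plain string tuples, and the canonical term is simply the lexicographically smallest identifier in each set, a rule chosen here for the example and not prescribed by the slides.

```python
def find(parent, x):
    """Return the representative of x's equivalence set (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def canonical_map(same_as_pairs):
    """Map every identifier seen in owl:sameAs pairs to its canonical term."""
    parent = {}
    for a, b in same_as_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # Assumption: the lexicographically smallest identifier is canonical.
            low, high = sorted((root_a, root_b))
            parent[high] = low
    return {x: find(parent, x) for x in list(parent)}


def rewrite(triples, canon):
    """Replace subjects and objects by their canonical term where one exists."""
    return [(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples]


same_as = [("timbl:i", "identica:45563"), ("dbpedia:Berners-Lee", "identica:45563")]
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
canon = canonical_map(same_as)
fused = rewrite(data, canon)
```

Because owl:sameAs is symmetric and transitive, union-find yields the same equivalence sets regardless of the order in which the pairs are read.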
Entity De-duplication
Consider some lightweight reasoning
Extended de-duplication
• Infer owl:sameAs through reasoning (OWL 2 RL/RDF)
1. explicit owl:sameAs (again)
2. owl:InverseFunctionalProperty
3. owl:FunctionalProperty
4. owl:cardinality 1 / owl:maxCardinality 1
foaf:homepage a owl:InverseFunctionalProperty .
timbl:i foaf:homepage w3c:timblhomepage .
adv:timbl foaf:homepage w3c:timblhomepage .
⇒
timbl:i owl:sameAs adv:timbl .
…then apply data fusion as before
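The inverse-functional-property case can be sketched as a simple grouping step: subjects that share the same value for such a property are inferred to be the same. The code below is an illustrative simplification, assuming triples as string tuples and a caller-supplied set of inverse functional properties; it is not a full OWL 2 RL reasoner.

```python
from collections import defaultdict


def infer_same_as(triples, inverse_functional):
    """Infer owl:sameAs pairs from shared values of inverse functional properties."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_value[(p, o)].add(s)
    pairs = set()
    for subjects in subjects_by_value.values():
        ordered = sorted(subjects)
        # Link every other subject to the first one; transitivity covers the rest.
        pairs.update((ordered[0], other) for other in ordered[1:])
    return pairs


data = [
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
    ("timbl:i", "rdf:type", "foaf:Person"),
]
inferred = infer_same_as(data, inverse_functional={"foaf:homepage"})
```

The inferred pairs can then be fed into the same fusion step used for explicit owl:sameAs statements.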
Entity De-duplication
Inductive / Instance Matching
Methods
Agenda
• Problem overview
• Attribute level
– (see term matching)
• Instance level
– Effectiveness: learning
– Efficiency: blocking
• Dataset level
– (see collective entity linking)
Problem overview: effectiveness vs. efficiency
Instance matching
• Effectiveness: find correct matches!
• Efficiency: do it fast!
Efficiency
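• Comparing every source instance with every target instance costs O(N×M), where N and M are the numbers of source and target instances
• Not efficient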
“Diclofenac” occurrence on DBPEDIA
Problem overview – attribute level
<A1, ‘Dave White’, ‘Intel’, ‘Male’> <P1, ‘Database…’, ‘John Black’, ‘Don White’>
<A2, ‘Don White’, ‘CMU’, ‘Male’> <P2, ‘Multimedia…’, ‘Sue Grey’, ‘D. White’>
<A3, ‘Susan Grey’, ‘MIT’, ‘Female’> <P3, ‘Title3…’, ‘Dave White’>
<A4, ‘John Black’, ‘MIT’, ‘Male’> <P4, ‘Title5…’, ‘Don White’, ‘Joe Brown’>
<A5, ‘Joe Brown’, unknown, ‘Male’> <P5, ‘Title6…’, ‘Joe Brown’, ‘Liz Pink’>
<A6, ‘Liz Pink’, unknown, ‘Female’> <P6, ‘Title7…’, ‘Liz Pink’, ‘D. White’>
Attribute level
‘Don White’, ‘D. White’
‘Don White’, ‘Dave White’
• What (values?)
• How
• Similarity metrics
• Similarity threshold
• Matching techniques
Problem overview – instance level
Instance level
<A1, ‘Dave White’, ‘Intel’, ‘Male’>
<A2, ‘Don White’, ‘CMU’, ‘Male’>
• What (attributes?)
• How
• Similarity metrics
• Similarity threshold
• Matching techniques
Problem overview – dataset level
Dataset level
• What (instances?)
• How
• Similarity metrics
• Similarity threshold
• Matching techniques
Character-based
• [see term matching in Part 3 on search & ranking]
• Edit Distance [G98]
– Character operations: insert, delete, replace
– Given two strings s and t, edit(s,t) is the minimum cost of the operations transforming s into t
• Example: edit(Error, Eror) = 1, edit(great, grate) = 2
– Aimed at common typing errors
– Problem: does not work well with other types of errors
• Example: D. White vs. Dave White
• Jaro Rule
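Edit distance with unit costs for insert, delete and replace (Levenshtein distance) can be computed with the standard dynamic program. The sketch below assumes unit costs, which the slide does not state explicitly.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of unit-cost insertions,
    deletions and replacements that transform s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # replace, or match at no cost
            ))
        previous = current
    return previous[-1]


typo = edit_distance("great", "grate")              # 2
abbreviation = edit_distance("D. White", "Dave White")  # 3
```

The abbreviation case costs more than a typical typo even though both strings name the same person, which is the weakness the slide points out.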
Token-based
• Q-gram
– The q-grams of a string are its character substrings of length q
– Example: 3-gram(White) = { ‘Whi’, ‘hit’, ‘ite’ }
– Set similarity can then be applied to the q-grams
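As an illustration, q-grams can be extracted with a sliding window and compared with a set similarity. Jaccard is used below as one possible choice; the slide does not name a specific set similarity, so that choice is an assumption.

```python
def qgrams(s, q=3):
    """All character substrings of length q, in order of appearance (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def jaccard(s, t, q=3):
    """Jaccard similarity between the q-gram sets of s and t."""
    a, b = set(qgrams(s, q)), set(qgrams(t, q))
    return len(a & b) / len(a | b) if a | b else 1.0


grams = qgrams("White")                       # ['Whi', 'hit', 'ite']
similarity = jaccard("Don White", "Dave White")  # 4 shared of 11 distinct 3-grams
```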
Questions
• Given instance attributes {Name, Institute,
Gender, Publish}
– Which ones are more important?
– Which similarity measures should be adopted?
– What is the threshold of similarity that should be
adopted?
Bayes Decision Rule
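• Notation
– A, B are two tables with n comparable fields
– tuple pairs ⟨α, β⟩ with α ∈ A, β ∈ B
– classes: M (match) and U (non-match)
– random vector x = [x1, …, xn], where xi shows the level of agreement of the i-th field for ⟨α, β⟩
• Decision rule: assign ⟨α, β⟩ to M if p(M|x) ≥ p(U|x), otherwise to U
– Equivalently, assign to M if l(x) = p(x|M) / p(x|U) ≥ p(U) / p(M); l(x) is called the likelihood ratio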
Bayes Decision Rule
• Given training data, assume p(xi|M) and p(xj|M) are independent for i≠j [5]
• Extensions:
– Use an expectation maximization (EM) algorithm to estimate the likelihoods
– Relax the independence assumption
– Allow a decision with a reject class
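Under the independence assumption, the likelihood ratio factorizes into one term per field. The sketch below assumes binary agreement values (field agrees or not) and uses invented probabilities; the function names and numbers are hypothetical and only illustrate the shape of the rule.

```python
def likelihood_ratio(x, m_probs, u_probs):
    """l(x) = p(x|M) / p(x|U), with fields assumed independent within each class.

    x[i] is 1 if field i agrees and 0 otherwise; m_probs[i] = p(x_i = 1 | M)
    and u_probs[i] = p(x_i = 1 | U).
    """
    ratio = 1.0
    for xi, m, u in zip(x, m_probs, u_probs):
        ratio *= (m if xi else 1 - m) / (u if xi else 1 - u)
    return ratio


def decide(x, m_probs, u_probs, prior_match):
    """Assign M if the likelihood ratio reaches p(U) / p(M), otherwise U."""
    threshold = (1 - prior_match) / prior_match
    return "M" if likelihood_ratio(x, m_probs, u_probs) >= threshold else "U"


# Invented agreement probabilities for the fields (name, institute, gender).
m_probs = [0.9, 0.8, 0.95]
u_probs = [0.05, 0.1, 0.5]
all_agree = decide([1, 1, 1], m_probs, u_probs, prior_match=0.01)
only_gender = decide([0, 0, 1], m_probs, u_probs, prior_match=0.01)
```

Gender agreement alone barely moves the ratio because it is almost as likely among non-matches, which is one way to read "which attributes are more important".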
Blocking strategies
• Used to reduce the number of instance comparisons between source and target
• Non-overlapping partitions
• Canopies and clustering – overlapping partitions
Blocking strategies
• Attribute dependent: when the source and target schema match
• Attribute agnostic: otherwise
Attribute dependent
• Blocking Key Value (BKV)
– Sorted Neighborhood approach
– Q-grams blocking technique
• Blocking keys are highly
discriminating attributes (e.g. last
name, phone number)
• Targeting homogeneous datasets
[Figure: instances with attributes a, b, c, d, where a is the blocking key]
Sorted Neighborhood
• Motivation:
– similar records have similar values
– multiple “cheap” passes faster than an “expensive” one
• Goal: sort the records by a key so that matching records end up close to each other
• Methodology:
– Create a key for every record (e.g. first 3 characters of last name)
– Sort data by the key
– Pair-wise comparison within a small sliding window
– Multiple passes, each based on a distinct key
Sorted Neighborhood
• Example:
ID  Name         SS      Birthday    ZIP
r1  David Black  123-45  01.05.1985  76137
r2  Dauid Black  123-45  01.06.1985  76137
r3  David White  325-52  23.09.1984  84212
r4  David B.     126-53  30.10.1983  84123
r5  David B.     745-32  07.05.1973  84212
Sorted by ZIP[1..3]: r1, r2, r4, r3, r5
Sorted by Name[1..3]: r2, r1, r3, r4, r5
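The two passes of the example can be reproduced with a short sketch. The window size of 2 (each record is compared only with its immediate successor) is an assumption made to keep the output small; the field names and the helper `sorted_neighborhood` are hypothetical.

```python
def sorted_neighborhood(records, key, window=2):
    """Sort records by a blocking key and pair each one with the next window-1 records."""
    ordered = sorted(records, key=key)  # stable sort: ties keep input order
    pairs = set()
    for i, record in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.add(tuple(sorted((record["id"], other["id"]))))
    return [r["id"] for r in ordered], pairs


records = [
    {"id": "r1", "name": "David Black", "zip": "76137"},
    {"id": "r2", "name": "Dauid Black", "zip": "76137"},
    {"id": "r3", "name": "David White", "zip": "84212"},
    {"id": "r4", "name": "David B.", "zip": "84123"},
    {"id": "r5", "name": "David B.", "zip": "84212"},
]
zip_order, zip_pairs = sorted_neighborhood(records, key=lambda r: r["zip"][:3])
name_order, name_pairs = sorted_neighborhood(records, key=lambda r: r["name"][:3])
candidates = zip_pairs | name_pairs  # union over the passes
```

Here the two cheap passes propose 6 candidate pairs instead of the 10 pairs of an exhaustive comparison, and the likely duplicates r1 and r2 are adjacent in both orderings.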
Q-gram blocking
• Motivation: matching strings have a high overlap of q-grams
• Goal: relax the edit distance constraint to a weaker count constraint on the number of matching q-grams
• Methodology: given two strings s and t and an edit distance constraint k
– Count filtering: s and t must share at least LB(s,t) = max(|s|,|t|) − 1 − (k−1)·q q-grams
– Position filtering: s and t must share at least LB(s,t) positional q-grams
– Length filtering: | |s| − |t| | ≤ k
Q-gram blocking
• Example:
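3-grams (strings padded with # and $):
s = abaxabaaba: ##a, #ab, aba, bax, axa, xab, aba, baa, aab, aba, ba$, a$$
t = abaabaaba: ##a, #ab, aba, baa, aab, aba, baa, aab, aba, ba$, a$$
ED(s,t) ≤ k ⇒ |Q(s) ∩ Q(t)| ≥ max(|s|,|t|) − 1 − (k−1)·q
Here ED(s,t) = 1 and |Q(s) ∩ Q(t)| = 9, which meets the bound 10 − 1 − 0·3 = 9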
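The count and length filters can be checked on this example with a few lines. The sketch assumes q-grams are compared as multisets (repeated grams count separately) and that strings are padded with q−1 `#` and `$` characters, as the example suggests; the function names are hypothetical.

```python
from collections import Counter


def padded_qgrams(s, q=3):
    """q-grams of s after padding with q-1 '#' in front and q-1 '$' behind."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def shared_qgrams(s, t, q=3):
    """Size of the multiset intersection of the padded q-grams of s and t."""
    common = Counter(padded_qgrams(s, q)) & Counter(padded_qgrams(t, q))
    return sum(common.values())


def passes_filters(s, t, k, q=3):
    """Cheap necessary conditions for edit distance <= k (count and length filtering)."""
    lower_bound = max(len(s), len(t)) - 1 - (k - 1) * q
    return abs(len(s) - len(t)) <= k and shared_qgrams(s, t, q) >= lower_bound


s, t = "abaxabaaba", "abaabaaba"
shared = shared_qgrams(s, t)          # 9
candidate = passes_filters(s, t, k=1)  # True: the pair survives blocking
```

Pairs that fail either filter cannot be within edit distance k, so the expensive edit distance computation is only needed for the survivors.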
Attribute dependent
• Learning the attributes (blocking keys)
– Decision tree
– Maximum hyper-rectangles
[Figure: attributes of DrugBank and DBpedia records (Label, Drugname, Sideeffect, Page, Title, Name, Producer, Composition), from which the blocking keys are learned]
Attribute dependent
• Learning similarity functions (e.g., Jaccard, Jaro, Levenshtein, Hamming, Cosine, etc.)
[Figure: a DrugBank Label compared with a DBpedia Title, “Diclofenac” vs. “Diclofenac Sodium”: exact match (=) vs. approximate match (≈)]
Attribute agnostic
• Designed for heterogeneous information spaces (i.e., loose schema binding, noise, missing or inconsistent values, as well as an unprecedented level of heterogeneity)
• No knowledge about the schema
[Figure: instances grouped by a token used as blocking key (“Corp.”); other tokens shown: software, radio, film]
Attribute agnostic
• “All tokens”
• Reduce comparison space
– Block purging,
– Block scheduling,
– Block enumeration,
– Duplicate propagation,
– Comparisons propagation, and
– Comparisons pruning.
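Token blocking followed by block purging can be sketched as follows. The profiles, the whitespace tokenizer and the `max_size` cut-off are assumptions for illustration; the other listed techniques (scheduling, propagation, pruning) are not covered.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(profiles):
    """Every token of every attribute value becomes a blocking key ("all tokens")."""
    blocks = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # Blocks with a single entity produce no comparisons.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def purge(blocks, max_size):
    """Block purging: drop oversized blocks, which usually come from frequent tokens."""
    return {token: ids for token, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """All distinct pairs of entities that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Invented profiles with different schemas.
profiles = {
    "a1": {"label": "Acme Corp.", "industry": "software"},
    "b1": {"title": "ACME Corp.", "about": "radio"},
    "b2": {"title": "Globex Corp.", "about": "film"},
}
blocks = token_blocks(profiles)
pairs = candidate_pairs(purge(blocks, max_size=2))
```

No attribute names are consulted, so the approach works even when the two sources use entirely different schemas.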
Entity Storage & Indexing
Indexing
• Search requires matching and ranking
– Matching selects a subset of the elements to be
scored
• The goal of indexing is to speed up matching
– Retrieval needs to be performed in milliseconds
– Without an index, retrieval would require scanning
through the collection
• The type of index depends on the types of data
and queries to be supported
– DB-style indexing
– IR-style indexing
IR-style indexing
• Index data as text
– Create virtual documents from data
– One virtual document per subgraph, resource or
triple
• typically: resource
• Key differences to Text Retrieval
– RDF data is structured
– Minimally, queries on property values are required
Horizontal index structure
• Two fields (indices): one for terms, one for properties
• For each term, store the property on the same position
in the property index
– Positions are required even without phrase queries
• Query engine needs to support the alignment operator
• Dictionary size is the number of unique terms + the number of properties
Vertical index structure
• One field (index) per property
• Positions are not required
– But useful for phrase queries
• Query engine needs to support fields
• Dictionary size is the number of unique terms
• The number of fields could be a problem for merging and query performance
Indexing using MapReduce
• MapReduce is the perfect model for building
inverted indices
– Map creates (term, {doc1}) pairs
– Reduce collects all docs for the same term: (term, {doc1, doc2, …})
– Sub-indices are merged separately
• Term-partitioned indices
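The map and reduce roles can be simulated in a single process. The sketch below is not tied to any MapReduce framework: the in-memory `shuffled` dictionary stands in for the framework's shuffle step, and the tokenizer is a plain whitespace split (both assumptions).

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term of the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Collect all documents for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(documents):
    """Run map, group by term (the shuffle), then reduce each group."""
    shuffled = defaultdict(list)
    for doc_id, text in documents.items():
        for term, posting in map_phase(doc_id, text):
            shuffled[term].append(posting)
    return dict(reduce_phase(term, ids) for term, ids in shuffled.items())


docs = {"d1": "entity search on the web", "d2": "expert finding and entity ranking"}
index = build_index(docs)
```

Since each reducer handles a disjoint set of terms, the resulting sub-indices are naturally partitioned by term.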
Search and Ranking
Outline
• Expert Finding models
• Entity Ranking in Wikipedia
• Web Entity Retrieval
• Entity Search over Structured Data
• Relational Entity Search over Structured Data
From Documents to Entities
• Document Search
• Entity Search
1. Ent1
2. Ent2
3. Ent3
A taxonomy of Entity Search tasks
Expert Finding - Motivation
• Scenario
– In large companies competencies and
skills are spread
– Executives need to create a team for a new
project: find staff with the right expertise
– Someone needs to solve a problem
– Example: I need an expert on ontology
engineering
• Goal
– Use the digital content available in the
enterprise
– Create a ranking of people who are experts
in the given topic
Motivation for System Support
• Busy experts do not have time to maintain
adequate descriptions of their continuously
changing specialized skills
• Expert seekers have poorly articulated
requirements and are not fully enabled to
judge a good expert from a bad one
Complicating factors
• Volume of communication/publication is not a
reliable indication of expertise
• Certain topics engender more opinion than facts
• Lack of information about past performance of
experts
• New employees don’t know about informal social
networks
• Access to expertise is often controlled (informally or
formally, by the experts or their management)
• Solutions to complex problems require diverse
ranges of expertise
Evidence of Expertise
• Email or bulletin board messages
• Corporate communications
• Shared folders in file system
• Resumes and homepages
• Employee database
• Email flow
• Bibliographic information
• Software library usage
• Search and publication history
• Project time charges
See also bibliography on TREC-ENT wiki:
http://www.ins.cwi.nl/projects/trec-ent/wiki/index.php/Bibliography
[Side labels on the slide group the evidence sources above into: Content, Social networks, Activities]
Assumptions
• Content
– Experts are mentioned in relevant documents
– Experts author relevant documents
• Social networks
– People that interact are likely to share expertise
– Evidence in records of information exchange (and
co-authorship, co-work) and/or organizational
structure
Two Basic Approaches
Who should I ask about the copyright forms?
• Document-based: rank docs, extract experts
[Figure: documents retrieved for “Copyright forms” are ranked, and the people associated with them (Lori, Ellen, Ian) are extracted]
Document-based Expert Finding
• Find and score documents about the topic
– Title about topic
– Abstract about topic
• Aggregate scores for each distinct author
Two Basic Approaches
Who should I ask about the copyright forms?
• Document-based: rank docs, extract experts
• Candidate-based: rank candidate profiles
[Figure: document-based ranking of documents for “Copyright forms” next to candidate-based ranking of the profiles of Lori, Ellen and Ian]
Voting model
• Data fusion techniques
• Each ranked document represents a vote for
the expertise of a candidate
• Vote aggregation:
– Number of docs voting for each candidate
– Scores of retrieved documents
– Ranks of retrieved documents
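The three kinds of vote aggregation can be sketched over a ranked document list. The ranking, scores and candidate associations below are invented, and the three aggregates (vote count, score sum, reciprocal-rank sum) are simple representatives of each family, not a reproduction of the full set of techniques in the cited paper.

```python
from collections import defaultdict


def aggregate_votes(ranking):
    """ranking: list of (retrieval score, candidates associated with the document),
    best document first. Returns three {candidate: value} aggregates."""
    votes = defaultdict(int)          # number of docs voting for each candidate
    score_sum = defaultdict(float)    # sum of the scores of those docs
    recip_rank = defaultdict(float)   # sum of 1/rank of those docs
    for rank, (score, candidates) in enumerate(ranking, start=1):
        for candidate in candidates:
            votes[candidate] += 1
            score_sum[candidate] += score
            recip_rank[candidate] += 1.0 / rank
    return dict(votes), dict(score_sum), dict(recip_rank)


# Invented ranking for the query "copyright forms".
ranking = [
    (3.2, ["Lori"]),
    (2.9, ["Lori", "Ellen"]),
    (2.5, ["Ian"]),
    (1.1, ["Lori"]),
]
votes, score_sum, recip_rank = aggregate_votes(ranking)
expert = max(score_sum, key=score_sum.get)
```

Counting votes favours prolific candidates, whereas score- and rank-based aggregation reward candidates whose documents are retrieved near the top.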
Craig Macdonald, Iadh Ounis: Voting for candidates: adapting
data fusion techniques for an expert search task. CIKM 2006:
387-396
User-Oriented Model
• Additional real-world constraints
• Distance between user and expert
– User's previous knowledge of the topic
– Contact time (organizational hierarchy, geo
location, collaboration)
Elena Smirnova, Krisztian Balog: A User-Oriented Model for Expert Finding. ECIR 2011: 580-592
Additional Techniques
Research Systems
• Combine the two basic approaches
• Estimate the quality of the evidence
• Use of collection/structural knowledge
– Treat emails differently from documents
– Treat an email's subject/sender/receiver differently from its body
– Locate homepages
See also TREC proceedings 2005-2007
• Use social network extracted from co-
authorship or email lists
• Relevance propagation over expertise graph
• Use Web Search evidence
Expert Finding - References
– P@noptic Expert [Craswell et al. Ausweb01]
– Balog’s Model 1 and 2 [Balog et al. SIGIR06]
– Voting Model [Macdonald and Ounis CIKM06, ECIR07, ECIR08]
– Expertise evidence [Macdonald et al. ECIR08]
– Vector Space Model [Demartini et al. ECIR09]
– Web evidence [Serdyukov et al. TREC08]
Entity Ranking
Ranking…
• People
• Actors
• … Car companies
[i.e., insert your favourite entity type here]
Entity Ranking!!!
Wikipedia
• Encyclopedia
– multilingual, Web-based, free-content, openly-
editable: errors are promptly corrected
• Articles:
– balanced, neutral, and encyclopedic, containing
notable verifiable knowledge
• Categories / sub-categories
• Links, anchor text (Germany -> Albert Einstein)
Entities in Wikipedia
• Art museums
• Countries
• Actors, Singers
• Monarchs
• Artists
• Magicians
• ...
Example Entity Ranking Scenarios
• Impressionist art museums in Holland
• Countries with the Euro currency
• German car manufacturers
• Artists related to Pablo Picasso
• Countries involved in WWI
• Actors who played Hamlet
• English monarchs who married French women
Many examples on
http://www.ins.cwi.nl/projects/inex-xer/topics/
Entity Ranking
• Topical query Q
• Entity (result) type TX
• A list of entity instances Xs
• An entity is represented by its Wikipedia page
• Systems employ categories, structure, links
Tasks
• Entity Ranking (ER)
– Given Q and TX, provide Xs
• List Completion (LC)
– Given Q and Xs[1..m]
– Return Xs[m+1..N]
Topic 60
Title (Q)
olympic classes dinghy sailing
Entities (Xs)
470 (dinghy) (#816578)
49er (dinghy) (#1006535)
Europe (dinghy) (#855087)
Categories (TX)
dinghies (#30308)
Description
The user wants the dinghy classes that are or have been olympic classes, such as Europe and 470.
Narrative
The expected answers are the olympic dinghy classes, both historic and current. Examples include Europe and 470.
Formal Model for Entity Ranking
– Indexing
• Entities
• Data Sources
“Alexandre Pato”
ID: ap12dH5a
(born in; 1989)
(playing with; acm15hDJ)
Formal Model for Entity Ranking
• Searching
– Users' Information Need
– Entity Ranking System
Approaches to ES in Wikipedia
• Exploit and refine the category structure
– Wordnet to find entity types (e.g., a professor is a
person)
• Extend the query
– Synonyms and related words (Wordnet synsets)
• Exploit the link structure
– Links in Wikipedia are usually entities
– Search Keywords also in anchor text of outLinks
YAGO
– Suchanek et al. 2007
– Highly accurate ontology (>95%)
– Extracted from Wikipedia + WordNet
– Provides semantic concepts describing Wikipedia entities
[Figure: Wikipedia taxonomy example. The entity “Married... With Children” belongs to the Wikipedia category “Sitcoms”, which YAGO links via subClassOf to the WordNet synset “Situation Comedy”]
Category Based Search
• Query expansion by modifying category
information
– Subcategories
• Extracted from Wikipedia
– “Children” Categories
• Filtered using the YAGO subClassOf relation
– “Sibling” Categories
• Extracted from Wikipedia
• Having the same YAGO type
Subcategories
[Figure: the Wikipedia category “Sitcoms” with its Wikipedia subcategories “Latino Sitcoms”, “Sitcoms in Canada”, “BBC Television Sitcoms” and “Sitcom Characters”]
“Children” Categories
[Figure: the Wikipedia subcategories of “Sitcoms” (Latino Sitcoms, Sitcoms in Canada, BBC Television Sitcoms, Sitcom Characters) checked against the YAGO subClassOf relation, with the YAGO concepts “Situation Comedy” and “Fictional Character”]
“Sibling” Categories
[Figure: “Sibling” categories of “Sitcoms”: Wikipedia categories such as Latino Sitcoms, Sitcoms in Canada and BBC Television Sitcoms that are linked via YAGO subClassOf to the same concept, “Situation Comedy”]
Entity Search over Wikipedia
• Search for many different entity types with one
system!
• Main observations
– Link information is important
– Cleaning the category structure of Wikipedia is critical
(YAGO)
– NLP-based techniques on the user query improve
effectiveness
• Open issues
– No temporal evolution of content is considered
– Wikipedia is meant to be objective
Gianluca Demartini, Claudiu S. Firan, Tereza Iofciu, Ralf Krestel, and Wolfgang Nejdl. Why Finding Entities in
Wikipedia is Difficult, Sometimes. In: "Information Retrieval" 13(5): 534-567, Springer, October 2010.
Time-Aware Entity Retrieval
• In some cases the time dimension is available
– News collections
– Blog postings
• News stories evolve over time
– Entities appear/disappear
– Analyse and exploit relevance evolution
– Decide about relevance at document level
• An Entity Search system can exploit the past to
find relevant entities
Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, Hugo Zaragoza. TAER: Time
Aware Entity Retrieval. CIKM 2010, Toronto, Canada.
Time-Aware Entity Retrieval
• Evaluation
– P3, P5, AvgPrec
– Tie-aware measures [McSherry and Najork, ECIR08]
• Paired t-test
– ** p<0.01
– * p<0.05
• Entities judged “Related” are considered non-relevant
History Features
• We also tried
– Weight history features with doc length
– Weight history features with BM25
Feature      P3   P5     MAP
F(e,d)       .65  .56    .60
F(e,d1)      .58  .53    .56
F(e,d-1)     .64  .56    .62*
F(e,H)       .66  .59**  .66**
CoOcc(e,H)   .62  .57    .65**
DF(e,H)      .63  .57*   .65**
Dataset and Analysis
• TREC Novelty Track 2004
– 25 event topics
– 779 relevant news
• Entity annotations (7481 entities)
• Relevance judgements
• How useful is it to find relevant sentences?
– P(e is Rel) 0.411 [0.404-0.417]
– P(e is NotRel) 0.168 [0.163-0.173]
– P(e is Rel | s is Rel) 0.547 [0.534-0.559]
– Sentences:
• 21727 total, 1.46 entity occurrences per sentence
• 5122 relevant, 1.88 entity occurrences per sentence
Data Analysis
• How useful is looking at the past?
– P(e|d1) 0.893 [0.881-0.905]
– P(e|d-1) 0.701 [0.677-0.726]
• Is it useful to consider sentence co-occurrence?
P(e1,e2)      Relevant  Related  NotRelevant  NotAnEntity
Relevant      0.24      0.08     0.03         0.07
Related                 0.07     0.03         0.03
NotRelevant                      0.07         0.05
NotAnEntity                                   0.04
Approach
• Entity Ranking features for News articles
– Local Features
• F(e,d)
• FirstSenLen
• FirstSenPos
• Fsubj
• AvgBM25(q,s)
• SumBM25(q,s)
– History Features
• Feature combination
– Linear and Machine Learning
Local Features
Feature        P5    MAP
F(e,d)         .56   .60
FirstSenLen    .36   .45
FirstSenPos    .31   .43
Fsubj          .44   .50
AvgBM25(q,s)   .30   .41
SumBM25(q,s)   .44   .52

Feature        P5    MAP
All Tied       .34   .42
Is the past useful?
• Looking at previous documents
– Entity occurrences so far: F(e,H)
– Docs where the entity appeared so far: DF(e,H)
– Entity occurrences in the previous doc: F(e,d-1)
– Frequency of the entity the first time it appeared: F(e,d1)
– Number of other entities with which the entity co-occurred so far: CoOcc(e,H)
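The history features can be computed in one pass over the documents seen so far. The sketch below is illustrative only. It assumes each past document is already reduced to a list of sentences, each sentence being a list of entity names, and that co-occurrence is counted at sentence level. The function name `history_features` is hypothetical and not taken from the TAER paper.

```python
from collections import Counter


def history_features(entity, history):
    """Compute history features for `entity`.

    `history` is a chronologically ordered list of past documents; each
    document is a list of sentences, each sentence a list of entity names.
    This input representation is an assumption made for the sketch.
    """
    # Occurrences of the entity in each past document.
    per_doc = [Counter(e for sent in doc for e in sent)[entity] for doc in history]
    seen_in = [i for i, count in enumerate(per_doc) if count > 0]

    # Other entities that shared a sentence with the entity (assumed granularity).
    co_occurring = set()
    for doc in history:
        for sent in doc:
            if entity in sent:
                co_occurring.update(e for e in sent if e != entity)

    return {
        "F(e,H)": sum(per_doc),                               # occurrences so far
        "DF(e,H)": len(seen_in),                              # docs containing the entity
        "F(e,d-1)": per_doc[-1] if per_doc else 0,            # previous document
        "F(e,d1)": per_doc[seen_in[0]] if seen_in else 0,     # first doc it appeared in
        "CoOcc(e,H)": len(co_occurring),
    }
```

Each value can then be used as a ranking score on its own or as one input to a feature combination.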
History Features
• * t-test p value < 0.05 as compared with F(e,d)
• ** t-test p value < 0.01 as compared with F(e,d)
Feature      P5     MAP
F(e,d)       .56    .60
F(e,d1)      .53    .56
F(e,d-1)     .56    .62*
F(e,H)       .59**  .66**
CoOcc(e,H)   .57    .65**
DF(e,H)      .57*   .65**
Using the History
[Figure: AvgPrec (y-axis, 0 to 1) plotted against the i-th document, i.e., history size+1 (x-axis, 0 to 60)]
• Conclusion
– Evidence from past documents is very important
– Effectiveness should improve over time (run F(e,H))
Discussion
• New search task: Time-Aware Entity
Retrieval
• Constructed evaluation benchmark
• Experimental Evaluation
– Investigated some features and
combinations
– Information from the past helps most
– Obtain 15% improvement over F(e,d)
Ranking Entities on the Web
Ranking Entities on the Web
• TREC Entity Track 2009-2010
– 50M web pages (including Wikipedia)
– Find related entities (return homepages)
Ranking Entities on the Web
• Approaches
– Use Wikipedia (and infoboxes) as background info
– Extract entities from tables and lists
– Find the homepage given the entity name (see
ENS)
• Barack Obama -> www.barackobama.com
• Since 2010: 1 billion web pages
Related Entity Finding
• Approaches
– Kaptein et al., CIKM10
• Exploits Wikipedia to improve entity retrieval
effectiveness
• Identifies entity types
• Wikipedia external links as source for entity homepage
• Anchor text index for entity search
– Bron et al., CIKM10
• Entity co-occurrence
• Entity type filtering
• Context (relation type)
• Wikipedia-based homepage finding
Discussion
Expert Finding - Key Requirements
• Identify experts via self-nomination and/or
automated analysis of expert communications,
publications, and activities
• Classify the type and level of expertise of individuals
and communities
• Validate the breadth and depth of expertise of an
individual
• Recommend experts, including the ability to rank
order experts on multiple dimensions including skills,
experience, certification and reputation
Current systems
• Hardly validate the breadth and depth of
expertise
– Count mentions
– Weight with relevance score
– Sometimes weight with authority of document
containing candidate mention
• Do not really attempt to classify the type and
level of expertise
Evidence of Expertise
• Information about true expertise is often not
explicit in artifacts (as opposed to factual
knowledge)
• Information about expertise is expressed using
specialized terms and concepts
How to improve?
• Integrate more sources of evidence
– CV information
– Project related data
• Including temporal information
– Training data (HR dept)
• Cost of achieving this evidence for expert vs.
non-expert as weighting factor
– Participation in TREC, authoring a book, getting a
PhD in IR, ...
Raymond D'Amore, Expert Finding in
Disparate Environments, 2008
However...
• Two types of challenges to be overcome:
– System challenge
– Evaluation challenge
System Challenges
• Multi-lingual entity extraction
• Privacy management
– E.g., Tacit can email top N experts with private
profiles (only recipient knows)
• Interoperability with heterogeneous data
sources
– IMAP, Exchange, Lotus Notes
– LDAP, JDBC/ODBC, XML repositories, Peoplesoft,
Oracle Financials, Word/Excel/PDF, ...
Where is my data?
• > 80% of data not in relational databases
– Documents, spreadsheets, presentations
– Web pages
– Email, instant messages, news feeds
– Images, audio, video
Dataspaces
• The complete set of information belonging to
one organization or task
• Examples:
– Personal dataspace
– Enterprise dataspace
– Community dataspace
• E.g., scientific, sports club, ...
“From Databases to Dataspaces: A New Abstraction for Information Management”,
Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
Conclusions so far...
• Expert finding could in principle use many
more resources that indicate expertise,
possibly more reliably, but it is difficult to
set up the research
– System challenges
– Data availability
• Motivates research in operational setting
– E.g., Raymond D'Amore, Expert Finding in
Disparate Environments, 2008
Entity Search - Discussion
• Similar challenges as Expert Finding
• Entity information is spread over the Web
– In different formats (HTML, RDF, images)
– It is redundant (Wikipedia, DBpedia, homepage)
– It varies over time (e.g., population of a country)
– It is inconsistent (neutrino vs light speed)
Entity Search - References
• Approaches exploit
– Wikipedia structure (links, categories)
• Kaptein et al., CIKM10 (REF)
• Demartini et al., IRJ 2010 (XER)
– Entity relations
• Bron et al., CIKM10 (REF)
• Demartini et al., CIKM10 (TAER)
– Graph-based methods
• Rode et al., INEX08 (XER)
• Iofciu et al., ECIR11 (XER)
– Probabilistic Models
• Weerkamp et al., INEX08 (XER)
• Balog et al., ECIR10 (XER)
Entity Search - Discussion
• Structured data may be the way to improve
search effectiveness
– Entity identifiers
– Entity relations
Ad-hoc Object Retrieval
Introduction
• Unstructured or hybrid search over RDF data
– Supporting end-users
• Users who cannot express their need in SPARQL
– Dealing with large-scale data
• Giving up query expressivity for scale
– Dealing with heterogeneity
• Users who are unaware of the schema of the data
• No single schema to the data
– Example: 2.6m classes and 33k properties in Billion Triples 2009
• Entity search
– Queries where the user is looking for a single entity named
or described in the query
– e.g. kaz vaporizer, hospice of cincinnati, mst3000
Use cases in web search
Top-1 entity with
structured data
Related entities
Structured data
extracted from HTML
Architecture overview
[Diagram: documents (Doc) pass through parallel map and reduce tasks to produce the Index]
1. Download, uncompress, convert (if needed)
2. Sort quads by subject
3. Compute Minimal Perfect Hash (MPH)
4. Each mapper reads part of the collection
5. Each reducer builds an index for a subset of the vocabulary
6. Optionally, we also build an archive (forward-index)
7. The sub-indices are merged into a single index
8. Serving and Ranking
1st part of the talk: indexing; 2nd part: serving and ranking
Vertical index structure (reminder)
• One field (index) per property
• Positions are not required
• Query engine needs to support fields
✓ Dictionary is number of unique terms
✓ Occurrences is number of tokens
✗ Number of fields is a problem for merging, query performance
• In experiments we index the N most common properties
BM25F Ranking
BM25(F) uses a term-frequency (tf) that accounts for the decreasing
marginal contribution of terms
where
v_s is the weight of field s
tf_si is the frequency of term i in field s
B_s is the document length normalization factor:
l_s is the length of field s
avl_s is the average length of s
b_s is a tunable parameter
Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF
Data. International Semantic Web Conference 2011:83-97
BM25F ranking cont.
• Final term score is a combination of tf and idf
where
k_1 is a tunable parameter
w^IDF is the inverse document frequency:
• Finally, the score of a document D is the sum of
the scores of query terms q
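A minimal sketch of the scoring described on the two BM25F slides, assuming the commonly used BM25F formulation: a field-weighted term frequency normalised by B_s = (1 - b_s) + b_s * l_s / avl_s, the saturation tf / (k_1 + tf), and a Robertson-style IDF. The exact IDF variant, the function name `bm25f_score` and all parameter values are assumptions for illustration, not the presenters' implementation.

```python
import math


def bm25f_score(query_terms, doc_fields, field_weights, field_b,
                avg_field_len, doc_freq, num_docs, k1=1.2):
    """Score one document (entity) against a query with BM25F.

    doc_fields:    {field: list of tokens} for the document
    field_weights: {field: v_s}
    field_b:       {field: b_s}
    avg_field_len: {field: avl_s}
    doc_freq:      {term: number of documents containing the term}
    """
    score = 0.0
    for term in query_terms:
        # Field-weighted, length-normalised pseudo term frequency.
        tf = 0.0
        for field, tokens in doc_fields.items():
            tf_s = tokens.count(term)
            if tf_s == 0:
                continue
            b = field_b[field]
            norm = (1.0 - b) + b * len(tokens) / avg_field_len[field]
            tf += field_weights[field] * tf_s / norm
        if tf == 0.0:
            continue
        n = doc_freq.get(term, 0)
        # Assumed IDF variant; the "1 +" keeps the weight non-negative.
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Saturating tf component times idf, summed over query terms.
        score += tf / (k1 + tf) * idf
    return score
```

Raising the weight v_s of a field such as a label or title makes matches in that field count more, which is the main tuning knob when many RDF properties are indexed as separate fields.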
Hierarchical entity model
• Unstructured, structured and hierarchical entity model
• Hierarchical entity model
– Predicate type generation
– Predicate generation: importance of a predicate within its type
– Term generation: importance of a term is determined by the predicate
in which it occurs and all other predicates of that type in the entity
Robert Neumayer, Krisztian Balog, and Kjetil Nørvåg. On the modeling of entities
for ad-hoc entity search in the web of data. In ECIR'12.
Query Independent Ranking
• The question is not which answer is more
relevant; i.e. all answers are relevant
• The task is finding out which of the answers
should be ranked higher
• Importance is subjective
• Closely related to the popularity of a resource?
Lorand Dali, Blaz Fortuna, Thanh Tran Duc and Dunja Mladenic
Learning the Query-Independent Ranking of RDF Entity Search Results
In Proceedings of 9th Extended Semantic Web Conference (ESWC'12)
Towns from Andhra Pradesh
• Hyderabad
• Srisailam
• Chittoor
• Masulipatnam
• Chandavaram
• Mahbubnagar
• Gooty
• Vijaywada
• …
1. All answers are relevant
2. Ranking is important
3. Ranking is static
4. Hard to obtain the true ranking
Learning to Rank
• Machine learning approach to building a ranking
model
• We know the true ranking (gold standard)
• We represent each answer (resource) as a feature
vector
• The final score is a linear combination of the features,
and the weights have to be learned
True ranking: A, B, C
Pairwise preferences:
A better than B
A better than C
B better than C
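The step from a true ranking to pairwise preferences, and from those to learned linear weights, can be sketched as follows. This uses a simple perceptron-style update as a stand-in learner. It is an assumption chosen for brevity, not the learning algorithm used in the cited ESWC'12 work, and the function names are hypothetical.

```python
import itertools


def pairwise_preferences(true_ranking):
    """All (better, worse) pairs implied by a ranking listed best-first."""
    return list(itertools.combinations(true_ranking, 2))


def learn_weights(features, true_ranking, epochs=100, lr=0.1):
    """Learn linear weights so that preferred items score higher.

    features: {item: list of feature values}, all of equal length.
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    pairs = pairwise_preferences(true_ranking)
    for _ in range(epochs):
        updated = False
        for better, worse in pairs:
            diff = [a - b for a, b in zip(features[better], features[worse])]
            # A violated preference pushes the weights towards the difference vector.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
                updated = True
        if not updated:
            break  # every pairwise preference is satisfied
    return w


def rank(features, w):
    """Order items by the learned linear score, best first."""
    def score(item):
        return sum(wi * xi for wi, xi in zip(w, features[item]))
    return sorted(features, key=score, reverse=True)
```

With n answers the number of preference pairs grows as n(n-1)/2, so practical systems usually sample pairs rather than enumerate them all.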
Ranking Features
• Importance derived
– from Graph analysis
– from Wikipedia
– from Web search engine
– from other external sources (N-gram databases)
Graph Features
• Pagerank
• Hubs and Authorities
• RDF graph features
– nRSubj - number of relations where this resource
appears as the subject
– nRObj - number of relations where this resource
appears as the object
– nLiteral - number of relations where this resource
appears as the subject and the object is a literal
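The three RDF graph features are plain counts over the triples. A minimal sketch, assuming triples are given as 4-tuples whose last element flags a literal object (this tuple layout and the function name are assumptions for illustration):

```python
def rdf_graph_features(triples, resource):
    """Count simple RDF graph features for `resource`.

    triples: iterable of (subject, predicate, object, object_is_literal).
    """
    n_subj = n_obj = n_literal = 0
    for subject, _predicate, obj, is_literal in triples:
        if subject == resource:
            n_subj += 1
            if is_literal:
                n_literal += 1
        # A literal can never be the resource itself, even if the strings match.
        if obj == resource and not is_literal:
            n_obj += 1
    return {"nRSubj": n_subj, "nRObj": n_obj, "nLiteral": n_literal}
```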
Importance of Wikipedia Pages
• Popularity
– How many people visited a particular page during
June-January 2010
– Data obtained from the Wikipedia access logs
available at: http://dammit.lt/wikistats
– Captures importance from the point of view of users
• Page length
– How much text a Wikipedia page contains
– Importance from the authors’ perspective
• Number of edits
– Importance from the editors’ perspective
Features Approximating Importance Correlate Well
• Compare rankings based on page length and on number of edits
with the ranking based on page popularity
                   Spearman’s CC   NDCG
Page length        0.60            0.84
Number of edits    0.78            0.93
Web Search Features
• How many search results do we get in a web
search if we search for:
– The answer’s name
– Keywords from the answer’s description
• We used Yahoo! BOSS services to do the
search
• Querying the web for many resources is
expensive
N-gram features
• Similar to web search features
• We look at how many times the name of a
resource appears in a large N-gram database
(e.g. Google N-grams, Google Book N-grams,
etc.)
• A cheaper way to see how many times a
resource appears on the web or in books
Relational Entity Search
Introduction
• Intuitive keyword search interface over databases
• “A direction” of semantic search, which employs semantics of
– Relational information (structured data) in
– Different datasets
to produce complex structured, aggregated results to answer complex
information needs
• Short version of the Semantic Search tutorial at ESSIR’11
– Matching Techniques
– Ranking Techniques
• Complementary to DB keyword search tutorial, emphasizes
– The role of textual data: data graphs with textual content nodes
– Ranking
[Chen et al, SIGMOD09]
Relational Entity Search
Matching
Structure
• Keyword search: keywords over data graphs
– Term matching
– Content matching
– Structure matching
• Schema-based keyword search
• Schema-agnostic keyword search
– Online search algorithms
– Index-based approaches
Keyword search approaches
• Finding “substructures” matching keyword nodes
• Different result semantics for different types of data
– Textual data (Web pages connected via hyperlinks)
– DB (tuples connected via foreign keys)
– XML (elements/attributes via parent-child edges)
• Commonly used results: Steiner tree / subgraph
– Connect keyword matching elements
– Contain one keyword matching element for every query keyword
– Minimal substructures: closely connected keyword nodes
• Query is ambiguous, lacks explicit structure constraints
– NP-hard, thus efficiency of matching is a problem
– Large number of candidate matches, thus ranking is a problem
Keyword search on hybrid data graphs
[Figure: hybrid data graph. Node labels: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, 2009, 34, sunset.jpg, Beautiful Sunset, trouble with bob. Text node: “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”]
Keyword query: knows someone works at KIT apartment shared Berlin Alice
Example information need
“Information about a friend of Alice, who shared an apartment with her in
Berlin and knows someone working at KIT.”
Matching levels: term matching, content matching, structure matching
Term matching
• Distance-based (syntax)
– Levenshtein distance (edit distance)
– Hamming distance
– Jaro-Winkler distance
• Dictionary-based (semantics)
– Taxonomy
– Dictionary of similar words
– Translation memory
– Ontologies
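As an example of distance-based term matching, the sketch below computes the Levenshtein (edit) distance with the standard dynamic-programming recurrence and uses it to match a keyword against a vocabulary. The helper `fuzzy_match` and its `max_distance` threshold are hypothetical choices for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(keyword, vocabulary, max_distance=1):
    """Return vocabulary terms within `max_distance` edits of the keyword."""
    return [term for term in vocabulary
            if levenshtein(keyword.lower(), term.lower()) <= max_distance]
```

Dictionary-based matching would replace the distance function with a lookup in a thesaurus or ontology, at the cost of maintaining that resource.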
Content matching
• Retrieve partial matches
• Inverted list (inverted index)
k_i → {<d1, pos, score, ...>, <d2, pos, score, ...>, ...}
• Combine partial matches: union or join
[Figure: the partial matches for “shared”, “Berlin” and “Alice” in document D1 are combined into one match for the query “shared berlin alice”]
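A minimal sketch of content matching with an inverted list: each term maps to the documents (and positions) where it occurs, and partial matches are combined by join (all keywords) or union (any keyword). Whitespace tokenisation and the function names are simplifying assumptions, and scores are left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_all(index, keywords):
    """Join: documents that contain every keyword."""
    postings = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*postings) if postings else set()


def match_any(index, keywords):
    """Union: documents that contain at least one keyword."""
    return set().union(*(index.get(k.lower(), {}) for k in keywords))
```

For the running example, `match_all(index, ["shared", "Berlin", "Alice"])` keeps only documents in which all three partial matches meet.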
Structure matching
• Retrieve structured data given patterns (e.g. triple patterns)
• Index on tables
• Multiple “redundant” indexes to cover different access patterns
• Combine: union or join
• Blocking, e.g. linear merge join (requires sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes
Query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”
[Figure: the triple patterns “Per1 ns:works ?v” (SP-index) and “?v ns:name “KIT”” (PO-index) are matched to “Per1 ns:works Ins1” and “Ins1 ns:name KIT”, which are then joined]
Structure not explicitly given in query → exploration / other
kinds of join
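The non-blocking option can be illustrated with a symmetric hash join: both inputs are consumed in alternation, each arriving tuple is inserted into its own hash table and immediately probed against the other, so results are emitted before either input is exhausted. The sketch assumes triples are plain tuples (never `None`) and that the two inputs are already the matches of two triple patterns. The function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (left_tuple, right_tuple) pairs whose join keys are equal."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    # Alternate between the two inputs to mimic tuples arriving from both sides.
    for l, r in zip_longest(left, right):
        if l is not None:
            key = left_key(l)
            left_table[key].append(l)
            for match in right_table[key]:
                yield l, match
        if r is not None:
            key = right_key(r)
            right_table[key].append(r)
            for match in left_table[key]:
                yield match, r
```

A merge join would produce the same pairs but needs both inputs sorted on the join key first, which is why it counts as blocking.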
Matching in keyword search – schema-based
Alice Bob KIT
• Operate on schema graph
• Query interpretation
– Compute queries instead of results
– Query presentation
– Query processing by DB engine
• Leverage the power of underlying DB query engine
Result 1
Result 2
[Tran et al, ICDE09]
[Hristidis et al, VLDB02] [Agrawal et al, ICDE02]
[Qin et al, SIGMOD09]
Matching in keyword search – schema-agnostic
Alice Bob KIT
• Operate on data graph
– No schema needed
– Flexibly support different types of data e.g. hybrid data
graphs
– Native tailored optimization
• Online in-memory graph search
• Using materialized indexes
Result 1
Result 2
[He et al, SIGMOD07] [Li et al, SIGMOD08]
[Tran et al, CIKM11]
[Kacholia et al, VLDB05]
Online search – top-k exploration
• Compute Steiner tree with distinct roots
• Backward expansion strategy
• Run Dijkstra’s single-source-shortest-path algorithms
– Explore shortest keyword-root paths
– To find root (an answer)
– Until k answers are found
– Approximate: no top-k guarantee, i.e. further answers found later from
other expansion paths may have higher score
• Complete top-k: terminate safely when lower bound of top-k
candidate is higher than upper bound of what can be achieved with
remaining inputs
[Bhalotia et al, ICDE02]
Alice Bob KIT
Result 1
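A simplified sketch of backward expansion under distinct-root semantics: for each keyword, Dijkstra's algorithm is run backwards (over reversed edges) from the nodes matching that keyword, and any node reached from every keyword is a candidate root, scored by the sum of its shortest path costs. Unlike the interleaved strategy on the slide, this version runs each search to completion before ranking, so it trades early termination for simplicity. The names and the edge-list format are assumptions.

```python
import heapq


def backward_distances(reverse_adj, sources):
    """Multi-source Dijkstra over reversed edges.

    Returns {node: cost of the cheapest path from node to any source}.
    """
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for pred, weight in reverse_adj.get(node, []):
            nd = d + weight
            if nd < dist.get(pred, float("inf")):
                dist[pred] = nd
                heapq.heappush(heap, (nd, pred))
    return dist


def top_k_roots(edges, keyword_nodes, k=1):
    """edges: list of (source, target, weight).
    keyword_nodes: {keyword: set of nodes matching that keyword}.
    Returns up to k (root, total_cost) pairs, cheapest first.
    """
    reverse_adj = {}
    for src, dst, weight in edges:
        reverse_adj.setdefault(dst, []).append((src, weight))
    per_keyword = [backward_distances(reverse_adj, nodes)
                   for nodes in keyword_nodes.values()]
    if not per_keyword:
        return []
    # A root must reach at least one matching node for every keyword.
    candidates = set(per_keyword[0]).intersection(*per_keyword[1:])
    scored = sorted((sum(d[root] for d in per_keyword), root) for root in candidates)
    return [(root, cost) for cost, root in scored[:k]]
```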
Taxonomy of matching approaches
• Schema-based vs. schema-agnostic
• Online search
– Complete top-k
– Approximate top-k
– Backward expansion, bidirectional search, undirected
subgraph exploration, dynamic programming
• Indexing for retrieval + join for combine
– Path retrieval, then path join
– Graph retrieval, then graph pruning
– Graph retrieval, then neighborhood / graph join
(neighborhood indexed as a set of paths)
Relational Entity Search
Ranking
Structure
• Ranking paradigms
– Explicit model of relevance
– No notion of relevance
• Features
– Content-based
– Structure-based
– Structured-content-based
Ranking paradigms
• No explicit notion of relevance: similarity between the
query and the document model
– Vector space model (cosine similarity)
– Language models (KL divergence)
• Explicit relevance model
– Foundation: probability ranking principle
– Ranking results by the posterior probability (odds) of being
observed in the relevant class:
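Vector space model (cosine similarity):
$Sim(q,d) = \cos\big((w_{1,d}, \ldots, w_{t,d}), (w_{1,q}, \ldots, w_{k,q})\big)$
Language models (KL divergence):
$Sim(q,d) = KL(\theta_q \,\|\, \theta_d) = \sum_{t \in V} P(t|\theta_q) \log \frac{P(t|\theta_q)}{P(t|\theta_d)}$
Explicit relevance model:
$P(D|R) = \prod_{t \in D} P(t|R) \prod_{t \notin D} (1 - P(t|N))$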
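The two similarity-based paradigms can be sketched in a few lines. Term weights and language models are represented as plain dictionaries, which is an assumption for illustration. The KL-based score is negated so that, like the cosine, a higher value means a better match, and the document model is assumed to be smoothed so that it is non-zero wherever the query model is.

```python
import math


def cosine_similarity(q_weights, d_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(q_weights) | set(d_weights)
    dot = sum(q_weights.get(t, 0.0) * d_weights.get(t, 0.0) for t in terms)
    norm_q = math.sqrt(sum(v * v for v in q_weights.values()))
    norm_d = math.sqrt(sum(v * v for v in d_weights.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def negative_kl(q_model, d_model):
    """-KL(q || d): 0 for identical models, more negative as they diverge.

    d_model must assign non-zero probability to every term of q_model.
    """
    return -sum(p * math.log(p / d_model[t]) for t, p in q_model.items() if p > 0)
```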
Features
• Features are orthogonal to retrieval models
– Weights for query / document vectors?
– Language models for document / queries?
– Relevance models?
– What to use for learning to rank?
Features
Dealing with ambiguities
• Content features
– Co-occurrences
• Terms K that often co-occur form a contextual
interpretation, i.e. topics (cluster hypothesis, distributional
semantics)
• “Berlin” and “apartment” → geographic context → Berlin as city
– Frequencies: d is more likely to be “about” a query term k
when d mentions k more often (probabilistic IR)
• Structure features
– Structured-content-based: consider relevance at fine-
grained level of attributes
– Link-based popularity
– Proximity-based
Term
ambiguity
Content
ambiguity
Structure
ambiguity
Content-based features – frequency
• Document statistics, e.g.
– Term frequency
– Document length
• Collection statistics, e.g.
– Inverse document frequency
– Background language models
Smoothed document language model:
$P(t|\theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t|C)$
Term weight:
$w_{t,d} = \frac{tf}{|d|} \cdot idf$
• An object is more likely
about “Berlin”?
• When it contains a
relatively high number
of mentions of the
term “Berlin”
• When number of
mentions of term in
the overall collection is
relatively low
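Both frequency-based features combine a document statistic with a collection statistic. The sketch below assumes linear interpolation (Jelinek-Mercer style) for the smoothed language model and a logarithmic idf. The value of `lam`, the idf variant and the function names are assumptions for illustration.

```python
import math


def smoothed_term_prob(term, doc_tokens, collection_tokens, lam=0.8):
    """Interpolate the document estimate with the background collection model."""
    p_doc = doc_tokens.count(term) / len(doc_tokens) if doc_tokens else 0.0
    p_col = collection_tokens.count(term) / len(collection_tokens)
    return lam * p_doc + (1.0 - lam) * p_col


def tf_idf_weight(term, doc_tokens, docs):
    """Length-normalised term frequency times inverse document frequency."""
    df = sum(1 for d in docs if term in d)
    if df == 0 or not doc_tokens:
        return 0.0
    idf = math.log(len(docs) / df)
    return doc_tokens.count(term) / len(doc_tokens) * idf
```

The background model keeps the probability non-zero for documents that never mention the term, while idf lowers the weight of terms that are frequent across the whole collection.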
Structure-based features – links
• PageRank
– Link analysis algorithm
– Measuring relative importance of nodes
– Link counts as a vote of support
– The PageRank of a node recursively depends on the
number and PageRank of all nodes that link to it
(incoming links)
• ObjectRank
– Types and semantics of links vary in structured data
– Authority transfer schema graph specifies connection
strengths
– Recursively compute authority transfer data graph
• An object (about “Berlin”) is more important?
• When a relatively large number of objects are linked to it
[Hristidis et al, TDS08]
How to incorporate it
into a content-based
retrieval model?
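The recursive definition of PageRank can be made concrete with power iteration. This is a minimal sketch over an adjacency dictionary. The damping factor of 0.85, the fixed iteration count and the uniform handling of nodes without outgoing links are common choices assumed here, and ObjectRank's per-edge-type authority transfer is not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {node: list of nodes it links to}. Returns {node: rank}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Each outgoing link passes on an equal share of the node's rank.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```

One simple way to use the result in a content-based model is as a document prior that multiplies the query-dependent score.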
Structure-based features – proximity
• EASE, XRANK, BLINKS, etc.
• EASE
– Proximity between a pair of keywords
– Overall score of a JRT is aggregation on the score of keyword pairs
• XRANK
– Ranking of XML documents / elements
– Proximity of n is defined based on w, the smallest text window in n
that contains all search keywords
• A structured result (e.g. Steiner tree) is more relevant?
• When it is more compact s.t. elements are closely related
[Li et al, SIGMOD08]
[Guo et al, SIGMOD03]
adopted from: [Chen et al, SIGMOD09]
How to incorporate it
into a content-based
retrieval model?
Structured-content-based model
• Consider structure of objects during content-based
modeling, i.e., to obtain structured content-based
model
– Content-based model for structured objects, structured
documents, database tuples…
$P(t|\theta_d) = \sum_{f \in F_d} \alpha_f \, P(t|\theta_f)$
• An object is more likely about “Berlin”?
• When its (important) fields / attributes contain a relatively
high number of mentions of the term “Berlin”
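A minimal sketch of a structured content-based model: the probability of a term given an object is a weighted mixture of per-field estimates, so mentions in important fields count more. Maximum-likelihood field estimates without smoothing, strictly positive weights and the function names are assumptions made for brevity.

```python
def field_term_prob(term, tokens):
    """Maximum-likelihood estimate of P(t | field)."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def structured_term_prob(term, fields, field_weights):
    """P(t | d) as a weighted sum over the fields of d.

    fields:        {field name: list of tokens}
    field_weights: {field name: positive importance weight}
    Weights are renormalised over the fields present in the object.
    """
    total = sum(field_weights[f] for f in fields)
    return sum(field_weights[f] / total * field_term_prob(term, tokens)
               for f, tokens in fields.items())
```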
01 Apr 2012
ECIR 2012 Tutorial - From Expert Finding to
Entity Search on the Web
263
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web

Contenu connexe

Tendances

NHSPUG April 2017 - We Need to Talk: How to Converse with Regular People Abou...
NHSPUG April 2017 - We Need to Talk: How to Converse with Regular People Abou...NHSPUG April 2017 - We Need to Talk: How to Converse with Regular People Abou...
NHSPUG April 2017 - We Need to Talk: How to Converse with Regular People Abou...Jonathan Ralton
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 
Analysing & Improving Learning Resources Markup on the Web
Analysing & Improving Learning Resources Markup on the WebAnalysing & Improving Learning Resources Markup on the Web
Analysing & Improving Learning Resources Markup on the WebStefan Dietze
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Web Information Network Extraction and Analysis
Web Information Network Extraction and AnalysisWeb Information Network Extraction and Analysis
Web Information Network Extraction and AnalysisTim Weninger
 
NetIKX Semantic Search Presentation
NetIKX Semantic Search PresentationNetIKX Semantic Search Presentation
NetIKX Semantic Search Presentationurvics
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the RisePeter Mika
 
Linked Data for Czech Legislation
Linked Data for Czech LegislationLinked Data for Czech Legislation
Linked Data for Czech LegislationMartin Necasky
 
METADATA: A PRACTICE AND ITS SERVICES TOWARDS DIGITAL ENVIRONMENT
METADATA: A PRACTICE AND ITS SERVICES TOWARDS DIGITAL ENVIRONMENTMETADATA: A PRACTICE AND ITS SERVICES TOWARDS DIGITAL ENVIRONMENT
METADATA: A PRACTICE AND ITS SERVICES TOWARDS DIGITAL ENVIRONMENTVikas Bhushan
 
Bioschemas Workshop
Bioschemas WorkshopBioschemas Workshop
Bioschemas WorkshopNiall Beard
 

From Expert Finding to Entity Search on the Web

  • 1. From Expert Finding to Entity Search on the Web
      Full-day Tutorial at ECIR 2012, 1st April 2012
      Gianluca Demartini, Peter Mika, Thanh Tran, Arjen P. de Vries
      http://diuf.unifr.ch/main/xi/EntitySearchTutorial
  • 2. Presenters
      • Dr. Gianluca Demartini
          – eXascale Infolab, University of Fribourg, Switzerland
          – Research Interests:
              • Entity Search
              • IR Evaluation
              • Semantic Web
  • 3. Presenters
      • Dr. Peter Mika
          – Senior Researcher, Yahoo! Research, Barcelona
          – Semantic Search group at Yahoo! Barcelona
          – Semantic Search, Web Object Retrieval, Natural Language Processing
  • 4. Presenters
      • Dr. Thanh Tran
          – (Institut AIFB, Universität Karlsruhe, Germany)
          – Semantic Search group at AIFB
          – Semantic Search, Semantic Data Management, Linked Data Query Processing
  • 5. Presenters
      • Prof.dr.ir. Arjen P. de Vries
          – Interactive Information Access research group, Centrum Wiskunde & Informatica (CWI); Delft University of Technology; Spinque
          – Research interest: the intersection between information retrieval and databases
      Van Rijsbergen, 1979
  • 6. Entity
      • An entity is a “proper noun”, “something that is referred to”
      Outcome of “definition” discussion reported in SIGIR Workshop Report on The First International Workshop on Entity-Oriented Search (EOS), Krisztian Balog, Arjen P. de Vries, Pavel Serdyukov, Ji-Rong Wen, ACM SIGIR Forum, Vol. 45, No. 2, Dec. 2011, pp 43-50
  • 7. Entity Search
      • All search tasks that aim to retrieve an entity, rather than a document, as the answer to a user query
          – People, Countries, Movies, Restaurants, etc.
  • 8. Motivation
      • Information is entity-centric
      • Search for information is often conducted around entities (query log analysis)
          – Many queries (50%) search for specific entities instead of documents [Kumar&Tomkins09]
      • Traditional search retrieves a list of blue links
      • Novel web experiences may be designed around entities instead
  • 9. Here, for one search query “Nicole Kidman”, various entities make up the answer: bio, photos, movies, trivia, quotes (...)
  • 10. Entity-centric Applications
      • Enterprise applications
      • News portals
      • Movie portals
      • Product reviews
      • Search Engines
  • 11. Entities in SERP
  • 12. Entities in SERP
  • 13. Entity Search: The Pipeline
      • Entity Representation (DB/SW)
      • Entity Extraction (NLP)
      • Entity Linking and De-duplication (DB/SW)
      • Entity Storage and Indexing (DB/SW)
      • Entity Search and Ranking (IR)
      • Result presentation (HCI)
      • Evaluation (HCI/IR)
  • 14. Outline
      • FULL DAY Tutorial (sorry, this is not a joke :)
      • Morning
          – Data (Peter)
          – Data Management (Thanh)
      • Afternoon
          – Search and Ranking (Gianluca & Thanh)
          – Evaluation (Arjen)
  • 15. Morning
      • Data
          – Structured vs. Unstructured
          – Entity Profiles: data models, entity identifiers, standards
          – Datasets (Desktop, Enterprise, Wikipedia, Web, RDF)
      • Data Management
          – Entity Extraction
          – Entity de-duplication / data fusion
          – Entity storage & indexing
  • 16. Afternoon
      • Search and Ranking
          – Expert Finding models
          – Entity Ranking in Wikipedia
          – Web Entity Retrieval
          – Entity Search over Structured Data
          – Relational Entity Search over Structured Data
      • Evaluation
          – TREC Enterprise
          – INEX Entity Ranking
          – TREC Entity
          – SemSearch, Ad-hoc Object Retrieval
  • 18. Data
      • Web data
          – Information Extraction
          – Semantic Web
      • Non-web data
          – Enterprise data
          – Desktop data
          – Email
          – ...
  • 19. Data on the Web
      • Most pages on the Web are generated from structured data
          – Data is stored in relational databases (typically)
          – Queried through web forms
          – Presented as tables or simply as unstructured text
      • The structure and semantics (meaning) of the data are not directly accessible to search engines
      • Two solutions
          – Information Extraction [see Part 2]
          – Relying on publishers to use Semantic Web formats
              • Linked Data vs. metadata in HTML
  • 20. Semantic Web
      • Sharing structured data across the Web
          – Standard graph-based data model
              • RDF
          – A number of syntaxes (file formats)
              • RDF/XML, RDFa
          – Powerful, logic-based schema languages
              • OWL, RIF
          – Query languages and protocols
              • HTTP, SPARQL
  • 21. Resource Description Framework (RDF)
      • Each resource (thing, entity) is identified by a URI; otherwise it is a blank node
          – URIs are globally unique
      • Data is broken down into individual facts
          – Triples of (subject, predicate, object)
      • A set of triples is published together in an RDF document
      [Figure: an RDF document in which example:roi has the name “Roi Blanco” and the type foaf:Person]
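The triple model on this slide can be sketched in a few lines of plain Python. This is only an illustration, not how any RDF library stores data: the `Triple` type, the helper `facts_about`, and the prefixed names `foaf:name` and `rdf:type` are assumptions standing in for the slide's "name" and "type" edges.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style fact: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical prefixed names, chosen to mirror the slide's small example.
document = {
    Triple("example:roi", "foaf:name", '"Roi Blanco"'),
    Triple("example:roi", "rdf:type", "foaf:Person"),
}


def facts_about(triples, subject):
    """Return the sorted (predicate, object) pairs stated about one subject."""
    return sorted((t.predicate, t.obj) for t in triples if t.subject == subject)


roi_facts = facts_about(document, "example:roi")
for predicate, obj in roi_facts:
    print("example:roi", predicate, obj)
```

Treating a document as a set of triples makes merging two documents a plain set union, which is one reason the model suits data published by many independent sources.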
  • 22. An RDF graph
      [Figure: RDF graph with the nodes peter#123, peter#456, roi#234 and foaf:Person, and the literals “Peter Mika” and “roi@yahoo-inc.com”; edge labels: name, type, sameAs, worksWith, email]
  • 23. OWL, the Web Ontology Language
      • The schema language for the Semantic Web
          – Classes, properties and restrictions on their usage
          – Allows validation and inference
      • Schema is also data
          – Published just like any other RDF document
          – Queries can refer to both schema and data
              • e.g. taxonomy expansion: retrieve instances of a class and instances of all subclasses
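Taxonomy expansion can be illustrated with a small sketch in which schema and data sit in the same list of triples. The class and instance names are hypothetical, and `rdfs:subClassOf` and `rdf:type` are assumed as the predicates for subclassing and typing; a real system would normally delegate this to a reasoner or a query engine.

```python
from collections import defaultdict

SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical triples: the first three are schema, the rest are data.
triples = [
    ("ex:Actor", SUBCLASS_OF, "ex:Person"),
    ("ex:Director", SUBCLASS_OF, "ex:Person"),
    ("ex:StuntActor", SUBCLASS_OF, "ex:Actor"),
    ("ex:alice", TYPE, "ex:Person"),
    ("ex:bob", TYPE, "ex:Actor"),
    ("ex:carol", TYPE, "ex:StuntActor"),
    ("ex:dave", TYPE, "ex:Organization"),
]


def instances_with_subclasses(triples, cls):
    """Return instances of cls and of every direct or indirect subclass."""
    children = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            children[o].add(s)

    # Walk the subclass hierarchy downwards, starting from cls.
    classes, stack = {cls}, [cls]
    while stack:
        for child in children[stack.pop()]:
            if child not in classes:
                classes.add(child)
                stack.append(child)

    return sorted(s for s, p, o in triples if p == TYPE and o in classes)


people = instances_with_subclasses(triples, "ex:Person")
print(people)
```

Because the schema triples are read from the same list as the data, a newly published subclass statement changes the query result without any change to the code.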
  • 24. Publishing RDF and OWL
      • Linked Data
          – Data published as RDF documents linked to other RDF documents
              • Typically RDF/XML or Turtle
              • Keep an eye on JSON-LD
          – Community effort to (re-)publish open datasets
      • Embedded metadata
          – RDFa, microdata, microformats annotations inside webpages
          – Recommended for site owners by Yahoo, Google, Facebook
      • SPARQL endpoints
          – Triple stores (RDF databases) that can be queried through the web
  • 25. Linked Data
      • Interlinked datasets on the Web
          – Often data from existing databases or APIs
      • The four rules of Linked Data:
          – Use URIs to identify things.
          – Use HTTP URIs so that these things can be referred to and accessed by people and crawlers.
          – Use standard formats such as RDF to provide useful information about the thing when its URI is accessed.
          – Include links to other datasets.
              • Most importantly: links to entities in other datasets that describe the same entity
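Links between descriptions of the same entity can be used to group identifiers from different datasets. The sketch below assumes the links are available as pairs of URIs (for example from sameAs statements) and merges them with a union-find structure; all URIs and the function name `consolidate` are made up for illustration.

```python
def consolidate(same_as_links):
    """Group URIs into clusters that are connected by same-entity links."""
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root, then follow parent pointers.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical links between three descriptions of one person and two of another.
links = [
    ("http://a.example/peter#123", "http://b.example/peter#456"),
    ("http://b.example/peter#456", "http://c.example/pmika"),
    ("http://a.example/roi#234", "http://d.example/roi"),
]
clusters = consolidate(links)
for cluster in clusters:
    print(cluster)
```

This treats every link as trustworthy and transitive, which is a simplification: a single wrong link merges two unrelated entities, so real pipelines usually add checks before merging.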
  • 26. Linked Data
      [Figure: the RDF graph from slide 22 (nodes peter#123, peter#456, roi#234, foaf:Person; literals “Peter Mika”, “roi@yahoo-inc.com”; edges name, type, sameAs, worksWith, email), with regions labelled Peter’s homepage, Yahoo! and Friend-of-a-Friend ontology]
  • 27. Linked (Open) Data = LOD
      • Advantages:
          – No change to the publishing of the HTML documents
          – Data can be published by a third party (e.g. DBpedia)
      • Disadvantages:
          – Web servers need to be configured to properly handle URIs that identify concepts instead of documents
          – Not favored by search engines
              • Lack of use cases
              • Crawling needs to be changed
              • Authority is difficult to determine
      • Tools
          – Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
          – RDB-to-RDF mappers (e.g. D2RQ, Triplify)
          – Validators (Vapour)
          – Linked Data browsers (many)
  • 28. Linked Data community
      • Community effort to (re)publish open datasets as Linked Data
          – In particular, scientific and government data
          – See linkeddata.org and ckan.org for developer information and datasets
  • 29. Linked Data in practice
      • Fetching data dumps
          – See catalogs such as thedatahub.org, linkeddata.org
      • Crawling Linked Data
          – Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
          – Semantic Sitemap/VoID descriptions
          – Existing crawls
              • Billion Triples Challenge (2009-2011) datasets
              • LOD cache
      • Querying SPARQL endpoints
          – See catalogs such as thedatahub.org, linkeddata.org
          – Semantic Sitemap/VoID descriptions
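The URI-extraction step of such a crawler might look like the following sketch, which uses only Python's standard XML parser rather than an RDF library. The sample document and its URIs are invented, and the function name `extract_uris` is an assumption; it reads only absolute `rdf:about` and `rdf:resource` values and ignores relative URIs, `xml:base`, and other RDF serializations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML document, as a crawler might fetch it.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <foaf:Person rdf:about="http://example.org/people/peter">
    <foaf:name>Peter Mika</foaf:name>
    <foaf:knows rdf:resource="http://example.org/people/roi"/>
    <owl:sameAs rdf:resource="http://other.example.net/id/peter"/>
  </foaf:Person>
</rdf:RDF>
"""


def extract_uris(rdf_xml):
    """Collect absolute HTTP(S) URIs from rdf:about and rdf:resource attributes."""
    root = ET.fromstring(rdf_xml)
    found = set()
    for element in root.iter():
        for attr in ("about", "resource"):
            # ElementTree exposes namespaced attributes as "{namespace}local".
            value = element.get(f"{{{RDF_NS}}}{attr}")
            if value and value.startswith(("http://", "https://")):
                found.add(value)
    return sorted(found)


frontier = extract_uris(SAMPLE)
for uri in frontier:
    print(uri)
```

A crawler would add the returned URIs to its frontier and fetch them in turn, typically asking the server for an RDF representation.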
  • 30. Datasets
      • Broad coverage datasets are linking hubs
          – DBpedia
          – Freebase
          – Starting in 2012: Wikidata
      • Domain-specific datasets form clusters
          – Biology
          – Government
          – Library
          – Entertainment
          – …
  • 33. Metadata in HTML
      • 1995: HTML meta tags
      • 1998: RDF/XML
      • 2003: Web 2.0
          – Tagging
          – Microformats
          – Metadata in Wikipedia
          – Machine tags in Flickr
      • 2005: eRDF
      • 2008: RDFa 1.0
      • 2011: RDFa 1.1
      • 2012: Microdata
  • 34. HTML meta tags
      <HTML>
      <HEAD profile="http://dublincore.org/documents/dcq-html/">
        <META name="DC.author" content="Peter Mika">
        <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
        <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
      </HEAD>
      …
      </HTML>
  • 35. Microformats (μf)
      • Agreements on the way to encode certain kinds of metadata in HTML
          – Reuse of semantic-bearing HTML elements
          – Based on existing standards
          – Minimality
      • Microformats exist for a limited set of objects
          – hCard (persons and organizations)
          – hCalendar (events)
          – hResume
          – hProduct
          – hRecipe
      • Varying degrees of support and stability
          – hCard and rel-tag are widely supported
      • Community centered around microformats.org
          – Specifications and discussions are hosted there
  • 36. Microformats: limitations
      • No shared syntax
          – Each microformat has a separate syntax tailored to the vocabulary
      • No formal schemas
          – Limited reuse, extensibility of schemas
          – Unclear which combinations are allowed
      • No datatypes
      • No namespaces, unique identifiers (URIs)
          – no interlinking
          – mapping between instances is required
      • Always appears in the HTML <body>
  • 37. Example: the hCard microformat
      <cite class="vcard">
        <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
      </cite>
      wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
      about an unintentionally humorous letter he received from the
      <span class="vcard">
        <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
      </span>.

      <div class="vcard">
        <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
        <div class="tel">+1-919-555-7878</div>
        <div class="title">Area Administrator, Assistant</div>
      </div>
  • 38. RDFa
      • W3C standard for embedding RDF data in HTML documents
          – A set of new HTML attributes to be used in head or body
          – A specification of how to extract the data from these attributes
      • RDFa is just a syntax; you have to choose a vocabulary separately
      • RDFa 1.0 has been a W3C Recommendation since October 2008
          – RDFa Primer
      • RDFa 1.1 is currently under standardization
          – RDFa Core & RDFa Lite Working Draft as of January 31, 2012
          – Updated version of the RDFa Primer
      • RDFa API for accessing RDFa data in a webpage in the browser from JavaScript
          – Currently Working Draft (April 19, 2011)
  • 39. RDFa 1.1 • Changes – New vocab attribute to define the default namespace for the document or subtree – Syntax changes for ease of use – RDFa Lite profile • RDFa 1.1 is backward compatible with RDFa 1.0 – RDFa 1.1 is recommended if you want to use HTML5 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 39
  • 40. Microdata • Currently under standardization at the W3C – Working Draft (May 25, 2011) • Microdata vs. RDFa – Microdata is simpler to author – Lacking some extension features such as co-typing • HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>, <section>… 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 40
  • 41. Microdata example <div itemscope itemid="http://www.yahoo.com/resource/person"> <p>My name is <span itemprop="name">Neil</span>.</p> <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>. <img itemprop="image" src="me.png" alt="me"> </p> </div>
  • 42. Example: Facebook’s Like and the Open Graph Protocol • The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities – Shows up in profiles and news feed – Site owners can later reach users who have liked an object – Facebook Graph API allows 3rd party developers to access the data • Open Graph Protocol is an RDFa-based format that allows to describe the object that the user ‘Likes’ 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 42
  • 43. Example: Facebook’s Open Graph Protocol • RDF vocabulary to be used in conjunction with RDFa – Simplify the work of developers by restricting the freedom in RDFa • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment • Only HTML <head> accepted <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> … </head> ...
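To make the consumer side of this markup concrete, here is a minimal sketch of how a crawler might collect Open Graph properties from a page head. It uses only Python's standard `html.parser`; the class name `OpenGraphParser`, the helper `extract_open_graph` and the shortened sample page are assumptions made for illustration, not part of the Open Graph Protocol or of any Facebook tooling.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


def extract_open_graph(html):
    """Return a dict of og:* properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Shortened version of the slide's example page.
SAMPLE = """<html xmlns:og="http://opengraphprotocol.org/schema/"><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head></html>"""

og = extract_open_graph(SAMPLE)
print(og)
```

Because OGP only uses `<meta>` elements in the head, a plain tag scan like this is enough; general RDFa would need a full RDFa processor.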
  • 44. Fragmentation of web markup • Multiple schemas – Yahoo!’s SearchMonkey – June, 2008 – Google announces Rich Snippets – June, 2009 • Faceted search for recipes – Feb, 2011 – Facebook’s Open Graph Protocol – April, 2010 • ‘Verbs’ added to OGP – September, 2010 – Bing tiles – Feb, 2011 • Different syntax – Microformats, RDFa, microdata 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 44
  • 45. Schema.org • Agreement between Bing, Google, and Yahoo on what markup webmasters should use – Help adoption by reducing fragmentation – Pre-competitive: each party will continue to build competing products independently • Schema.org covers areas of interest to all three parties – Business listings (local), creative works (video), recipes, reviews – Expected to open up also to external contributions for non-core areas 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 45
  • 46. Example: schema.org 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 46
  • 47. Embedded metadata in practice • 5-10% of webpages contain some explicit metadata – Statistics computed from commoncrawl.org give different results • Schema.org helped resolve fragmentation – Except Facebook • RDFa and microdata likely to co-exist 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 47
  • 49. Enterprise Data • Unstructured – Technical reports, Product Specification, etc. • Semi-structured – E-mail, Spreadsheets • Structured – Databases, Repositories 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 49
  • 50. Enterprise Search • Challenges – Deal with data and format diversity – Index/search diverse datasets • Vertical vs Centralized systems – Deal with security and access control – Specific informational needs • Expert Finding • Writing an overview 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 50
  • 51. Desktop Data • Textual – Unstructured • Txt documents – Semi-Structured • E-mails, PDFs, Word files, etc. contain much metadata • Multi-media – Pictures, Videos, Audio – Metadata 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 51
  • 52. Desktop Search • IR techniques over unstructured data • Exploit – the structure and metadata available – user activity logs (browsing history, file access patterns, etc.) • Beagle++ – Hybrid search over inverted index and RDF store – E-mail context and attachments – Folder structure – Browser cache Enrico Minack, Raluca Paiu, Stefania Costache, Gianluca Demartini, Julien Gaugaz, Ekaterini Ioannou, Paul-Alexandru Chirita, Wolfgang Nejdl: Leveraging personal metadata for Desktop search: The Beagle++ system. J. Web Sem. 8(1): 37- 54 (2010) 01 Apr 2012
  • 53. Tutorial Outline • Morning – Data (Peter) – Data Management (Thanh) • Afternoon – Search and Ranking (Gianluca & Thanh) – Evaluation (Arjen) 01 Apr 2012 53 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web
  • 55. Agenda • Knowledge/Entity Extraction • Entity Linking • Entity De-duplication • Entity Storage & Indexing … very high-level overview of problems and solutions! … see tutorials on the specific problems! 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 55
  • 56. Knowledge/Entity Extraction Source: Tadej Stajner from Jozef Stefan Institute, Ljubljana, Slovenia
  • 57. Problem definition • Knowledge extraction: – Extracting information from data and – Adding it to a knowledge base 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 57
  • 58. Problem definition • Information extraction + knowledge acquisition • (textual) data → [information extraction] → extracted information → [knowledge acquisition] → knowledge base
  • 59. Information extraction • Since the advent of the WWW, there are huge quantities of unstructured textual data, for which manual information extraction would be infeasible • How can information be extracted from text automatically with human-comparable quality?
  • 60. Information extraction: early solutions • Match manually defined patterns against text • Example: – Patterns like “Pay ? from ? in favor of ?” – ATRANS (1986) inter-banking message exchange 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 60
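As a rough illustration of pattern-based extraction, the sketch below turns a template with `?` slots, like the one on this slide, into a regular expression and applies it to a sentence. The function name `compile_pattern` and the sample message are invented for this example; this is not how ATRANS was implemented.

```python
import re


def compile_pattern(template):
    """Turn a template with '?' slots into a regex with one capture group per slot."""
    pieces = [re.escape(piece) for piece in template.split("?")]
    return re.compile("(.+?)".join(pieces), re.IGNORECASE)


PAYMENT = compile_pattern("Pay ? from ? in favor of ?")

# Hypothetical message; amounts and names are made up.
message = "Pay 500 USD from Acme Bank in favor of John Smith"
match = PAYMENT.fullmatch(message)
slots = match.groups() if match else None
print(slots)
```

The weakness mentioned in the following slides is visible here: every new phrasing of the same fact needs another hand-written template.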
  • 61. Knowledge acquisition • How to transform a world (or domain) model from existing forms into a computer-friendly form – Conceptual knowledge (classes, rules, T-Box) VS. – Instance information (instance data, resource descriptions, data records, A-Box) • Use cases for knowledge bases: – Answering complex entity search queries / questions in general: • “which scientists are also politicians?” 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 61
  • 62. Knowledge acquisition • Constructing a knowledge base is expensive – The Cyc KB was mostly manually constructed over the last 20 years • Coupling information extraction and knowledge acquisition lets us construct a knowledge base with little or no human effort
  • 63. Challenges • Human effort: – Defining (domain-specific and domain- independent) extraction patterns – Especially, in case of bootstrapping approaches: • Specifying relations • Construction of training examples – Maintaining knowledge base consistency 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 63
  • 64. Related research areas • Natural language processing • Information extraction • Machine learning • Knowledge management  Knowledge extraction tools can be compared by these perspectives 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 64
  • 65. General knowledge extraction tools • WebKB • TextRunner • Cyc • SOFIE with the corresponding YAGO knowledge base • Read The Web • EntityCube 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 65
  • 66. Natural language processing • Employed by most modern approaches • Part-of-speech tagging • Noun phrase chunking, used for entity extraction • Abstraction of text – From: “Slovenia borders Italy” – To:“noun – verb – noun” 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 66
  • 67. Information extraction: entities • Entity extraction / Named Entity Recognition – “Slovenia borders Italy” • Entity resolution – “Apple released a new Mac”. – From “Apple”, “Mac” – To Apple_Inc., Macintosh_(computer) • Entity classification – Into a set of predefined categories of interest – Person, location, organization, date/time, e-mail address, phone number, etc. – E.g. <“Slovenia”, type, Country> 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 67
  • 68. Some NER tools • Java – Stanford Named Entity Recognizer • http://nlp.stanford.edu/software/CRF-NER.shtml – GATE • http://gate.ac.uk/ – LingPipe • http://alias-i.com/lingpipe/ • C – SuperSense Tagger • http://sourceforge.net/projects/supersensetag/ • Python – NLTK • http://www.nltk.org 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 68
  • 69. NER – list lookup • Entities stored in lists (gazetteers) – E.g., Countries and cities • Plus: Simple, fast, cross-language • Minus: list update, name variants (UPF, Universitat Pompeu Fabra), ambiguity 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 69
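A minimal sketch of list lookup, assuming a tiny hand-made gazetteer and whitespace tokenization (both are simplifications chosen for this example). It scans left to right and prefers the longest entry that matches at each position.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup; returns (start, end, type) token spans."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in gazetteer:
                match = (i, i + n, gazetteer[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Toy gazetteer; name variants (UPF) must be listed explicitly.
GAZETTEER = {
    "slovenia": "Country",
    "italy": "Country",
    "universitat pompeu fabra": "Organization",
    "upf": "Organization",
}

spans = gazetteer_lookup("Slovenia borders Italy".split(), GAZETTEER)
print(spans)
```

The lookup is fast and language-independent, but it has no way to decide between readings of an ambiguous string, which is the limitation the next slide discusses.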
  • 70. List lookup – ambiguities • Term level – E.g. capitalized words: [All American Bank] vs. All [State Police] • Structure level – “[Dolce and Gabbana]” vs “[Microsoft] and [Yahoo!]” • Type level – John Smith (organization vs. person) – May (person vs. date vs. verb) – Washington (person vs. location) – 2015 (date vs. time)  Gazetteers not end solution but sources of background knowledge 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 70
  • 71. NER methods • Rule Based – Regular expressions, e.g. capitalized word + {street, boulevard, avenue} indicates location – Engineered vs. learned rules • NER can be formulated as classification tasks – NE extraction: assign word mentions to tags (B beginning of an entity, I continues the entity, O word outside the entity) – NE classification: assign entity mentions to categories (Person, Organization, etc.) – Use ML methods for classification: Decision trees, SVM, AdaBoost – Standard classification assumes cases are disconnected (i.i.d) • Probabilistic sequence models: HMM, CRF – Each token in a sequence is assigned a label – Labels of tokens are dependent on the labels of other tokens in the sequence particularly their neighbors (not i.i.d). 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 71
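The B/I/O encoding above can be illustrated with a small decoder that turns a tag sequence back into entity mentions. The tags would normally come from a trained classifier or sequence model; here they are written by hand, and the function name `bio_to_mentions` is an assumption for this sketch.

```python
def bio_to_mentions(tokens, tags):
    """Group tokens into entity mentions according to B/I/O tags."""
    mentions = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; close any entity still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or an "I" with no open entity, is treated as outside.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["Universitat", "Pompeu", "Fabra", "is", "in", "Barcelona"]
tags = ["B", "I", "I", "O", "O", "B"]
mentions = bio_to_mentions(tokens, tags)
print(mentions)
```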
  • 72. Naïve Bayes Classification • Determine the category of xk by computing, for each yi: $P(Y=y_i \mid X=x_k) = \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)}$ • Priors P(Y=yi) and conditionals P(X=xk | Y=yi) are estimated from data (via MLE) – E.g., if ni of the examples in D are in yi, then P(Y=yi) = ni / |D| • When categories are complete and disjoint, P(X=xk) follows from normalization: $\sum_{i=1}^{m} P(Y=y_i \mid X=x_k) = \sum_{i=1}^{m} \frac{P(Y=y_i)\,P(X=x_k \mid Y=y_i)}{P(X=x_k)} = 1$, hence $P(X=x_k) = \sum_{i=1}^{m} P(Y=y_i)\,P(X=x_k \mid Y=y_i)$
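The posterior computation on this slide can be sketched for a single categorical feature as follows. The training pairs are invented, and the feature values `capitalized` / `lowercase` and class names `entity` / `other` are assumptions for illustration only.

```python
from collections import Counter


def train(examples):
    """MLE estimates from (x, y) pairs: priors P(Y=y) and conditionals P(X=x | Y=y)."""
    class_counts = Counter(y for _, y in examples)
    pair_counts = Counter(examples)
    total = len(examples)
    priors = {y: n / total for y, n in class_counts.items()}
    conditionals = {(x, y): n / class_counts[y] for (x, y), n in pair_counts.items()}
    return priors, conditionals


def posterior(x, priors, conditionals):
    """P(Y=y | X=x) for every class y, normalizing by P(X=x) = sum of the joints."""
    joint = {y: priors[y] * conditionals.get((x, y), 0.0) for y in priors}
    evidence = sum(joint.values())
    return {y: (p / evidence if evidence else 0.0) for y, p in joint.items()}


# Made-up training set: 4 tokens that are entities, 6 that are not.
D = ([("capitalized", "entity")] * 3 + [("lowercase", "entity")] * 1
     + [("capitalized", "other")] * 2 + [("lowercase", "other")] * 4)

priors, conditionals = train(D)
post = posterior("capitalized", priors, conditionals)
print(post)
```

With several features, the naïve independence assumption would replace the single conditional by a product of per-feature conditionals.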
  • 73. Classification via Logistic Regression • Instead of generative models, a discriminative model can be used to focus specifically on the conditional distribution P(Y | X) • Assumes a parametric form for directly estimating P(Y | X): $P(Y=1 \mid X) = \frac{1}{1+\exp\left(w_0+\sum_{i=1}^{n} w_i X_i\right)}$ and $P(Y=0 \mid X) = 1 - P(Y=1 \mid X) = \frac{\exp\left(w_0+\sum_{i=1}^{n} w_i X_i\right)}{1+\exp\left(w_0+\sum_{i=1}^{n} w_i X_i\right)}$ • Assign label Y=0 iff $1 < \frac{P(Y=0 \mid X)}{P(Y=1 \mid X)} = \exp\left(w_0+\sum_{i=1}^{n} w_i X_i\right)$, or equivalently iff $0 < w_0+\sum_{i=1}^{n} w_i X_i$ • Basically a linear model
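The linear decision rule can be checked numerically with a few lines of code. The sketch follows the sign convention of this slide, in which a positive linear score favors Y=0; the weights are arbitrary numbers chosen for the example, not learned parameters.

```python
import math


def linear_score(x, w0, w):
    """w0 + sum_i w_i * x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def p_y1(x, w0, w):
    """P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i x_i)), as parameterized on the slide."""
    return 1.0 / (1.0 + math.exp(linear_score(x, w0, w)))


def predict(x, w0, w):
    """Assign label 0 iff the linear score is positive, otherwise label 1."""
    return 0 if linear_score(x, w0, w) > 0 else 1


# Arbitrary illustrative weights.
w0, w = -1.0, [2.0, 0.5]
label_a = predict([1, 0], w0, w)   # score = 1.0
label_b = predict([0, 0], w0, w)   # score = -1.0
print(label_a, p_y1([1, 0], w0, w), label_b, p_y1([0, 0], w0, w))
```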
  • 74. Classification Y X1 X2 … Xn Y X1 X2 … Xn Naïve Bayes Logistic Regression Conditional Generative Discriminative 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 74
  • 75. Sequence Labeling Y2 X1 X2 … XT HMM Linear-chain CRF Conditional Generative Discriminative Y1 YT .. Y2 X1 X2 … XT Y1 YT .. 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 75 Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In NIPS, 2005.
  • 76. NER features • Gazetteers (background knowledge) – location names, first names, surnames, company names • Word – Orthographic • initial-caps, all-caps, all-digits, contains-hyphen, contains-dots, roman-number, punctuation-mark, URL, acronym – Word type • Capitalized, quote, lowercased, capitalized – Part-of-speech tag • NP, noun, nominal, VP, verb, adjective • Context – Text window: words, tags, predictions – Trigger words • Mr, Miss, Dr, PhD for person and city, street for location 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 76
  • 77. Exploiting Query Logs / Click-Through Data • Weakly-supervised entity Extraction from queries / click-through data – A small set of seed instances for each entity type • Learn – Patterns captured by LDA-based topic model: Probabilities of query contexts and click websites of named entities for each class – Template: common query prefix and postfix, e.g. “how did country gain independence” • Apply patterns / templates to click-through data / query logs to mine new named entities Marius Pasca: Weakly-supervised discovery of named entities using web search queries. CIKM 2007:683-690 Gu Xu, Shuang-Hong Yang, Hang Li: Named entity mining from click-through data using weakly supervised latent dirichlet allocation. KDD 2009:1365-1374 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 77
  • 78. Information extraction: relations • Relation extraction – <“Slovenia”, “borders”, “Italy”> • Relation resolution – <“Slovenia”, borders, “Italy”> – <“Slovenia”, next_to, “Italy”> • We distinguish between open and bootstrapped approaches
  • 79. Relation Extraction • Extracting relations – Typical paraphrase problem: identify all the ways a relation may be expressed • Formulated as a classification task, e.g. using an SVM – Training data: parse trees, with nodes associated with a type as well as a role (e.g. role=member, role=affiliation to capture a person-affiliation relation) – Tree-based kernel: two trees are similar if their roots have the same type and role, and each has a subsequence of children (not necessarily consecutive) with the same types and roles – Examples are converted into such parse trees with role labels, and used to train the system – The SVM can then classify new examples of possible relations • Formulated as sequence labeling (semantic role labeling) • Joint inference: considers different types of features (syntactic, semantic) and problems (extraction, resolution)
  • 80. Bootstrapped information extraction • Provide examples for relationships which we want to extract • Compromise: lower coverage, higher quality • Example: Sofie, ReadTheWeb 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 80
  • 81. Open information extraction • We do not want to put constraints on the types of relationships we want to extract • Very interesting for open-domain WWW datasets • Example: TextRunner • Compromise: higher coverage, lower quality • Hybrid approaches: – EntityCube combines both bootstrapped and open extraction 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 81
  • 82. Knowledge management • Organization • Consistency management • Strictness 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 82
  • 83. Knowledge organization • Lexicon: A set of entities and statements • Ontology: A complex graph of formal concepts – Not only concrete entities, but also abstract classes – Sofie/Yago, WebKB, ReadTheWeb, TextRunner • Full world model: A Context-sensitive complex graph • Cyc 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 83
  • 84. Knowledge consistency • Consistency management – Not all extracted information is accurate – Inaccurate information leads to inconsistencies in the knowledge base – Example: • Having pattern “?x is mayor of ?y” and knowledge that <x,mayorOf,y> requires <x,type,Person> and <y,type,City>, we can filter out inconsistent information 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 84
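The mayorOf example can be sketched as a simple type check over extracted triples. The entity names and the dictionaries `ENTITY_TYPES` and `SIGNATURES` are hypothetical stand-ins for the knowledge base; real systems such as SOFIE reason over weighted constraints rather than applying a hard filter like this one.

```python
# Hypothetical background knowledge: entity types and relation signatures.
ENTITY_TYPES = {"alice": "Person", "springfield": "City", "blue_river": "River"}
SIGNATURES = {"mayorOf": ("Person", "City")}


def is_consistent(statement, entity_types, signatures):
    """Check a (subject, predicate, object) triple against the predicate's type signature."""
    subject, predicate, obj = statement
    if predicate not in signatures:
        return True  # no constraint known for this relation
    domain, range_ = signatures[predicate]
    return entity_types.get(subject) == domain and entity_types.get(obj) == range_


# Candidate statements as an extractor applying "?x is mayor of ?y" might produce them.
extracted = [
    ("alice", "mayorOf", "springfield"),
    ("blue_river", "mayorOf", "springfield"),
    ("alice", "mayorOf", "blue_river"),
]
kept = [s for s in extracted if is_consistent(s, ENTITY_TYPES, SIGNATURES)]
print(kept)
```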
  • 85. Knowledge consistency • Examples: – SOFIE: • Select the subset of statements which have the maximum satisfiability with regard to constraints – ReadTheWeb: • Learns new constraints via semi-supervised bootstrap learning • Accuracy grows with ontology complexity
  • 86. Knowledge management • Bootstrapping – Using existing manually prepared knowledge to generate new knowledge – While the knowledge base grows, the rules for extraction also change 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 86
  • 87. Knowledge management • Strictness: – When do we consider entity and relationship resolution important? • Depends on use case: – Reasoning and data integration require well-formed and unambiguous entities and relations – Information retrieval can cope with relations and entities that are not well-formed
  • 88. Machine learning • Used in NLP, IE as well as knowledge acquisition • Various approaches – Self-supervised – Semi-supervised – Supervised 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 88
  • 89. Machine learning • Natural language processing – Part-of-speech learning • Information extraction – Pattern learning • ReadTheWeb, TextRunner, WebKB • Knowledge acquisition – Rule learning (WebKB) – Constraint learning (ReadTheWeb) 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 89
  • 90. Summary • Cyc – Full world model knowledge base • WebKB – First attempt of automatically constructing a knowledge base • TextRunner – Open information extraction 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 90
  • 91. Summary • EntityCube – Hybrid bootstrapped and open IE • SOFIE/YAGO – Tight integration of natural language processing, disambiguation and acquisition • Read The Web – Constraint learning 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 91
  • 92. Entity Linking Source: Tadej Stajner from Jozef Stefan Institute, Ljubljana, Slovenia
  • 93. Basic situation 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 93
  • 94. Pipeline 1. Identify named entity mentions in source text using a named entity recognizer 2. Given the mentions, gather candidate KB entities that have that mention as a label 3. Rank the KB entities 4. Select the best KB entity for each mention 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 94
  • 96. Linking approaches - pair-wise linking • Pair-wise linking: for each in-text entity, choose the candidate entity which is the best w.r.t. description similarity and textual features • Is each disambiguation choice independent? – Pair-wise vs. collective disambiguation 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 96
  • 97. Important ranking features • Mention popularity – P(entity|mention) – P(dbpedia:Kashmir_(song)|"Kashmir") = 0.54 – P(dbpedia:Kashmir_(region)|"Kashmir") = 0.91 – Distribution of links and anchors in Wikipedia • Context similarity – sim(ctx(mention), ctx(entity)) – Context of a mention is the surrounding sentences – Context of an entity is the description of the entity (Wiki article) • Coherence / Collective – Entities that appear together tend to be related to one another – Usually solved by a greedy graph pruning algorithm
  • 98. Collective linking • For each in-text entity, choose the candidate entity which is the most similar to the in-text entity and related to other entities that are already chosen. 01 Apr 2012 98 Tadej Stajner, Dunja Mladenic: Entity Resolution in Texts Using Statistical Learning and Ontologies. ASWC 2009:91-104 Xianpei Han, Le Sun, Jun Zhao: Collective entity linking in web text: a graph-based method. SIGIR 2011:765-774 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web
  • 99. Relatedness • Intuition: entities that co-occur in the same context tend to be more related • How can we express relatedness of two entities in a numerical way? – Statistical co-occurrence – Similarity of entities’ descriptions – Relationships in the ontology 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 99
  • 100. Semantic relatedness • If entities have an explicit assertion connecting them (or have common neighbours), they tend to be related Elvis Memphis Elvis Presley Memphis, Egypt Memphis, TN origin Person Location type type type St. Elvis type 01 Apr 2012 100
  • 101. Co-occurrence as relatedness • If distinct entities occur together more often than by chance, they tend to be related Document FC Barcelona Bayern FC Barcelona Bavaria Bayern München Mutual information Mutual information x y x y x y z 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 101
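One way to quantify "more often than by chance" is pointwise mutual information (PMI) over document-level co-occurrence, sketched below. PMI is used here as a simple stand-in for the mutual information named on the slide, and the six toy documents are invented so that the numbers are easy to follow.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """documents: list of sets of entity names. Returns PMI for each co-occurring pair."""
    n = len(documents)
    single = Counter()
    pair = Counter()
    for entities in documents:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        scores[(a, b)] = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
    return scores


# Toy corpus: each document is the set of entities it mentions.
docs = [
    {"FC Barcelona", "Bayern München"},
    {"FC Barcelona", "Bayern München"},
    {"Bavaria"},
    {"Bavaria"},
    {"FC Barcelona"},
    {"Bayern München", "Bavaria"},
]
scores = pmi_scores(docs)
print(scores)
```

In this toy corpus the football clubs co-occur more often than chance (positive PMI), so a mention "Bayern" next to "FC Barcelona" would lean towards Bayern München rather than Bavaria.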
  • 102. Content similarity as relatedness • If distinct entities have higher similarity of their descriptions, they tend to be related Document a b x y z similarity = 0.35 similarity = 0.25 similarity = 0.35 similarity = 0,7 similarity = 0,1 similarity = 0,2 01 Apr 2012 102
  • 103. Architecture Input text Preprocessing (entity extraction and consolidation) .. with in- text entities Background knowledge (ontology)Match retrieval Entity description vectors Assertion type informativeness Entity co-occurences .. with resolved entities Relatedness Entity linking 103
  • 104. Crowdsourcing for Entity Linking [Figure: ZenCrowd architecture. Components shown: HTML pages (input), entity extractors, LOD index over the LOD Open Data Cloud, algorithmic matchers, probabilistic network, decision engine, micro-task manager, crowdsourcing platform with workers performing micro matching tasks, HTML+RDFa pages (output).] Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012), Lyon, France, April 2012.
  • 105. Crowdsourcing for Entity Linking • Matching micro-task – Unclear (i.e., low confidence) matches are crowdsourced – Top algorithmic results are presented to the workers – Answers from the crowd are input to a probabilistic network 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 105
  • 106. Crowdsourcing for Entity Linking • Probabilistic Graph – Worker prior probability (from previous tasks) – Link prior probability (from algo matchers) – Link factors connect worker clicks and links – SameAs constraints – Dataset unicity constraints [Figure: factor graph with workers w1, w2 and candidate links l1, l2, l3, connected through worker priors pw1, pw2, link priors pl1, pl2, pl3, link factors lf1, lf2, lf3, click variables c11 to c23, a SameAs factor sa1-2 and a unicity factor u2-3]
  • 107. Entity De-duplication “Entity Consolidation” “Entity Resolution” “Record Linkage” “Instance Matching” Sources: Yongtao Ma from Karlsruhe Institute of Technology, Samur Araujo from The Delft Bioinformatics Lab and Aidan Hogan from Digital Enterprise Research Institute
  • 108. Structure • Motivation • Problem and task overview • Consider only explicit owl:sameAs • Consider some lightweight reasoning • Inductive / instance matching methods – Effectiveness – Efficiency 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 108
  • 109. Motivation 340,000 Results 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 109
  • 110. Motivation • 2% of customer records obsolete in 1 month, due to deaths, name changes • $611B/year loss in US due to poor customer data • An average company has 49 different databases and spends 35% of its IT dollars on integration efforts 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 110
  • 111. Motivation 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 111
  • 112. Heterogeneity in naming… Tim Berners-Lee: URIs … timbl:i dblp:100007 identica:45563 adv:timbl fb:en.tim_berners-lee db:Tim-Berners_Lee = owl:sameAs
  • 113. De-duplication for Web data
  • 115. Data integration – big picture • Ontology matching – Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org • Entity de-duplication – Logic-based approaches in the Semantic Web – Studied as record linkage in the database literature – Machine learning based approaches, focusing on attributes – Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data • Improvements over matching based only on attributes • Blending / data fusion – Merging objects that represent the same real-world entity and reconciling information from multiple sources – Information quality / redundancy
  • 116. De-duplication • The problem of determining whether two instances refer to the same real-world entity. owl:sameAs Source Instances Target Instances
  • 117. 1. Find equivalences in the data • Consider only explicit owl:sameAs (baseline) • Consider some lightweight reasoning (extended) • Inductive / instance matching methods 2. Rewrite data according to equivalences (data fusion) De-duplication – task overview 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 117
  • 118. Entity De-duplication Consider only explicit owl:sameAs
  • 119. De-duplication • Use provided owl:sameAs mappings in the data timbl:i owl:sameAs identica:45563 . dbpedia:Berners-Lee owl:sameAs identica:45563 . • Store “equivalences” found timbl:i -> identica:45563 -> dbpedia:Berners-Lee -> timbl:i identica:45563 dbpedia:Berners-Lee
  • 120. • For each set of equivalent identifiers, choose a canonical term timbl:i identica:45563 dbpedia:Berners-Lee De-duplication 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 120
  • 121. De-duplication • Afterwards, rewrite identifiers to their canonical version: – Before: timbl:i rdf:type foaf:Person . identica:48404 foaf:knows identica:45563 . dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date . – After: dbpedia:Berners-Lee rdf:type foaf:Person . identica:48404 foaf:knows dbpedia:Berners-Lee . dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date . – Equivalent identifiers: timbl:i, identica:45563, dbpedia:Berners-Lee
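The baseline steps (collect explicit owl:sameAs links, choose a canonical term per equivalence class, rewrite the data) can be sketched with a union-find structure. Triples are plain Python tuples here, and picking the lexicographically smallest identifier as the canonical term is an assumption of this sketch; it happens to select dbpedia:Berners-Lee as on the slide, but real systems may use other criteria.

```python
def build_canonical_map(same_as_pairs):
    """Map every identifier to the canonical term of its owl:sameAs equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)

    canonical = {}
    for members in classes.values():
        chosen = min(members)  # assumption: lexicographically smallest term wins
        for member in members:
            canonical[member] = chosen
    return canonical


def rewrite(triples, canonical):
    """Replace subjects and objects by their canonical identifier."""
    return [(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in triples]


same_as = [("timbl:i", "identica:45563"), ("dbpedia:Berners-Lee", "identica:45563")]
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
canonical = build_canonical_map(same_as)
fused = rewrite(data, canonical)
print(fused)
```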
  • 122. Entity De-duplication Consider some lightweight reasoning
  • 123. Extended de-duplication • Infer owl:sameAs through reasoning (OWL 2 RL/RDF) 1. explicit owl:sameAs (again) 2. owl:InverseFunctionalProperty 3. owl:FunctionalProperty 4. owl:cardinality 1 / owl:maxCardinality 1 • Example: foaf:homepage a owl:InverseFunctionalProperty . timbl:i foaf:homepage w3c:timblhomepage . adv:timbl foaf:homepage w3c:timblhomepage . ⇒ timbl:i owl:sameAs adv:timbl . …then apply data fusion as before
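The inverse-functional-property rule can be sketched as a grouping step: subjects that share the same value for such a property are inferred to be the same entity. This covers only that one rule, with triples as plain tuples; it is an illustration, not an OWL 2 RL reasoner.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples, inverse_functional):
    """Infer owl:sameAs between subjects sharing a value of an inverse-functional property."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_value[(p, o)].add(s)
    inferred = set()
    for subjects in subjects_by_value.values():
        # Emit one sameAs statement per unordered pair of subjects.
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, "owl:sameAs", b))
    return inferred


triples = [
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
    ("timbl:i", "foaf:name", '"Tim Berners-Lee"'),
]
inferred = infer_same_as(triples, {"foaf:homepage"})
print(inferred)
```

The inferred pairs can then be handled exactly like explicit owl:sameAs statements when building equivalence classes.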
  • 124. Entity De-duplication Inductive / Instance Matching Methods
  • 125. Agenda • Problem overview • Attribute level – (see term matching) • Instance level – Effectiveness: learning – Efficiency: blocking • Dataset level – (see collective entity linking) 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 125
  • 126. Problem overview: effectiveness vs. efficiency • Instance matching – Effectiveness: find correct matches! – Efficiency: do it fast!
  • 127. Efficiency • Comparing every source instance with every target instance costs O(N×M): not efficient
  • 128. “Diclofenac” occurrence on DBPEDIA 01 Apr 2012 128
  • 129. Problem overview – attribute level <A1, ‘Dave White’, ‘Intel’, ‘Male’> <P1, ‘Database…’, ‘John Black’, ‘Don White’> <A2, ‘Don White’, ‘CMU’, ‘Male’> <P2, ‘Multimedia…’, ‘Sue Grey’, ‘D. White’> <A3, ‘Susan Grey’, ‘MIT’, ‘Female’> <P3, ‘Title3…’, ‘Dave White’> <A4, ‘John Black’, ‘MIT’, ‘Male’> <P4, ‘Title5…, ‘Don White’, ‘Joe Brown’> <A5, ‘Joe Brown’, unknown, ‘Male’><P5, ‘Title6…’, ‘Joe Brown’, ‘Liz Pink’> <A6, ‘Liz Pink’, unknown, ‘Female’> <P6, ‘Title7…’, ‘Liz Pink’, ‘D. White’> Attribute level ‘Don White’ , ‘D. White’ ‘Don White’, ‘Dave, White’ • What (values?) • How • Similarity metrics • Similarity threshold • Matching techniques 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 129
  • 130. Problem overview – instance level <A1, ‘Dave White’, ‘Intel’, ‘Male’> <P1, ‘Database…’, ‘John Black’, ‘Don White’> <A2, ‘Don White’, ‘CMU’, ‘Male’> <P2, ‘Multimedia…’, ‘Sue Grey’, ‘D. White’> <A3, ‘Susan Grey’, ‘MIT’, ‘Female’> <P3, ‘Title3…’, ‘Dave White’> <A4, ‘John Black’, ‘MIT’, ‘Male’> <P4, ‘Title5…, ‘Don White’, ‘Joe Brown’> <A5, ‘Joe Brown’, unknown, ‘Male’><P5, ‘Title6…’, ‘Joe Brown’, ‘Liz Pink’> <A6, ‘Liz Pink’, unknown, ‘Female’> <P6, ‘Title7…’, ‘Liz Pink’, ‘D. White’> Instance level <A1, ‘Dave White’, ‘Intel’, ‘Male’> <A2, ‘Don White’, ‘CMU’, ‘Male’> • What (attributes?) • How • Similarity metrics • Similarity threshold • Matching techniques 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 130
  • 131. • How • Similarity metrics • Similarity threshold • Matching techniques Problem overview – dataset level <A1, ‘Dave White’, ‘Intel’, ‘Male’> <P1, ‘Database…’, ‘John Black’, ‘Don White’> <A2, ‘Don White’, ‘CMU’, ‘Male’> <P2, ‘Multimedia…’, ‘Sue Grey’, ‘D. White’> <A3, ‘Susan Grey’, ‘MIT’, ‘Female’> <P3, ‘Title3…’, ‘Dave White’> <A4, ‘John Black’, ‘MIT’, ‘Male’> <P4, ‘Title5…, ‘Don White’, ‘Joe Brown’> <A5, ‘Joe Brown’, unknown, ‘Male’><P5, ‘Title6…’, ‘Joe Brown’, ‘Liz Pink’> <A6, ‘Liz Pink’, unknown, ‘Female’> <P6, ‘Title7…’, ‘Liz Pink’, ‘D. White’> Dataset level • What (instances?) 01 Apr 2012 131
  • 132. Agenda • Problem overview • Attribute Level – (see term matching) • Instance Level – Effectiveness: learning – Efficiency: blocking • Dataset Level – (see collective entity linking) 01 Apr 2012 ECIR 2012 Tutorial - From Expert Finding to Entity Search on the Web 132
  • 133. Character-based • [see term matching in Part 3 on search & ranking] • Edit Distance [G98] – Character operations: insert, delete, replace – Given two strings s and t, edit(s,t) is the minimum cost of the operations that transform s into t • E.g.: edit(Error, Eror)=1, edit(great, grate)=2 – Aimed at common typing errors – Problem: does not work well with other types of errors • E.g.: D. White vs Dave White • Jaro Rule
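The edit distance described on this slide can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration assuming unit cost for every operation; the function name `edit_distance` is ours and does not come from the tutorial.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and replace."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # replace cs by ct, or match
        prev = curr
    return prev[-1]


print(edit_distance("great", "grate"))          # 2
print(edit_distance("D. White", "Dave White"))  # 3
```

The second call illustrates the weakness noted on the slide: an abbreviation costs as much as several typing errors, even though both strings may name the same person.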
  • 134. Token-based • Q-gram – The q-grams of a string are its short character substrings of length q – Example: 3-gram(White)={ ‘Whi’, ‘hit’, ‘ite’ } – Set similarity can then be applied to the q-grams
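A small sketch of the q-gram idea follows. The slide only says that "set similarity" is applied to the q-grams; Jaccard similarity is used here as one possible choice, and the helper names are illustrative.

```python
def qgrams(s: str, q: int = 3) -> set:
    """Return the set of character substrings of length q (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(sorted(qgrams("White")))                     # ['Whi', 'hit', 'ite']
print(jaccard(qgrams("White"), qgrams("Whites")))  # 0.75
```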
  • 136. Questions • Given instance attributes {Name, Institute, Gender, Publish} – Which ones are more important? – Which similarity measures should be adopted? – What is the threshold of similarity that should be adopted?
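  • 137. Bayes Decision Rule • Notation – A, B are two tables with n comparable fields – tuple pairs $\langle\alpha,\beta\rangle$ with $\alpha \in A$, $\beta \in B$ – classes: M (match) and U (non-match) – random vector $x = [x_1, \ldots, x_n]$, where $x_i$ shows the level of agreement of the i-th field for the pair $\langle\alpha,\beta\rangle$ • Decision rule: assign $\langle\alpha,\beta\rangle$ to M if $p(M|x) \ge p(U|x)$, otherwise to U; by Bayes' theorem this is equivalent to $l(x) = \frac{p(x|M)}{p(x|U)} \ge \frac{p(U)}{p(M)}$, where $l(x)$ is called the likelihood ratio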
  • 138. Bayes Decision Rule • Given training data, assume p(xi|M) and p(xj|M) are independent for i≠j [5] • Extensions: – Use an expectation maximization (EM) algorithm to estimate the likelihoods – Relax the independence assumption – Decision with a reject class
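Under the independence assumption, the likelihood ratio factorizes over fields. The sketch below assumes binary agreement indicators (1 = field agrees, 0 = field disagrees) and per-field probabilities that are already estimated; all names and numbers are illustrative assumptions, not values from the tutorial.

```python
from math import prod


def likelihood_ratio(x, m_probs, u_probs):
    """l(x) = p(x|M) / p(x|U) for a binary agreement vector x.

    m_probs[i] = p(x_i = 1 | M), u_probs[i] = p(x_i = 1 | U).
    Fields are assumed conditionally independent given the class.
    Probabilities of exactly 0 or 1 can make the denominator zero.
    """
    p_x_given_m = prod(m if xi else 1 - m for xi, m in zip(x, m_probs))
    p_x_given_u = prod(u if xi else 1 - u for xi, u in zip(x, u_probs))
    return p_x_given_m / p_x_given_u


def is_match(x, m_probs, u_probs, prior_m):
    """Bayes decision rule: declare a match if l(x) >= p(U) / p(M)."""
    return likelihood_ratio(x, m_probs, u_probs) >= (1 - prior_m) / prior_m


# Hypothetical two-field example (e.g. name and institute agreement).
m_probs, u_probs = [0.9, 0.8], [0.1, 0.2]
print(is_match([1, 1], m_probs, u_probs, prior_m=0.1))  # True
print(is_match([0, 0], m_probs, u_probs, prior_m=0.1))  # False
```

A reject class, as listed among the extensions, would replace the single threshold by two thresholds and leave pairs that fall between them undecided.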
  • 140. Blocking strategies • Used to reduce the number of instance comparisons between source and target • Non-overlapping partitions • Canopies and clustering – overlapping partitions
  • 141. Blocking strategies • Attribute dependent: when the source and target schemas match • Attribute agnostic: otherwise
  • 142. Attribute dependent • Blocking Key Value (BKV) – Sorted Neighborhood approach – Q-grams blocking technique • Blocking keys are highly discriminating attributes (e.g. last name, phone number) • Targeting homogeneous datasets
  • 143. Sorted Neighborhood • Motivation: – Similar records have similar values – Multiple “cheap” passes are faster than one “expensive” pass • Goal: sort the data by a key to bring matching records close to each other • Methodology: – Create a key for every record (e.g. first 3 characters of last name) – Sort data by the key – Pair-wise comparison within a small sliding window – Multiple passes based on distinct keys
  • 144. Sorted Neighborhood • Example:
      ID | Name        | SS     | Birthday   | ZIP
      r1 | David Black | 123-45 | 01.05.1985 | 76137
      r2 | Dauid Black | 123-45 | 01.06.1985 | 76137
      r3 | David White | 325-52 | 23.09.1984 | 84212
      r4 | David B.    | 126-53 | 30.10.1983 | 84123
      r5 | David B.    | 745-32 | 07.05.1973 | 84212
      Order after sorting by ZIP[1..3]: r1 r2 r4 r3 r5
      Order after sorting by Name[1..3]: r2 r1 r3 r4 r5
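The two passes of the example can be reproduced with a short sketch. It assumes a stable sort and a window of size 2 (each record is compared only with its immediate successor); the function name and the window size are illustrative choices, not prescribed by the slides.

```python
# Records from the example slide (only the two attributes used as keys).
records = {
    "r1": {"name": "David Black", "zip": "76137"},
    "r2": {"name": "Dauid Black", "zip": "76137"},
    "r3": {"name": "David White", "zip": "84212"},
    "r4": {"name": "David B.", "zip": "84123"},
    "r5": {"name": "David B.", "zip": "84212"},
}


def sorted_neighborhood(records, key, window):
    """Sort record ids by key and pair each record with the next window-1 records."""
    order = sorted(records, key=lambda rid: key(records[rid]))  # stable sort
    pairs = set()
    for i, rid in enumerate(order):
        for other in order[i + 1:i + window]:
            pairs.add(tuple(sorted((rid, other))))
    return order, pairs


zip_order, zip_pairs = sorted_neighborhood(records, lambda r: r["zip"][:3], window=2)
name_order, name_pairs = sorted_neighborhood(records, lambda r: r["name"][:3], window=2)
print(zip_order)   # ['r1', 'r2', 'r4', 'r3', 'r5']
print(name_order)  # ['r2', 'r1', 'r3', 'r4', 'r5']
print(len(zip_pairs | name_pairs))  # 6 candidate pairs instead of all 10
```

Combining the passes keeps the candidate set small while still catching pairs such as (r1, r2), whose names differ by a typing error but whose ZIP codes agree.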
  • 145. Q-gram blocking • Motivation: matching strings have a high overlap of q-grams • Goal: relax the edit distance constraint to a weaker count constraint on the number of matching q-grams • Methodology: given two strings s and t, and an edit distance constraint k – Count Filtering: s and t must share at least LB_{s,t} = max(|s|,|t|) − 1 − (k−1)·q q-grams – Position Filtering: s and t must share at least LB_{s,t} positional q-grams – Length Filtering: ||s|−|t|| ≤ k
  • 146. Q-gram blocking • Example: 3-grams
      s = abaxabaaba: ##a, #ab, aba, bax, axa, xab, aba, baa, aab, aba, ba$, a$$
      t = abaabaaba: ##a, #ab, aba, baa, aab, aba, baa, aab, aba, ba$, a$$
      ED(s,t) ≤ k → |Q(s) ∩ Q(t)| ≥ max(|s|,|t|) − 1 − (k−1)·q
      Here ED(s,t) = 1 and |Q(s) ∩ Q(t)| = 9
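The count and length filters can be checked on the example strings with a short sketch. It assumes the padding convention visible in the example (q−1 leading `#` and q−1 trailing `$` characters) and counts shared q-grams as a multiset intersection; position filtering is omitted, and the function names are illustrative.

```python
from collections import Counter


def padded_qgrams(s: str, q: int = 3) -> Counter:
    """Multiset of q-grams of s after padding with q-1 '#' and q-1 '$'."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_overlap(s: str, t: str, q: int = 3) -> int:
    """Number of q-grams shared by s and t, counting duplicates."""
    return sum((padded_qgrams(s, q) & padded_qgrams(t, q)).values())


def count_lower_bound(s: str, t: str, k: int, q: int = 3) -> int:
    """Minimum number of shared q-grams if the edit distance is at most k."""
    return max(len(s), len(t)) - 1 - (k - 1) * q


def passes_filters(s: str, t: str, k: int, q: int = 3) -> bool:
    """Length filter followed by count filter; survivors still need verification."""
    if abs(len(s) - len(t)) > k:
        return False
    return qgram_overlap(s, t, q) >= count_lower_bound(s, t, k, q)


s, t = "abaxabaaba", "abaabaaba"
print(qgram_overlap(s, t))           # 9
print(count_lower_bound(s, t, k=1))  # 9
print(passes_filters(s, t, k=1))     # True
```

Pairs that pass are only candidates: the filters never discard a true match within distance k, but the exact edit distance still has to be computed for the survivors.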
  • 147. Attribute dependent • Learning the attributes (blocking keys) – Decision tree – Maximum hyper-rectangles [Figure: attributes of DrugBank and DBPEDIA records: Label, Drugname, Sideeffect, Page, Title, Name, Producer, Composition]
  • 148. Attribute dependent • Learning functions of similarity (e.g., Jaccard, Jaro, Levenshtein, Hamming, Cosine, etc.) [Figure: DrugBank vs DBPEDIA, Label = Title, ‘Diclofenac’ ≈ ‘Diclofenac Sodium’]
  • 149. Attribute agnostic • Designed for heterogeneous information spaces (i.e., loose schema binding, noise, missing or inconsistent values, as well as an unprecedented level of heterogeneity) • No knowledge about the schema [Figure: tokens used as blocking keys: software, Corp., radio, film]
  • 150. Attribute agnostic • “All tokens” • Reduce comparison space – Block purging – Block scheduling – Block enumeration – Duplicate propagation – Comparisons propagation – Comparisons pruning
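A minimal sketch of the "all tokens" idea: every token that occurs in any attribute value becomes a blocking key, regardless of the schema. Block purging is approximated here by skipping blocks above a size limit, which is our simplification. The entity descriptions are made up around the tokens shown on the slide, and all names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(entities):
    """Map each token to the ids of the entities whose values contain it."""
    blocks = defaultdict(set)
    for entity_id, values in entities.items():
        for value in values:
            for token in value.lower().split():
                blocks[token].add(entity_id)
    # A block with a single entity produces no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks, max_block_size):
    """Collect distinct pairs from all blocks, skipping oversized blocks."""
    pairs = set()
    for ids in blocks.values():
        if len(ids) > max_block_size:  # simplified block purging
            continue
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical entities with unknown schema: only the values are used.
entities = {
    "e1": ["Radio Corp.", "software"],
    "e2": ["radio", "film"],
    "e3": ["Software Corp."],
}
blocks = token_blocks(entities)
print(sorted(blocks))                                      # ['corp.', 'radio', 'software']
print(sorted(candidate_pairs(blocks, max_block_size=10)))  # [('e1', 'e2'), ('e1', 'e3')]
```

Collecting pairs in a set means that e1 and e3, which share two blocks, are compared only once, which is in the spirit of the comparison propagation step named on the slide.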
  • 151. Entity Storage & Indexing
  • 152. Indexing • Search requires matching and ranking – Matching selects a subset of the elements to be scored • The goal of indexing is to speed up matching – Retrieval needs to be performed in milliseconds – Without an index, retrieval would require scanning through the collection • The type of index depends on the types of data and queries to be supported – DB-style indexing – IR-style indexing
  • 153. IR-style indexing • Index data as text – Create virtual documents from data – One virtual document per subgraph, resource or triple • typically: resource • Key differences to Text Retrieval – RDF data is structured – Minimally, queries on property values are required
  • 154. Horizontal index structure • Two fields (indices): one for terms, one for properties • For each term, store the property on the same position in the property index – Positions are required even without phrase queries • Query engine needs to support the alignment operator • Dictionary is number of unique terms + number of properties
  • 155. Vertical index structure • One field (index) per property • Positions are not required – But useful for phrase queries • Query engine needs to support fields • Dictionary is number of unique terms • Number of fields could be a problem for merging, query performance
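The difference between the two layouts can be illustrated with in-memory dictionaries. This is a toy sketch under our own assumptions (whitespace tokenization, set or list postings, made-up property names); real engines store compressed posting lists, and the alignment operator is emulated here by intersecting (document, position) pairs.

```python
from collections import defaultdict


def vertical_index(docs):
    """One inverted index per property: index[property][term] -> set of doc ids."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, fields in docs.items():
        for prop, text in fields.items():
            for term in text.lower().split():
                index[prop][term].add(doc_id)
    return index


def horizontal_index(docs):
    """Two parallel indices; a term and its property share the same position."""
    term_index, prop_index = defaultdict(list), defaultdict(list)
    for doc_id, fields in docs.items():
        position = 0
        for prop, text in fields.items():
            for term in text.lower().split():
                term_index[term].append((doc_id, position))
                prop_index[prop].append((doc_id, position))
                position += 1
    return term_index, prop_index


def aligned_match(term_index, prop_index, term, prop):
    """Emulated alignment: docs where term and prop occur at the same position."""
    hits = set(term_index.get(term, [])) & set(prop_index.get(prop, []))
    return {doc_id for doc_id, _ in hits}


# Hypothetical virtual documents, one per resource.
docs = {
    "e1": {"name": "peter mika", "city": "barcelona"},
    "e2": {"name": "barcelona fc"},
}
v_index = vertical_index(docs)
t_index, p_index = horizontal_index(docs)
print(v_index["city"]["barcelona"])                          # {'e1'}
print(aligned_match(t_index, p_index, "barcelona", "name"))  # {'e2'}
```

The sketch shows why positions are mandatory in the horizontal layout: without them the term "barcelona" could not be tied to the property in which it occurred.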
  • 156. Indexing using MapReduce • MapReduce is the perfect model for building inverted indices – Map creates (term, {doc1}) pairs – Reduce collects all docs for the same term: (term, {doc1, doc2…}) – Sub-indices are merged separately • Term-partitioned indices
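The map and reduce steps can be simulated in a single process to show the data flow. This sketch is not tied to any MapReduce framework; the shuffle phase is emulated with a dictionary, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term of the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Collect all documents for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    grouped = defaultdict(list)  # stands in for the shuffle/sort between phases
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


index = build_index({"doc1": "entity search", "doc2": "expert search"})
print(index["search"])  # ['doc1', 'doc2']
```

In a distributed run each reducer receives a subset of the terms, which yields the term-partitioned sub-indices mentioned on the slide.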
  • 158. Outline • Expert Finding models • Entity Ranking in Wikipedia • Web Entity Retrieval • Entity Search over Structured Data • Relational Entity Search over Structured Data
  • 159. From Documents to Entities • Document Search
  • 160. From Documents to Entities • Entity Search 1. Ent1 2. Ent2 3. Ent3
  • 161. A taxonomy of Entity Search tasks
  • 162. Expert Finding - Motivation • Scenario – In large companies competencies and skills are spread – Executives need to create a team for a new project: find staff with the right expertise – Someone needs to solve a problem – Example: I need an expert on ontology engineering
  • 163. Expert Finding - Motivation • Goal – Use the digital content available in the enterprise – Create a ranking of people who are experts in the given topic
  • 164. Motivation for System Support • Busy experts do not have time to maintain adequate descriptions of their continuously changing specialized skills • Expert seekers have poorly articulated requirements and are not fully enabled to judge a good expert from a bad one
  • 165. Complicating factors • Volume of communication/publication is not a reliable indication of expertise • Certain topics engender more opinion than facts • Lack of information about past performance of experts • New employees don’t know about informal social networks • Access to expertise is often controlled (informally or formally, by the experts or their management) • Solutions to complex problems require diverse ranges of expertise
  • 166. Evidence of Expertise • Email or bulletin board messages • Corporate communications • Shared folders in file system • Resumes and homepages • Employee database • Email flow • Bibliographic information • Software library usage • Search and publication history • Project time charges [Slide groups these items under: Content, Social networks, Activities] See also bibliography on TREC-ENT wiki: http://www.ins.cwi.nl/projects/trec-ent/wiki/index.php/Bibliography
  • 167. Assumptions • Content – Experts are mentioned in relevant documents – Experts author relevant documents • Social networks – People that interact are likely to share expertise – Evidence in records of information exchange (and co-authorship, co-work) and/or organizational structure
  • 168. Two Basic Approaches • Example query: Who should I ask about the copyright forms? • Document-based: rank docs, extract experts [Figure: ranked documents about copyright forms, each mentioning candidates such as Lori, Ellen and Ian]
  • 169. Document-based Expert Finding • Find and score documents about the topic – Title about topic – Abstract about topic • Aggregate scores for each distinct author
  • 170. Two Basic Approaches • Example query: Who should I ask about the copyright forms? • Document-based: rank docs, extract experts • Candidate-based: rank candidate profiles [Figure: ranked documents mentioning Lori, Ellen and Ian on one side; candidate profiles of Lori, Ellen and Ian matched against ‘copyright forms’ on the other]
  • 171. Voting model • Data fusion techniques • Each ranked document represents a vote for the expertise of a candidate • Vote aggregation: – Number of docs voting for each candidate – Scores of retrieved documents – Ranks of retrieved documents. Craig Macdonald, Iadh Ounis: Voting for candidates: adapting data fusion techniques for an expert search task. CIKM 2006: 387-396
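The three kinds of vote aggregation listed on the slide can be sketched as follows. The concrete formulas (vote count, sum of document scores, sum of reciprocal ranks) are simple instances chosen for illustration; the cited paper studies a larger family of data fusion techniques, and the names and toy data below are our assumptions.

```python
from collections import defaultdict


def aggregate_votes(ranking, candidates_of):
    """Aggregate a document ranking into candidate evidence.

    ranking: list of (doc_id, score) sorted by descending score.
    candidates_of: doc_id -> list of candidates associated with the document.
    Returns three dicts: vote counts, score sums, reciprocal-rank sums.
    """
    votes = defaultdict(int)
    score_sum = defaultdict(float)
    reciprocal_rank_sum = defaultdict(float)
    for rank, (doc_id, score) in enumerate(ranking, start=1):
        for candidate in candidates_of.get(doc_id, []):
            votes[candidate] += 1                       # number of voting docs
            score_sum[candidate] += score               # scores of retrieved docs
            reciprocal_rank_sum[candidate] += 1 / rank  # ranks of retrieved docs
    return dict(votes), dict(score_sum), dict(reciprocal_rank_sum)


# Toy ranking for the query "copyright forms" (document ids are made up).
ranking = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
candidates_of = {"d1": ["Lori"], "d2": ["Ellen", "Lori"], "d3": ["Ian"]}
votes, score_sum, rr_sum = aggregate_votes(ranking, candidates_of)
print(votes)      # {'Lori': 2, 'Ellen': 1, 'Ian': 1}
print(score_sum)  # {'Lori': 5.0, 'Ellen': 2.0, 'Ian': 1.0}
```

Sorting the candidates by any of the three dictionaries gives an expert ranking; which aggregation works best is an empirical question.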
  • 172. User-Oriented Model • Additional real-world constraints • Distance between user and expert – User's previous knowledge on the topic – Contact time (organizational hierarchy, geo location, collaboration). Elena Smirnova, Krisztian Balog: A User-Oriented Model for Expert Finding. ECIR 2011: 580-592
  • 173. Additional Techniques Research Systems • Combine the two basic approaches • Estimate the quality of the evidence • Use of collection/structural knowledge – Treat emails differently from documents – Treat an email’s subject/sender/receiver differently from its body – Locate homepages See also TREC proceedings 2005-2007
  • 174. Additional Techniques Research Systems • Use social network extracted from co-authorship or email lists • Relevance propagation over expertise graph • Use Web Search evidence
  • 175. Expert Finding - References – P@noptic Expert [Craswell et al. Ausweb01] – Balog’s Model 1 and 2 [Balog et al. SIGIR06] – Voting Model [Macdonald and Ounis CIKM06, ECIR07, ECIR08] – Expertise evidence [Macdonald et al. ECIR08] – Vector Space Model [Demartini et al. ECIR09] – Web evidence [Serdyukov et al. TREC08]
  • 177. Ranking… • People • Actors • … Car companies [i.e., insert your fav entity type here] Entity Ranking!!!
  • 178. Wikipedia • Encyclopedia – multilingual, Web-based, free-content, openly-editable: errors are promptly corrected • Articles: – balanced, neutral, and encyclopedic, containing notable verifiable knowledge • Categories / sub-categories • Links, anchor text (Germany -> Albert Einstein)
  • 179. Entities in Wikipedia • Art museums • Countries • Actors, Singers • Monarchs • Artists • Magicians • ...
  • 180. Example Entity Ranking Scenarios • Impressionist art museums in Holland • Countries with the Euro currency • German car manufacturers • Artists related to Pablo Picasso • Countries involved in WWI • Actors who played Hamlet • English monarchs who married French women Many examples on http://www.ins.cwi.nl/projects/inex-xer/topics/
  • 181. Entity Ranking • Topical query Q • Entity (result) type TX • A list of entity instances Xs • An entity is represented by its Wikipedia page • Systems employ categories, structure, links
  • 182. Tasks • Entity Ranking (ER) – Given Q and T, provide Xs • List Completion (LC) – Given Q and Xs[1..m] – Return Xs[m+1..N]
  • 183. Topic 60
      Title (Q): olympic classes dinghy sailing
      Entities (Xs): 470 (dinghy) (#816578); 49er (dinghy) (#1006535); Europe (dinghy) (#855087)
      Categories (TX): dinghies (#30308)
      Description: The user wants the dinghy classes that are or have been olympic classes, such as Europe and 470.
      Narrative: The expected answers are the olympic dinghy classes, both historic and current. Examples include Europe and 470.
  • 184. Formal Model for Entity Ranking – Indexing • Entities • Data Sources “Alexandre Pato” ID: ap12dH5a (born in; 1989) (playing with; acm15hDJ)
  • 185. Formal Model for Entity Ranking • Searching – Users' Information Need – Entity Ranking System
  • 186. Approaches to ES in Wikipedia • Exploit and refine the category structure – WordNet to find entity types (e.g., a professor is a person) • Extend the query – Synonyms and related words (WordNet synsets) • Exploit the link structure – Links in Wikipedia are usually entities – Search keywords also in anchor text of outlinks
  • 187. YAGO – Suchanek et al. 2007 – Highly accurate ontology (>95%) – Extracted from Wikipedia + WordNet – Provides semantic concepts describing Wikipedia entities [Figure: the Wikipedia page ‘Married... With Children’ in the Wikipedia category ‘Sitcoms’, linked by YAGO subClassOf to the WordNet synset ‘Situation Comedy’]
  • 188. Category Based Search • Query expansion by modifying category information – Subcategories • Extracted from Wikipedia – “Children” Categories • Filtered using the YAGO subClassOf relation – “Sibling” Categories • Extracted from Wikipedia • Having the same YAGO type
  • 190. “Children” Categories [Figure: Wikipedia category ‘Sitcoms’ (YAGO subClassOf ‘Situation Comedy’) and its Wikipedia subcategories ‘Latino Sitcoms’, ‘Sitcoms in Canada’, ‘BBC Television Sitcoms’ and ‘Sitcom Characters’ (YAGO subClassOf ‘Fictional Character’)]
  • 191. “Sibling” Categories [Figure: Wikipedia categories ‘Sitcoms’, ‘Latino Sitcoms’, ‘Sitcoms in Canada’, ‘BBC Television Sitcoms’, ..., each linked by YAGO subClassOf to ‘Situation Comedy’]
  • 192. Entity Search over Wikipedia • Search for many different entity types with one system! • Main observations – Link information is important – Cleaning the category structure of Wikipedia is critical (YAGO) – NLP-based techniques on the user query improve effectiveness • Open issues – No temporal evolution of content is considered – Wikipedia is meant to be objective. Gianluca Demartini, Claudiu S. Firan, Tereza Iofciu, Ralf Krestel, and Wolfgang Nejdl. Why Finding Entities in Wikipedia is Difficult, Sometimes. In: "Information Retrieval" 13(5): 534-567, Springer, October 2010.
  • 193. Time-Aware Entity Retrieval • In some cases the time dimension is available – News collections – Blog postings • News stories evolve over time – Entities appear/disappear – Analyse and exploit relevance evolution – Decide about relevance at document level • An Entity Search system can exploit the past to find relevant entities. Gianluca Demartini, Malik Muhammad Saad Missen, Roi Blanco, Hugo Zaragoza. TAER: Time Aware Entity Retrieval. CIKM 2010, Toronto, Canada.
  • 194. Time-Aware Entity Retrieval
  • 195. Time-Aware Entity Retrieval • Evaluation – P3, P5, AvgPrec – Tie-aware measures [McSherry and Najork, ECIR08] • Paired t-test – ** p<<0.01 – * p<0.05 • Related entities are considered non-relevant
  • 196. History Features • We also tried – Weighting history features with doc length – Weighting history features with BM25
      Feature    | P3  | P5    | MAP
      F(e,d)     | .65 | .56   | .60
      F(e,d1)    | .58 | .53   | .56
      F(e,d-1)   | .64 | .56   | .62*
      F(e,H)     | .66 | .59** | .66**
      CoOcc(e,H) | .62 | .57   | .65**
      DF(e,H)    | .63 | .57*  | .65**
  • 197. Dataset and Analysis • TREC Novelty Track 2004 – 25 event topics – 779 relevant news • Entity annotations (7481 entities) • Relevance judgements • How useful is it to find relevant sentences? – P(e is Rel) 0.411 [0.404-0.417] – P(e is NotRel) 0.168 [0.163-0.173] – P(e is Rel | s is Rel) 0.547 [0.534-0.559] – Sentences: • 21727 total, 1.46 entity occurrences • 5122 relevant, 1.88 entity occurrences
  • 198. Data Analysis • How useful is looking at the past? – P(e|d1) 0.893 [0.881-0.905] – P(e|d-1) 0.701 [0.677-0.726] • Is it useful to consider sentence co-occurrence?
      P(e1,e2)    | Relevant | Related | NotRelevant | NotAnEntity
      Relevant    | 0.24     | 0.08    | 0.03        | 0.07
      Related     |          | 0.07    | 0.03        | 0.03
      NotRelevant |          |         | 0.07        | 0.05
      NotAnEntity |          |         |             | 0.04
  • 199. Approach • Entity Ranking features for News articles – Local Features • F(e,d) • FirstSenLen • FirstSenPos • Fsubj • AvgBM25(q,s) • SumBM25(q,s) – History Features • Feature combination – Linear and Machine Learning
  • 200. Local Features
      Feature      | P5  | MAP
      F(e,d)       | .56 | .60
      FirstSenLen  | .36 | .45
      FirstSenPos  | .31 | .43
      Fsubj        | .44 | .50
      AvgBM25(q,s) | .30 | .41
      SumBM25(q,s) | .44 | .52
      All Tied     | .34 | .42
  • 201. Is the past useful? • Looking at previous documents – Entity occurrences so far: F(e,H) – Docs where the entity appeared so far: DF(e,H) – Entity occurrences in the previous doc: F(e,d-1) – Frequency of entity the first time: F(e,d1) – Number of other entities with which the entity co-occurred so far: CoOcc(e,H)
  • 202. History Features • * t-test p value < 0.05 as compared with F(e,d) • ** t-test p value < 0.01 as compared with F(e,d)
      Feature    | P5    | MAP
      F(e,d)     | .56   | .60
      F(e,d1)    | .53   | .56
      F(e,d-1)   | .56   | .62*
      F(e,H)     | .59** | .66**
      CoOcc(e,H) | .57   | .65**
      DF(e,H)    | .57*  | .65**
  • 203. Using the History [Figure: AvgPrec (0 to 1) plotted against the i-th document (i.e., history size+1), from 0 to 60] • Conclusion – Evidence from past documents is very important – Effectiveness should improve over time (run F(e,H))
  • 204. Discussion • New search task: Time-Aware Entity Retrieval • Constructed evaluation benchmark • Experimental Evaluation – Investigated some features and combinations – Information from the past helps most – Obtain 15% improvement over F(e,d)
  • 206. Ranking Entities on the Web • TREC Entity Track 2009-2010 – 50M web pages (including Wikipedia) – Find related entities (return homepages)
  • 207. Ranking Entities on the Web • Approaches – Use Wikipedia (and infoboxes) as background info – Extract entities from tables and lists – Find the homepage given the entity name (see ENS) • Barack Obama -> www.barackobama.com • Since 2010: 1 billion web pages
  • 208. Related Entity Finding • Approaches – Kaptein et al., CIKM10 • Exploits Wikipedia to improve entity retrieval effectiveness • Identifies entity types • Wikipedia external links as source for entity homepage • Anchor text index for entity search – Bron et al., CIKM10 • Entity co-occurrence • Entity type filtering • Context (relation type) • Wikipedia-based homepage finding
  • 209. Discussion
  • 210. Expert Finding - Key Requirements • Identify experts via self-nomination and/or automated analysis of expert communications, publications, and activities • Classify the type and level of expertise of individuals and communities • Validate the breadth and depth of expertise of an individual • Recommend experts, including the ability to rank order experts on multiple dimensions including skills, experience, certification and reputation
  • 211. Current systems • Hardly validate the breadth and depth of expertise – Count mentions – Weight with relevance score – Sometimes weight with authority of document containing candidate mention • Do not really attempt to classify the type and level of expertise
  • 212. Evidence of Expertise • Information about true expertise is often not explicit in artifacts (as opposed to factual knowledge) • Information about expertise is expressed using specialized terms and concepts
  • 213. How to improve? • Integrate more sources of evidence – CV information – Project related data • Including temporal information – Training data (HR dept) • Cost of achieving this evidence for expert vs. non-expert as weighting factor – Participation in TREC, authoring a book, getting a PhD in IR, ... Raymond D'Amore, Expert Finding in Disparate Environments, 2008
  • 214. However... • Two types of challenges to be overcome: – System challenge – Evaluation challenge
  • 215. System Challenges • Multi-lingual entity extraction • Privacy management – E.g., Tacit can email top N experts with private profiles (only recipient knows) • Interoperability with heterogeneous data sources – IMAP, Exchange, Lotus Notes – LDAP, JDBC/ODBC, XML repositories, Peoplesoft, Oracle Financials, Word/Excel/PDF, ...
  • 216. Where is my data? • > 80% of data not in relational databases – Documents, spreadsheets, presentations – Web pages – Email, instant messages, news feeds – Images, audio, video
  • 217. Dataspaces • The complete set of information belonging to one organization or task • Examples: – Personal dataspace – Enterprise dataspace – Community dataspace • E.g., scientific, sports club, ... “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
  • 218. Conclusions so far... • Expert finding could in principle use many more resources that indicate expertise, possibly more reliably, but it is difficult to set up the research – System challenges – Data availability • Motivates research in operational setting – E.g., Raymond D'Amore, Expert Finding in Disparate Environments, 2008
  • 219. Entity Search - Discussion • Similar challenges to Expert Finding • Entity information is spread over the Web – In different formats (HTML, RDF, images) – It is redundant (Wikipedia, DBPedia, homepage) – It varies over time (e.g., population of a country) – It is inconsistent (neutrino vs light speed)
  • 220. Entity Search - References • Approaches exploit – Wikipedia structure (links, categories) • Kaptein et al., CIKM10 (REF) • Demartini et al., IRJ 2010 (XER) – Entity relations • Bron et al., CIKM10 (REF) • Demartini et al., CIKM10 (TAER) – Graph-based methods • Rode et al., INEX08 (XER) • Iofciu et al., ECIR11 (XER) – Probabilistic Models • Weerkamp et al., INEX08 (XER) • Balog et al., ECIR10 (XER)
  • 221. Entity Search - Discussion • Structured data may be the way to improve search effectiveness – Entity identifiers – Entity relations
  • 223. Introduction • Unstructured or hybrid search over RDF data – Supporting end-users • Users who cannot express their need in SPARQL – Dealing with large-scale data • Giving up query expressivity for scale – Dealing with heterogeneity • Users who are unaware of the schema of the data • No single schema to the data – Example: 2.6m classes and 33k properties in Billion Triples 2009 • Entity search – Queries where the user is looking for a single entity named or described in the query – e.g. kaz vaporizer, hospice of cincinnati, mst3000
  • 224. Use cases in web search Top-1 entity with structured data Related entities Structured data extracted from HTML 224
  • 225. Architecture overview [Figure: documents flow through map and reduce tasks into an index; the slide marks the steps as “1st part of the talk” and “2nd part”]
      1. Download, uncompress, convert (if needed)
      2. Sort quads by subject
      3. Compute Minimal Perfect Hash (MPH)
      3. Each mapper reads part of the collection
      4. Each reducer builds an index for a subset of the vocabulary
      5. Optionally, we also build an archive (forward-index)
      5. The sub-indices are merged into a single index
      6. Serving and Ranking
  • 226. Vertical index structure (reminder) • One field (index) per property • Positions are not required • Query engine needs to support fields ✓ Dictionary is number of unique terms ✓ Occurrences is number of tokens ✗ Number of fields is a problem for merging, query performance • In experiments we index the N most common properties
  • 227. BM25F Ranking • BM25(F) uses a term frequency (tf) that accounts for the decreasing marginal contribution of terms: $\widetilde{tf}_i = \sum_{s} v_s \frac{tf_{si}}{B_s}$, where $v_s$ is the weight of field s, $tf_{si}$ is the frequency of term i in field s, and $B_s$ is the document length normalization factor: $B_s = (1 - b_s) + b_s \frac{l_s}{avl_s}$, where $l_s$ is the length of field s, $avl_s$ is the average length of s, and $b_s$ is a tunable parameter. Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference 2011: 83-97
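  • 228. BM25F ranking cont. • Final term score is a combination of tf and idf: $w_i = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \, w_i^{IDF}$, where $\widetilde{tf}_i$ is the field-weighted term frequency of term i, $k_1$ is a tunable parameter, and $w_i^{IDF}$ is the inverse document frequency: $w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$ (N: number of documents, $n_i$: number of documents containing term i) • Finally, the score of a document D is the sum of the scores of the query terms q: $score(D, q) = \sum_{i \in q} w_i$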
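The scoring steps of the last two slides can be put together in a short sketch. It follows the standard BM25F formulation (field-weighted and length-normalized term frequency, saturation with k1, a Robertson-style idf); the parameter values, field names and toy document below are assumptions for illustration, not settings from the cited paper.

```python
from math import log


def bm25f_score(query_terms, doc_fields, field_weight, field_b,
                avg_field_len, doc_freq, num_docs, k1=1.2):
    """Score one document (field -> token list) against a list of query terms."""
    score = 0.0
    for term in query_terms:
        tf_tilde = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            b = field_b[field]
            # B_s = (1 - b_s) + b_s * l_s / avl_s
            norm = (1 - b) + b * len(tokens) / avg_field_len[field]
            tf_tilde += field_weight[field] * tf / norm
        if tf_tilde == 0.0:
            continue  # term does not occur in this document
        n = doc_freq.get(term, 0)
        # Note: this idf is negative for terms in more than half of the documents.
        idf = log((num_docs - n + 0.5) / (n + 0.5))
        score += tf_tilde / (k1 + tf_tilde) * idf
    return score


# Hypothetical entity with two fields; b = 0 switches length normalization off.
doc = {"label": ["kaz", "vaporizer"], "comment": ["a", "vaporizer", "by", "kaz"]}
score = bm25f_score(
    ["kaz"], doc,
    field_weight={"label": 2.0, "comment": 1.0},
    field_b={"label": 0.0, "comment": 0.0},
    avg_field_len={"label": 2.0, "comment": 4.0},
    doc_freq={"kaz": 1}, num_docs=10, k1=1.0,
)
print(round(score, 4))
```

In the example the label match counts twice as much as the comment match, so the combined term frequency is 3 before saturation with k1.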
  • 229. Hierarchical entity model • Unstructured, structured and hierarchical entity model • Hierarchical entity model – Predicate type generation – Predicate generation: importance of a predicate within its type – Term generation: importance of a term is determined by the predicate in which it occurs and all other predicates of that type in the entity. Robert Neumayer, Krisztian Balog, and Kjetil Nørvåg. On the modeling of entities for ad-hoc entity search in the web of data. In ECIR'12.
  • 230. Query Independent Ranking • The question is not which answer is more relevant; i.e. all answers are relevant • The task is finding out which of the answers should be ranked higher • Importance is subjective • Closely related to the popularity of a resource? Lorand Dali, Blaz Fortuna, Thanh Tran Duc and Dunja Mladenic: Learning the Query-Independent Ranking of RDF Entity Search Results. In Proceedings of the 9th Extended Semantic Web Conference (ESWC'12)
  • 231. Towns from Andhra Pradesh • Hyderabad • Srisailam • Chittoor • Masulipatnam • Chandavaram • Mahbubnagar • Gooty • Vijaywada • … 1. All answers are relevant 2. Ranking is important 3. Ranking is static 4. Hard to obtain the true ranking
  • 232. Learning to Rank • Machine learning approach to building a ranking model • We know the true ranking (gold standard) • We represent each answer (resource) as a feature vector • The final score is a linear combination of the features, and the weights have to be learned • Example: the true ranking A, B, C yields the pairwise preferences A better than B, A better than C, B better than C
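As a rough illustration of the slide above, the sketch below derives pairwise preferences from a known true ranking and learns the weights of a linear scoring function from them. The perceptron-style update is a stand-in chosen for brevity; the slides do not say which learner was used, and all function names are hypothetical.

```python
from itertools import combinations


def pairwise_preferences(true_ranking):
    """Turn a ranked list into (better, worse) pairs."""
    return list(combinations(true_ranking, 2))


def linear_score(weights, features):
    """Final score: linear combination of the feature values."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(preferences, features, n_features, epochs=50, lr=0.1):
    """Perceptron-style pairwise learner (an assumption, not the authors' method)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in preferences:
            # Update only when the preferred item does not score strictly higher.
            if linear_score(weights, features[better]) <= linear_score(weights, features[worse]):
                for i in range(n_features):
                    weights[i] += lr * (features[better][i] - features[worse][i])
    return weights
```

Usage: `pairwise_preferences(["A", "B", "C"])` returns the three pairs listed on the slide, and sorting resources by `linear_score` with the learned weights gives the predicted ranking.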
  • 233. Ranking Features • Importance derived – from Graph analysis – from Wikipedia – from Web search engine – from other external sources (N-gram databases)
  • 234. Graph Features
  • 235. Graph Features • Pagerank • Hubs and Authorities • RDF graph features – nRSubj - number of relations where this resource appears as the subject – nRObj - number of relations where this resource appears as the object – nLiteral - number of relations where this resource appears as the subject and the object is a literal
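The three RDF graph features defined on the slide can be counted in a single pass over the triples. This is a minimal sketch that assumes each triple carries a flag saying whether its object is a literal; that input layout and the function name are hypothetical.

```python
from collections import Counter


def rdf_graph_features(triples):
    """Count nRSubj, nRObj and nLiteral per resource.

    triples: iterable of (subject, predicate, object, object_is_literal).
    """
    n_subj, n_obj, n_literal = Counter(), Counter(), Counter()
    for s, _p, o, is_literal in triples:
        n_subj[s] += 1              # resource appears as the subject
        if is_literal:
            n_literal[s] += 1       # subject of a relation whose object is a literal
        else:
            n_obj[o] += 1           # resource appears as the object
    resources = set(n_subj) | set(n_obj)
    return {r: {"nRSubj": n_subj[r], "nRObj": n_obj[r], "nLiteral": n_literal[r]}
            for r in resources}
```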
  • 236. Importance of Wikipedia Pages • Popularity – How many people visited a particular page during June-January 2010 – Data obtained from the Wikipedia access logs available at: http://dammit.lt/wikistats – Captures importance from the point of view of users • Page length – How much text a Wikipedia page contains – Importance from the authors’ perspective • Number of edits – Importance from the editors’ perspective
  • 237. Features Approximating Importance Correlate Well • Compare the ranking based on page length and the ranking based on number of edits with page popularity – Page length: Spearman’s CC 0.60, NDCG 0.84 – Number of edits: Spearman’s CC 0.78, NDCG 0.93
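For readers unfamiliar with the first measure, Spearman's correlation coefficient compares two rankings of the same items. The sketch below uses the textbook formula 1 − 6 Σ d² / (n (n² − 1)) and assumes there are no tied values; it illustrates the measure only and does not reproduce the numbers on the slide.

```python
def spearman(xs, ys):
    """Spearman rank correlation of two equally long value lists (no ties assumed)."""
    def ranks(values):
        # Rank 1 goes to the largest value.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Here `xs` could be page lengths and `ys` page view counts for the same list of Wikipedia pages.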
  • 238. Web Search Features • How many search results do we get in a web search if we search for: – The answer’s name – Keywords from the answer’s description • We used Yahoo! BOSS services to do the search • Querying the web for many resources is expensive
  • 239. N-gram features • Similar to web search features • We look at how many times the name of a resource appears in a large N-gram database (e.g. Google N-grams, Google Book N-grams, etc.) • A cheaper way to see how many times a resource appears on the web or in books
  • 241. Introduction • Intuitive keyword search interface over databases • “A direction” of semantic search, which employs semantics of – Relational information (structured data) in – Different datasets to produce complex structured, aggregated results to answer complex information needs • Short version of the Semantic Search tutorial at ESSIR’11 – Matching Techniques – Ranking Techniques • Complementary to the DB keyword search tutorial [Chen et al, SIGMOD09], emphasizes – The role of textual data: data graphs with textual content nodes – Ranking
  • 243. Structure • Keyword search: keywords over data graphs – Term matching – Content matching – Structure matching • Schema-based keyword search • Schema-agnostic keyword search – Online search algorithms – Index-based approaches
  • 244. Keyword search approaches • Finding “substructures” matching keyword nodes • Different result semantics for different types of data – Textual data (Web pages connected via hyperlinks) – DB (tuples connected via foreign keys) – XML (elements/attributes connected via parent-child edges) • Commonly used results: Steiner tree / subgraph – Connect keyword matching elements – Contain one keyword matching element for every query keyword – Minimal substructures: closely connected keyword nodes • Query is ambiguous, lacks explicit structure constraints – NP-hard, thus efficiency of matching is a problem – Large number of candidate matches, thus ranking is a problem
  • 245. Keyword search on hybrid data graphs • Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” • [Figure: hybrid data graph with structured nodes (Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Berlin, Semantic Search, 2009, 34), edge labels such as “knows”, “works at” and “shared apartment”, and textual content nodes, e.g. “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do”, plus the photo sunset.jpg titled “Beautiful Sunset”. The figure marks where term matching, content matching and structure matching apply.]
  • 246. Term matching • Distance-based (syntax) – Levenshtein distance (edit distance) – Hamming distance – Jaro-Winkler distance • Dictionary-based (semantics) – Taxonomy – Dictionary of similar words – Translation memory – Ontologies
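Of the distance-based measures listed above, Levenshtein distance is the most widely used for matching a keyword against slightly different spellings in the data. A minimal sketch of the standard dynamic-programming computation follows; the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = cur
    return prev[-1]
```

A term matcher could, for example, accept data terms whose distance to the query keyword is below a small threshold; the threshold is a tuning choice not given in the slides.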
  • 247. Content matching • Retrieve partial matches • Inverted list (inverted index): k_i → {<d1, pos, score, ...>, <d2, pos, score, ...>, ...} • Combine partial matches: union or join • [Figure: the partial matches for “shared”, “Berlin” and “Alice” are combined on document D1]
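The two steps on this slide, retrieving partial matches from an inverted index and combining them by union or join, can be sketched as follows. Whitespace tokenization, lowercasing and the function names are simplifying assumptions; scores are omitted from the postings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its postings: a list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            index[token].append((doc_id, pos))
    return index


def _doc_sets(index, keywords):
    return [{doc_id for doc_id, _ in index.get(k.lower(), [])} for k in keywords]


def match_all(index, keywords):
    """Join: documents that contain every keyword."""
    sets = _doc_sets(index, keywords)
    return set.intersection(*sets) if sets else set()


def match_any(index, keywords):
    """Union: documents that contain at least one keyword."""
    sets = _doc_sets(index, keywords)
    return set.union(*sets) if sets else set()
```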
  • 248. Structure matching • Retrieve structured data given patterns (e.g. triple patterns) • Index on tables • Multiple “redundant” indexes to cover different access patterns • Combine: union or join • Blocking, e.g. linear merge join (requires sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes • [Figure: SP-index and PO-index lookups for the query ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”; the patterns Per1 ns:works ?v and ?v ns:name “KIT” are matched by the triples Per1 ns:works Ins1 and Ins1 ns:name KIT, which are then joined] • Structure not explicitly given in query → exploration / other kinds of join
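To make the idea of redundant indexes concrete, the sketch below builds an SP-index (subject and predicate to objects) and a PO-index (predicate and object to subjects) and answers the two-pattern query from the slide, ?x ns:works ?v . ?v ns:name “KIT”, with an index nested-loop join. The join strategy and function names are assumptions for illustration; the slide also mentions merge joins and symmetric hash joins, which are not shown here.

```python
from collections import defaultdict


def build_indexes(triples):
    """Build two redundant indexes over (subject, predicate, object) triples."""
    sp_index = defaultdict(list)   # (subject, predicate) -> objects
    po_index = defaultdict(list)   # (predicate, object)  -> subjects
    for s, p, o in triples:
        sp_index[(s, p)].append(o)
        po_index[(p, o)].append(s)
    return sp_index, po_index


def who_works_at(po_index, org_name):
    """Evaluate ?x ns:works ?v . ?v ns:name org_name by joining on ?v."""
    organisations = po_index.get(("ns:name", org_name), [])
    people = set()
    for v in organisations:
        # Second lookup is driven by the bindings of ?v from the first pattern.
        people.update(po_index.get(("ns:works", v), []))
    return sorted(people)
```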
  • 249. Structure • Keyword search: keywords over data graphs – Term matching – Content matching – Structure matching • Schema-based keyword search • Schema-agnostic keyword search – Online search algorithms – Index-based approaches
  • 250. Matching in keyword search – schema-based • Operate on schema graph • Query interpretation – Compute queries instead of results – Query presentation – Query processing by DB engine • Leverage the power of underlying DB query engine • [Figure: keywords “Alice”, “Bob”, “KIT” with Result 1 and Result 2] [Tran et al, ICDE09] [Hristidis et al, VLDB02] [Agrawal et al, ICDE02] [Qin et al, SIGMOD09]
  • 251. Structure • Keyword search: keywords over data graphs – Term matching – Content matching – Structure matching • Schema-based keyword search • Schema-agnostic keyword search – Online search algorithms – Index-based approaches
  • 252. Matching in keyword search – schema-agnostic • Operate on data graph – No schema needed – Flexibly support different types of data e.g. hybrid data graphs – Native tailored optimization • Online in-memory graph search • Using materialized indexes • [Figure: keywords “Alice”, “Bob”, “KIT” with Result 1 and Result 2] [He et al, SIGMOD07] [Li et al, SIGMOD08] [Tran et al, CIKM11] [Kacholia et al, VLDB05]
  • 253. Online search – top-k exploration • Compute Steiner trees with distinct roots • Backward expansion strategy • Run Dijkstra’s single-source shortest-path algorithm – Explore shortest keyword-root paths – To find a root (an answer) – Until k answers are found – Approximate: no top-k guarantee, i.e. further answers found later from other expansion paths may have a higher score • Complete top-k: terminate safely when the lower bound of the top-k candidates is higher than the upper bound of what can be achieved with the remaining inputs [Bhalotia et al, ICDE02] • [Figure: keywords “Alice”, “Bob”, “KIT” and Result 1]
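The backward expansion idea can be sketched as follows: run Dijkstra backwards from the nodes matching each keyword, and treat every node that reaches all keywords as a candidate root. This is a deliberately simplified, non-incremental variant: it explores exhaustively and then sorts, whereas the algorithms on the slide interleave the expansions and stop after k answers. Scoring a root by the sum of its path lengths, the edge-list input and the function names are assumptions.

```python
import heapq


def distances_to_keyword(reversed_graph, keyword_nodes):
    """Multi-source Dijkstra on reversed edges: distance from each node to the keyword."""
    dist = {node: 0 for node in keyword_nodes}
    heap = [(0, node) for node in dist]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry
        for neighbour, weight in reversed_graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist


def top_k_roots(edges, keyword_matches, k):
    """Return up to k (cost, root) pairs; a root must reach every keyword."""
    reversed_graph = {}
    for src, dst, weight in edges:
        reversed_graph.setdefault(dst, []).append((src, weight))
    per_keyword = [distances_to_keyword(reversed_graph, nodes)
                   for nodes in keyword_matches.values()]
    if not per_keyword:
        return []
    roots = set.intersection(*(set(d) for d in per_keyword))
    # Lower total path length means a more compact answer tree.
    return sorted((sum(d[r] for d in per_keyword), r) for r in roots)[:k]
```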
  • 254. Taxonomy of matching approaches • Schema-based vs. schema-agnostic • Online search – Complete top-k – Approximate top-k – Backward expansion, bidirectional search, undirected subgraph exploration, dynamic programming • Indexing for retrieval + join for combine – Path retrieval, then path join – Graph retrieval, then graph pruning – Graph retrieval, then neighborhood / graph join (neighborhood indexed as a set of paths)
  • 256. Structure • Ranking paradigms – Explicit model of relevance – No notion of relevance • Features – Content-based – Structure-based – Structured-content-based
  • 257. Ranking paradigms • No explicit notion of relevance: similarity between the query and the document model – Vector space model (cosine similarity): $Sim(q,d) = Cos((w_{1,d},\ldots,w_{t,d}),(w_{1,q},\ldots,w_{k,q}))$ – Language models (KL divergence): $Sim(q,d) = -KL(\theta_q \| \theta_d) = -\sum_{t \in V} P(t|\theta_q) \log \frac{P(t|\theta_q)}{P(t|\theta_d)}$ • Explicit relevance model – Foundation: probability ranking principle – Ranking results by the posterior probability (odds) of being observed in the relevant class: $P(D|R) = \prod_{t \in D} P(t|R) \prod_{t \notin D} (1 - P(t|N))$
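The two similarity measures named on this slide, cosine similarity in the vector space model and KL divergence between language models, can be computed as in the sketch below. Representing vectors as equal-length lists and language models as term-to-probability dictionaries is an assumption made for brevity; the negated KL divergence is used so that a higher value means a better match.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neg_kl(query_model, doc_model):
    """-KL(theta_q || theta_d).

    doc_model must give a non-zero probability to every term that has
    non-zero probability in query_model (which smoothing ensures).
    """
    return -sum(p_q * math.log(p_q / doc_model[t])
                for t, p_q in query_model.items() if p_q > 0)
```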
  • 258. Features • Features are orthogonal to retrieval models – Weights for query / document vectors? – Language models for document / queries? – Relevance models? – What to use for learning to rank?
  • 259. Features • Dealing with ambiguities: term ambiguity, content ambiguity, structure ambiguity • Content features – Co-occurrences • Terms K that often co-occur form a contextual interpretation, i.e. topics (cluster hypothesis, distributional semantics) • “Berlin” and “apartment” → geographic context → Berlin as city – Frequencies: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR) • Structure features – Structured-content-based: consider relevance at the fine-grained level of attributes – Link-based popularity – Proximity-based
  • 260. Content-based features – frequency • Document statistics, e.g. – Term frequency – Document length • Collection statistics, e.g. – Inverse document frequency – Background language models • $P(t|\theta_d) = \lambda \frac{tf}{|d|} + (1-\lambda) P(t|C)$ • $w_{t,d} = \frac{tf}{|d|} \cdot idf$ • When is an object more likely to be about “Berlin”? – When it contains a relatively high number of mentions of the term “Berlin” – When the number of mentions of the term in the overall collection is relatively low
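The interplay of document and collection statistics can be seen in a language model smoothed with a background model: a term's relative frequency in the document is interpolated with its relative frequency in the whole collection. The sketch assumes a linear interpolation with weight λ (Jelinek-Mercer smoothing); the default value 0.8 and the function name are arbitrary choices for illustration.

```python
from collections import Counter


def smoothed_term_probability(term, doc_tokens, collection_tokens, lam=0.8):
    """lam * tf / |d| + (1 - lam) * P(term | collection)."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_collection = Counter(collection_tokens)[term] / len(collection_tokens)
    return lam * p_doc + (1 - lam) * p_collection
```

Because of the background component, a term that is missing from the document still receives a small non-zero probability as long as it occurs somewhere in the collection.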
  • 261. Structure-based features – links • PageRank – Link analysis algorithm – Measuring relative importance of nodes – Link counts as a vote of support – The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links) • ObjectRank – Types and semantics of links vary in structured data – Authority transfer schema graph specifies connection strengths – Recursively compute authority transfer data graph [Hristidis et al, TDS08] • When is an object (about “Berlin”) more important? – When a relatively large number of objects link to it • How to incorporate it into a content-based retrieval model?
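The recursive definition of PageRank given above is usually evaluated by power iteration. The sketch below is a plain, unoptimized version; the damping factor 0.85, the fixed iteration count and the even redistribution of rank from nodes without outgoing links are common conventions assumed here, not details taken from the slides. ObjectRank's typed authority transfer is not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps a node to the list of nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Each outgoing link passes on an equal share of the node's rank.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```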
  • 262. Structure-based features – proximity • EASE, XRANK, BLINKS, etc. • EASE – Proximity between a pair of keywords – Overall score of a JRT is an aggregation on the score of keyword pairs • XRANK – Ranking of XML documents / elements – Proximity of n is defined based on w, the smallest text window in n that contains all search keywords • When is a structured result (e.g. Steiner tree) more relevant? – When it is more compact s.t. elements are closely related [Li et al, SIGMOD08] [Guo et al, SIGMOD03] adopted from: [Chen et al, SIGMOD09] • How to incorporate it into a content-based retrieval model?
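The quantity w used by XRANK, the smallest text window containing all search keywords, can be found with a sliding window over the tokens of an element. The sketch only computes the window size; how XRANK turns w into a proximity score is not given on the slide and is left out. The function name and whitespace tokenization are assumptions.

```python
from collections import Counter


def smallest_window(tokens, keywords):
    """Length of the shortest token span containing every keyword, or None."""
    needed = set(keywords)
    if not needed:
        return 0
    seen = Counter()
    covered = 0          # number of distinct keywords inside the current window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            seen[token] += 1
            if seen[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers all keywords.
        while covered == len(needed):
            size = right - left + 1
            if best is None or size < best:
                best = size
            left_token = tokens[left]
            if left_token in needed:
                seen[left_token] -= 1
                if seen[left_token] == 0:
                    covered -= 1
            left += 1
    return best
```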
  • 263. Structured-content-based model • Consider structure of objects during content-based modeling, i.e., to obtain structured content-based model – Content-based model for structured objects, structured documents, database tuples… • $P(t|\theta_d) = \sum_{f \in F_d} \alpha_f P(t|\theta_f)$ • When is an object more likely to be about “Berlin”? – When its (important) fields / attributes contain a relatively high number of mentions of the term “Berlin”
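A structured content-based model of this kind can be sketched as a weighted mixture of per-field language models, so that a mention in an important field counts more than one in a minor field. Using unsmoothed maximum-likelihood estimates per field and the particular field weights shown in the usage note are assumptions made for illustration.

```python
def fielded_term_probability(term, fields, field_weights):
    """Mixture of field language models: sum over fields of alpha_f * P(term | field).

    fields maps a field name to its list of tokens;
    field_weights maps a field name to its mixture weight alpha_f.
    """
    probability = 0.0
    for name, tokens in fields.items():
        if tokens:  # skip empty fields to avoid dividing by zero
            probability += field_weights[name] * tokens.count(term) / len(tokens)
    return probability
```

For example, with a weight of 0.7 on a `name` field and 0.3 on a `description` field, an entity whose name is “berlin” scores far higher for that term than one that only mentions it once in a long description.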

Editor's notes

  1. When the competition is copying you, you know that you are doing something right.
  2. Facebook invited, but continues to pursue OGP
  3. This presentation will focus mainly on extracting information from textual data
  4. This presentation will focus mainly on extracting information from textual data
  5. I should also say that state-of-the-art entity resolution approaches use some form of collective resolution. Different algorithms exist (relational learning, joint inferencing, similarity flooding) [adapted from Bhattacharya and Getoor 2007]. Iteratively select entities: start with a prior pair-wise evaluation of candidate entities; then, while the top available candidate is good enough, select the top candidate from the queue and update the evaluations of the available candidates. Candidates are evaluated by the similarity of the entity description and the document, and by relatedness to other selected candidates.
  6. A major requirement of these methods is that the schema describing the data at hand, as well as the properties of its individual attributes, are known a priori. Inevitably, though, this fundamental assumption is broken by the inherent characteristics of heterogeneous information spaces (i.e., loose schema binding, noise, missing or inconsistent values, as well as an unprecedented level of heterogeneity), rendering them inapplicable.
  7. It contains more than 1 million entities and 5 million facts and achieves an accuracy of about 95%. Each Wikipedia page title is a candidate to become an entity in YAGO, and the Wikipedia categories of that page become its containing classes. Wikipedia categories are organized in a directed acyclic graph, which yields a hierarchy of categories.
  8. Why did you use the features
  9. Explain measures
  10. Hybrid data graph with content nodes
  11. Content matching: the query contains not just a single term but several query terms (a predicate), so there is not just one matching operation; the results of matches for parts of the query, produced by several operations, also have to be combined. Instead of online matching, an index is needed for managing large amounts of data and for fast access to matches. Matching can be decomposed into two operations: match and combine. Join: dictionary → posting lists → intersect posting lists.
  12. Assume structure patterns are given in the query, i.e. structured queries, e.g. graph patterns (a popular fragment of the widely used languages SQL and SPARQL). Blocking: iterator-based approaches. Non-blocking: good for streaming, i.e. when we cannot wait for some parts of the results to be completely processed. Linked Data: we cannot wait for sources (some are slower than others), so it is better to push data into query processing as they arrive instead of pulling data and waiting (busy waiting). Structure matching based on given structure patterns demonstrates the idea behind keyword search. However, whereas query structure provides guidance as to which structure elements in the data are relevant, given only keywords all possible structures have to be explored, which calls for other kinds of join.
  13. Following from the excursus: what about semantic features? They are not yet directly incorporated into ranking models, but are only used to generate candidate matches during the matching step. Incorporating them is not straightforward when using statistical ranking models.
  14. Proximity-based ranking employs minimal-distance heuristics to maximize the structural compactness of results. When a JRT is more compact, it is assumed to be more meaningful and relevant. Intuition: keywords specified by the user are closely related and thus should be connected over relatively short paths. Compactness is measured in terms of the length of the paths between nodes, i.e. the proximity: the longer the paths, the less relevant the overall result. The proximity of two keywords is defined based on the proximity of the elements matching these keywords. n_i and n_j are nodes in the graph; sim(n_i, n_j) denotes the compactness between any two nodes; sim(k_i, k_j) denotes the compactness between two keywords (taking into account the compactness of all pairs of nodes matching the two keywords), where C_ki denotes the set of all nodes that match k_i. The overall score of a JRT is an aggregation on the scores of its keyword pairs. n is a keyword search result matching the query keywords.