An overview of how news organizations use Semantic Web resources such as DBPedia, Freebase, and Wikipedia, and of their adoption of Semantic Web standards.
The Semantic Web And The News
1. The Semantic Web and the News:
Exploitation and Adoption
Ken Ellis
Chief Scientist
2. Agenda
Intro to Daylife
Exploiting the Semantic Web
Named Entities
Toolsets, issues
Adopting / Enabling
Others
Daylife
3. Daylife
A Platform for News Innovation:
A scalable solution for publishers of all sizes to generate more content
and more inventory – with no additional personnel costs
4. Daylife: What We Do
Aggregate Content
Licensed photos (Getty, AP, Reuters)
Articles (scraped, real-time)
Create Metadata
Topics (people, organizations, concepts)
Topic taxonomy, descriptions
Quotes with attribution
Photo identification
Relatedness
Authorship, sentiment analysis, etc.
Deliver to Clients
Web Sites / Modules / Data
Flexibility: API w/ 500 distinct queries
Novel search/ranking algorithms
Free API
5. [Wiki|DB]Pedia and Named Entities
We also want to collect content around a named entity
…and associate it with external data (Wikipedia, Freebase)
6. [Wiki|DB]Pedia and Named Entities
… for a lot of NE’s
(55k newsworthy ones last month)
[Figure: log-log plot of Articles Per Month (1 to 1,000,000) against NE Rank (1 to 100,000)]
8. Daylife and the Semantic Web
Wikipedia
website
API
Wikimedia dumps
DBPedia
Freebase
Partners
IPTC, NewsML
Clients
Proprietary metadata
9. Resources for News Organizations
Wikipedia
website
API
Wikimedia dumps
DBPedia
Freebase
Named Entities: vetting, disambiguation, aliases, prominence
Partners
IPTC, NewsML
Clients
Proprietary metadata
10. [Wiki|DB]Pedia and Named Entities
But:
“… Now, team owner Kevin Buckler is looking to debut
in NASCAR Sprint Cup Series competition, when Mike
Wallace runs in Thursday's Gatorade Duel …”
Which Mike Wallace?
Mike_Wallace_(journalist)
Mike_Wallace_(NASCAR)
Two disambiguation approaches
Given an article and an extracted name, what Wikipedia entry does it map to?
Given a Wikipedia entry, what articles match?
11. [Wiki|DB]Pedia and Named Entities
Articles First:
Wikimedia dumps and DBPedia
Filter for people, organizations, other NE
Construct weighted graph from links
Proxy for prominence (# edits, pageviews, dumps only)
Redirects & disambiguation pages
“Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; redirects also cover spelling variants (Usama/Osama)
Identify names, possibly matching graph nodes
Select set of nodes that minimizes total distance
Perhaps factor in node prominence
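The article-first steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Daylife's implementation: the edge list and weights are invented, the names `build_graph`, `shortest_distances` and `disambiguate` are hypothetical, and the brute-force search over candidate combinations is only practical for a handful of names per article.

```python
import heapq
from itertools import combinations, product

# Invented toy link graph: (node, node, distance). Real weights would be
# derived from Wikipedia/DBPedia links.
EDGES = [
    ("Mike_Wallace_(journalist)", "Chicago_Sun-Times", 1.0),
    ("Chicago_Sun-Times", "Chicago_Bulls", 1.0),
    ("Chicago_Bulls", "Gatorade", 1.0),
    ("Mike_Wallace_(NASCAR)", "NASCAR", 1.0),
    ("Kevin_Buckler", "NASCAR", 1.0),
    ("Gatorade", "NASCAR", 1.0),
]

def build_graph(edges):
    """Turn an edge list into an undirected adjacency map."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def shortest_distances(graph, source):
    """Dijkstra: distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def disambiguate(graph, candidates, prominence=None, prominence_weight=0.0):
    """Pick one node per extracted name so that total pairwise distance is
    minimal. `prominence` (node -> score) optionally lowers the cost."""
    names = list(candidates)
    nodes = {n for options in candidates.values() for n in options}
    dist = {n: shortest_distances(graph, n) for n in nodes}
    best, best_cost = None, None
    for choice in product(*(candidates[name] for name in names)):
        cost = sum(dist[a].get(b, float("inf"))
                   for a, b in combinations(choice, 2))
        if prominence:
            cost -= prominence_weight * sum(prominence.get(n, 0.0) for n in choice)
        if best is None or cost < best_cost:
            best, best_cost = choice, cost
    return dict(zip(names, best))

graph = build_graph(EDGES)
candidates = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
}
print(disambiguate(graph, candidates))
```

With this toy graph the NASCAR reading of "Mike Wallace" wins because it sits closer to the other entities mentioned in the article.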
12. [Wiki|DB]Pedia and Named Entities
[Figure: example link graph connecting Mike Wallace (journalist), Mike Wallace (NASCAR), Chicago Sun-Times, Chicago Bulls, Kevin Buckler, NASCAR, and Gatorade]
I made this up!
13. [Wiki|DB]Pedia and Named Entities
Another possibility: compare text of Wikipedia entry to the article
But:
Wikipedia entries largely historical, small fraction related to current events
Journalists, in providing context for lesser-known individuals, often mention a few other named entities
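The text-comparison idea above could look something like this. It is a minimal sketch assuming a plain bag-of-words cosine similarity; the entry texts are invented one-line placeholders rather than real Wikipedia content, and `cosine_similarity` and `best_entry` are hypothetical names.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_entry(article, entries):
    """Return the entry title whose text is most similar to the article."""
    return max(entries, key=lambda title: cosine_similarity(article, entries[title]))

# Placeholder entry texts, invented for illustration.
ENTRIES = {
    "Mike_Wallace_(journalist)": "American journalist and television correspondent",
    "Mike_Wallace_(NASCAR)": "American stock car racing driver in NASCAR",
}
ARTICLE = ("Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup "
           "Series competition, when Mike Wallace runs in Thursday's Gatorade Duel")

print(best_entry(ARTICLE, ENTRIES))
```

The caveats on the slide still apply: a mostly historical entry may share few words with a current news story, so a score like this is at best one signal among several.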
14. [Wiki|DB]Pedia and Named Entities
NE First approach:
Classifier for race car drivers, Wikipedia to identify names
Filter based on prominence
See EVRI taxonomical paths
http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
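A rough illustration of the NE-first direction: start from one known entity and look for articles that match it. This is a sketch only; a crude keyword rule stands in for a real per-category classifier, and the `Entity` fields, `CATEGORY_TERMS`, the prominence threshold, and the sample articles are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Entity:
    uri: str           # e.g. a Wikipedia title
    aliases: list      # surface forms to look for in text
    category: str      # which per-category rule to apply
    prominence: float  # any proxy score; scale is arbitrary here

# Hypothetical stand-in for a trained "race car driver" classifier.
CATEGORY_TERMS = {
    "race car driver": {"nascar", "race", "racing", "lap", "pit"},
}

def is_about_category(text, category):
    """Keyword rule: true if the text shares any term with the category."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(words & CATEGORY_TERMS[category])

def match_articles(entity, articles, min_prominence=0.0):
    """Return articles that mention an alias and look on-category."""
    if entity.prominence < min_prominence:
        return []  # filtered out as not newsworthy enough
    hits = []
    for article in articles:
        lowered = article.lower()
        mentions = any(alias.lower() in lowered for alias in entity.aliases)
        if mentions and is_about_category(article, entity.category):
            hits.append(article)
    return hits

driver = Entity("Mike_Wallace_(NASCAR)", ["Mike Wallace"], "race car driver", 0.4)
articles = [
    "Mike Wallace runs in Thursday's Gatorade Duel ahead of the NASCAR opener.",
    "Mike Wallace interviewed the president on television last night.",
]
print(match_articles(driver, articles, min_prominence=0.1))
```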
15. [Wiki|DB]Pedia and Named Entities
NE First:
Tractable for a human (limited number of classifiers)
Better for low-recall high-precision
Article First:
Low editorial oversight
Best-guess
Neither is a complete solution
Not for locations
16. [Wiki|DB]Pedia and Named Entities
General Nits
Sticky Graffiti
Wikipedia can be updated in real time if you don’t like it
Some derived data sets can’t. Makes it our problem!
On-demand updates from Wikipedia API / HTML
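One way an on-demand update check against the Wikipedia API might be sketched is shown below. It only builds the request URL and inspects a response that has already been fetched and parsed. The parameters are standard MediaWiki `action=query` / `prop=revisions` parameters; the response layout is assumed to be the legacy JSON format (pages keyed by page id), and the function names and sample response are hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def latest_revision_url(title):
    """URL asking for the timestamp of the latest revision of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def latest_timestamp(response):
    """Newest revision timestamp in a parsed response, or None."""
    pages = response.get("query", {}).get("pages", {})
    stamps = [rev["timestamp"]
              for page in pages.values()
              for rev in page.get("revisions", [])]
    return max(stamps) if stamps else None

def is_stale(cached_timestamp, response):
    """True if the page changed after our cached copy was taken.
    ISO 8601 UTC timestamps in the same format compare correctly as strings."""
    latest = latest_timestamp(response)
    return latest is not None and latest > cached_timestamp

# Invented sample response in the assumed legacy layout.
SAMPLE = {"query": {"pages": {"12345": {
    "title": "Mike Wallace (journalist)",
    "revisions": [{"timestamp": "2009-04-01T12:00:00Z"}],
}}}}

print(latest_revision_url("Mike_Wallace_(journalist)"))
print(is_stale("2009-03-05T00:00:00Z", SAMPLE))
```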
17. [Wiki|DB]Pedia and Named Entities
General Nits
Career Changes
Mike Wallace (journalist) becomes a NASCAR driver
Joe Wurzelbacher becomes a political pundit
Not a complete solution, but we knew that.
18. [Wiki|DB]Pedia and Named Entities
General Nits
Staleness
Infrequent Wikimedia dumps
GWB is still president?
DBPedia bad
Wikimedia dumps bad
Freebase good
Wikipedia HTML/API good
DBPedia, 3/5/09
19. [Wiki|DB]Pedia and Named Entities
Obscure Information
Clint Eastwood:
Is prominent, is a politician
Not a prominent politician
20. [Wiki|DB]Pedia and Named Entities
URI Stability
If this were 1981, unambiguous “George Bush”:
<rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
    <dc:title>George Bush</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>
The NYTimes did this, and still does (API):
“George Bush” tag means George H. W. Bush
A lucky problem to have!
21. Resources
Wikipedia
website
API
Wikimedia dumps
DBPedia
Freebase
Named Entities: GUID’s!, tagging, associations (members of teams), other data
Partners
IPTC, NewsML
Clients
Proprietary metadata
22. Freebase
GUID’s are stable
Query by Wikipedia URI
http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}
Easy-to-find redirects
GWB isn’t president
Professions vs. Types
Easier for topic tagging
Clint Eastwood still a politician
but: easier to tell he’s a minor one
multiple types/professions, not much political data
No good proxy for significance
cross-reference
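The Freebase lookup by Wikipedia URI shown on this slide can be sketched as a small URL builder. The endpoint and the `/wikipedia/en/...` id pattern are taken from the slide. The escaping rule is an assumption inferred from the `$0028` and `$0029` in that example: characters other than ASCII letters, digits, underscore and hyphen are written as `$` plus four uppercase hex digits. The function names are hypothetical, and the code builds the URL without calling the service.

```python
import json
from urllib.parse import quote

MQLREAD = "http://www.freebase.com/api/service/mqlread"  # endpoint from the slide

def freebase_key(title):
    """Escape a Wikipedia title into a Freebase-style key (assumed rule:
    anything outside [A-Za-z0-9_-] becomes $XXXX, uppercase hex)."""
    out = []
    for ch in title:
        if ch.isascii() and (ch.isalnum() or ch in "_-"):
            out.append(ch)
        else:
            out.append("$%04X" % ord(ch))
    return "".join(out)

def wikipedia_id(title, lang="en"):
    """Freebase id for a Wikipedia article, following the slide's pattern."""
    return "/wikipedia/%s/%s" % (lang, freebase_key(title))

def mqlread_url(title):
    """Build the mqlread URL asking for all properties of the topic."""
    envelope = {"query": {"*": None, "id": wikipedia_id(title)}}
    # Compact separators reproduce the JSON shape shown on the slide.
    return MQLREAD + "?query=" + quote(json.dumps(envelope, separators=(",", ":")))

print(wikipedia_id("Mike_Wallace_(journalist)"))
print(mqlread_url("Mike_Wallace_(journalist)"))
```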
24. Interagency Metadata
Data:
authorship
location
caption
sometimes people, category
NE’s hand-typed, often quickly
RSS almost as good
Stripped
Matching problem, but STILL USEFUL
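The matching problem for hand-typed names could be approached with fuzzy string matching against known aliases. This is a minimal sketch using the standard library's `difflib.get_close_matches`; the alias index, the topic ids, the 0.85 cutoff, and the name `match_typed_name` are assumptions for illustration, not anything the agencies or Daylife are known to use.

```python
import difflib

def normalize(name):
    """Lowercase, drop periods, collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())

def match_typed_name(typed, alias_index, cutoff=0.85):
    """Map a hand-typed name to a topic id, tolerating small typos.
    alias_index maps normalized alias -> topic id. Returns None if no
    alias is similar enough."""
    key = normalize(typed)
    if key in alias_index:
        return alias_index[key]  # exact alias hit
    close = difflib.get_close_matches(key, list(alias_index), n=1, cutoff=cutoff)
    return alias_index[close[0]] if close else None

# Hypothetical alias index, e.g. built from redirects.
ALIASES = {
    "hillary clinton": "Hillary_Rodham_Clinton",
    "hillary rodham clinton": "Hillary_Rodham_Clinton",
    "osama bin laden": "Osama_bin_Laden",
    "usama bin laden": "Osama_bin_Laden",
}

print(match_typed_name("Hilary Clinton", ALIASES))   # typo: one "l"
print(match_typed_name("Kevin Buckler", ALIASES))    # not in the index
```

A high cutoff keeps precision up at the cost of missing badly mangled names, which fits the "matching problem, but still useful" framing.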
25. Resources
Q: “Can you use our metadata?”
A: “Sometimes”
Again, matching problem, but good for client-specific topics, still useful
Wikipedia
website
API
Wikimedia dumps
DBPedia
Freebase
Partners
IPTC, NewsML
Clients
Proprietary metadata
26. Others Using the Semantic Web
Having an API
not the Semantic Web, but at least machine-friendly
eventually common, even for publishers
Publishing URI’s for Wikipedia, Freebase, IMDB, etc.
common among non-publishers
parasitic (not bad!)
Querying using the same URI’s
not so common
mutualistic
27. Others Using the Semantic Web
EVRI
API
Topics (mostly, all?) from Wikipedia
Probably taxonomic pathways, facets, derived from Wikipedia
Disambiguation based on above
Published Wikipedia URL’s
Can’t query by Wikipedia, other URI’s
28. Others Using the Semantic Web
Zemanta
Lots of Linked Data
API provides text markup
Developing (with others) a simplified RDFa-based semantic tagging standard
29. Others Using the Semantic Web
Calais (Thomson Reuters)
API extracts NE’s, other information
Provides Linked Data URI’s to others (one-way)
Provides their own endpoints
Not an aggregator
Eventual support for querying
Very clean!
30. Others Using the Semantic Web
The New York Times
Leading charge with publisher API
Their own tagging, great quality
Some major newspapers following suit
Other APIs: NewsGator, Inform, Outside.in
Slow Moves to Digital Access
Full-text RSS rare
API rare
Semantic Web standards rare
Wouldn’t it be great if:
You could ask for content about Mike_Wallace_(American_football)
They pointed you to other rich data sources
31. Wikipedia URI Lookup
A quick service to support lookup for Wikipedia URI’s
http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
or
http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
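Building the two URL forms shown above might look like this. It is a sketch that only constructs the request URL; the endpoint and the `uri` parameter come from the slide, while the helper names and the choice to leave the value unencoded (mirroring the slide's examples) are assumptions, and nothing here says what the service returns.

```python
LOOKUP_ENDPOINT = "http://labs.daylife.com/wikipedia_topic_getInfo.php"
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"

def to_title(uri_or_title):
    """Accept a full Wikipedia URI or a bare title; return the title."""
    if uri_or_title.startswith(WIKI_PREFIX):
        return uri_or_title[len(WIKI_PREFIX):]
    return uri_or_title.replace(" ", "_")

def lookup_url(uri_or_title, full_uri=False):
    """Build the lookup URL in either form shown on the slide."""
    title = to_title(uri_or_title)
    value = WIKI_PREFIX + title if full_uri else title
    # Left unencoded to match the slide; a real client may need to
    # percent-encode the value.
    return LOOKUP_ENDPOINT + "?uri=" + value

print(lookup_url("Barack Obama"))
print(lookup_url("Mike_Wallace_(journalist)", full_uri=True))
```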
32. Thank you
Web Site
http://www.daylife.com
Daylife API
http://developer.daylife.com
Labs
http://labs.daylife.com
Email
ken@daylife.com