Thomson Reuters Calais Web Service & the Linked Content Economy
Executive Summary: The rise of the Internet has brought dramatic change to the publishing
industry. While newspapers in particular struggle to adapt, advertisers are cutting budgets,
seeking new efficiencies and increasingly using the Web to go straight to the consumer.
Semantic technologies and new open data resources on the Web give both publishers and
advertisers new tools and services that can help them succeed. The Thomson Reuters Calais
Web service, found at OpenCalais.com, is one such service.
Calais identifies and automatically tags the people, places, companies, facts and events in text.
It then forges connections between those entities and relevant data sets, media files, Wikipedia
entries and more on the open Web. Finally, it gives publishers a new way to share that tagged
content with next generation search engines, news aggregators and others in the content
ecosystem.
“Calais turns static text into ‘Smart Media’ that is enriched with open data and connected to a
dynamic ‘Linked Content Economy’.”
-Thomas (Tom) Tague, Calais initiative lead
Armed with this powerful new tool, forward-looking publishers are automating time-consuming
content operations and increasing editorial productivity. They are also enhancing the value of
their content, improving their user experience and preparing to reach more readers in tomorrow’s
media landscape – increasingly called the ‘linked content economy.’
Background: Calais is a strategic initiative at Thomson Reuters to advance the interoperability of
content and support the company’s mission to provide pervasive intelligent information to its
customers. Calais uses Natural Language Processing to give publishers free metatagging
services, developer tools and an open standard for the generation of semantic content.
The latest update to Calais – Calais 4.0 – significantly advances the initiative’s goals. The
Calais team originally set out to help developers, bloggers and publishers automatically tag their
content to improve search and navigation, and enable new reader engagement features.
With Calais 4.0, the Calais Web service goes beyond metatagging to help publishers enhance
their content, using open data from sources like Wikipedia, DBpedia, GeoNames, the Internet
Movie Database (IMDB), Shopping.com and more. It also makes it easy for publishers to use
their metadata to share their content with next generation content consumers – such as search
engines, news aggregators, ‘related stories’ services and more – to ultimately reach more readers.
With these added capabilities, Calais helps content creators and content consumers alike
connect to the rapidly emerging ‘Linked Content Economy’ and deliver ‘Smart Media.’
The Linked Content Economy & Smart Media: The Linked Content Economy is an evolving
ecosystem of enriched and connected content that helps publishers engage readers, improve the
user experience, and – ultimately – better convert readership to revenue.
Linked Content goes beyond ‘link journalism,’ (linking to related stories, etc.). It uses metadata to
help publishers create “Smart Media” – content that automatically connects the concepts, people,
companies, etc. it contains to a rich array of related data sets and media assets on the Web.
It then uses metadata to help publishers share their “Smart Media” with the rest of the content
ecosystem, including search engines, news aggregators, ‘related stories’ applications and more.
How it Works:
1. Publishers submit content to the Calais
Web service using their Calais API key.
2. Calais tags each person, place, fact and
event in the content, making it machine-
readable and interoperable on the Web.
3. Each piece of content - and each entity or
event in that content - is assigned a unique
identifier (a document ID and many URIs)
that ties back to the Linked Data Cloud.
4. Publishers use the metadata Calais returns
(tags, document IDs and URIs) to enhance
their content and create features like topic
pages that improve the reader experience.
5. Publishers can also use their metadata to
share their content with next generation
search engines, news aggregators, etc.
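As a rough illustration of step 4, the sketch below groups documents under the entities they mention, which is the basic shape of a topic page. It does not call the Calais Web service. The class names, field names and example.org URIs are hypothetical stand-ins for whatever metadata the service actually returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str   # hypothetical identifier; a real service would assign this
    kind: str  # e.g. "Company" or "Person"
    name: str


@dataclass(frozen=True)
class TaggedDocument:
    doc_id: str      # hypothetical document ID
    title: str
    entities: tuple  # the entities tagged in this document


def build_topic_pages(documents):
    """Map each entity to the titles of the documents that mention it."""
    pages = defaultdict(list)
    for doc in documents:
        for entity in doc.entities:
            pages[entity].append(doc.title)
    return dict(pages)


# Illustrative data only; these URIs are not real Calais identifiers.
ibm = Entity("http://example.org/entity/ibm", "Company", "IBM")
buffett = Entity("http://example.org/entity/warren-buffett", "Person", "Warren Buffett")

docs = [
    TaggedDocument("doc-1", "IBM reports quarterly earnings", (ibm,)),
    TaggedDocument("doc-2", "Buffett comments on IBM", (ibm, buffett)),
]

topic_pages = build_topic_pages(docs)
for entity, titles in topic_pages.items():
    print(f"{entity.name} ({entity.kind}): {titles}")
```

Keying the pages on the entity rather than on its display name means two documents that spell a name differently can still land on the same topic page, provided the tagging step assigned them the same identifier.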
Calais participates in this ecosystem as a platform. Calais lays the foundation on which, in
conjunction with Content Management Systems, users can create a next generation publishing
site, service or community.
Calais adopted the Linked Data standard to build a back-end infrastructure and repository,
enabling linkage between concepts and documents. Linked Data is a standard promulgated by
Sir Tim Berners-Lee. Here are some of the open data assets in the Linked Data cloud.
By embracing the Linked Data standard – and by creating a Calais repository of Linked Data
assets on publicly traded companies – Thomson Reuters has built scaffolding that enables Web
sites, social networks and other content-rich applications to navigate between previously separate
silos of data and information. Here’s how it works:
1.) When Calais processes an article, it extracts many named entities. For some classes of
named entities, such as companies, Calais now also returns an HTTP hyperlink, called a
Uniform Resource Identifier (URI).
2.) This hyperlink points into the Calais repository, to a machine readable XML page containing
related content (company description, management team, board of directors, etc.) as well as
links to related assets in DBpedia, from Thomson Reuters, etc.
3.) This linked data infrastructure forms a web-of-links that applications can navigate and use to
pull information up for display or integration into the user experience.
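To make steps 2 and 3 more concrete, here is a minimal sketch of an application reading a machine-readable XML record and collecting the links it could follow next. The XML layout, element names and example.org URIs are invented for illustration and are not the actual format of the Calais repository. The record is embedded as a string so that no network access is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, invented for this sketch only.
SAMPLE_RECORD = """
<company uri="http://example.org/company/ibm">
  <name>International Business Machines Corporation</name>
  <board>
    <member uri="http://example.org/person/cathleen-black">Cathleen Black</member>
    <member uri="http://example.org/person/kenneth-chenault">Kenneth Chenault</member>
  </board>
  <sameAs href="http://example.org/dbpedia/IBM"/>
</company>
"""


def parse_company_record(xml_text):
    """Extract the name, board members and outbound links from one record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "uri": root.get("uri"),
        "name": root.findtext("name"),
        # Each board member carries its own URI, so the app can hop onward.
        "board": {m.text: m.get("uri") for m in root.findall("board/member")},
        "links": [link.get("href") for link in root.findall("sameAs")],
    }


record = parse_company_record(SAMPLE_RECORD)
print(record["name"])
for person, uri in record["board"].items():
    print(f"  {person} -> {uri}")
```

The point of the sketch is that every related item comes back with its own URI, so an application can keep following links instead of re-running a text search at each step.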
Calais has thus created a lingua franca to drive content interoperability, and provided a simple
standard for the sharing of rich semantic metadata.
“Calais provides a transportation layer that enables users to share their semantic metadata with
downstream consumers like search engines, news aggregators, ‘related stories’ applications and
more.”
-Thomas (Tom) Tague, Calais initiative lead
Here’s an example:
A news story breaks on an IBM earnings report. The user wants to find out if IBM has any
affiliation with Warren Buffett of Berkshire Hathaway.
Today such a complex query requires time-consuming research. Search engines can’t hopscotch
through content.
But with Calais:
1. The news application sends the story to Calais.
2. Calais extracts IBM from the news story, ties it to International Business Machines
Corporation in the Linked Data cloud and returns the URI (i.e. hyperlink) for IBM.
3. The app uses the IBM URI to retrieve the list of Board of Directors members from the
Reuters.com content in the Calais repository.
4. The app queries the Board members for their other affiliations and finds a member who is
also on the Board of Coca Cola plus a member who is the CEO of American Express.
5. The app runs a query on shareholders of Coca Cola and finds Berkshire Hathaway.
6. The app runs a query on shareholders of American Express and finds Berkshire Hathaway.
[Figure: link path from IBM Corporation to Berkshire Hathaway]
IBM Corporation, Board of Directors: Cathleen Black, William Brody, Kenneth Chenault, Michael Eskew
Cathleen Black, Other Affiliations: President, Hearst Magazines; Board Member, Coca Cola
Coca Cola, Key Stockholders: Berkshire Hathaway
Kenneth Chenault, Other Affiliations: CEO, American Express
American Express, Key Stockholders: Berkshire Hathaway
Berkshire Hathaway, Management Team: Warren Buffett, Charlie Munger
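The IBM example can be sketched as a short walk over linked records. The names and relationships below come from the example itself. The dictionary layout and the function name are assumptions made for illustration, standing in for the lookups an application would run against the repository.

```python
# Relationships from the worked example; the data layout is an assumption.
BOARD = {
    "IBM": ["Cathleen Black", "William Brody", "Kenneth Chenault", "Michael Eskew"],
}
AFFILIATIONS = {
    "Cathleen Black": ["Hearst Magazines", "Coca Cola"],
    "Kenneth Chenault": ["American Express"],
}
KEY_STOCKHOLDERS = {
    "Coca Cola": ["Berkshire Hathaway"],
    "American Express": ["Berkshire Hathaway"],
}


def find_connections(company, target_holder):
    """Follow company -> board member -> other affiliation -> key stockholder."""
    paths = []
    for member in BOARD.get(company, []):
        for other_company in AFFILIATIONS.get(member, []):
            if target_holder in KEY_STOCKHOLDERS.get(other_company, []):
                paths.append((company, member, other_company, target_holder))
    return paths


connections = find_connections("IBM", "Berkshire Hathaway")
for path in connections:
    print(" -> ".join(path))
```

Each hop here is a dictionary lookup. In the scenario described above, each hop would instead be a request for the record behind a URI, but the traversal logic has the same shape.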
Semantic extraction is far more powerful than keyword search, which can confuse Paris (Texas),
Paris (France) and Paris (Hilton). Calais can determine that the Paris in this particular article is
Paris, Texas, based on sophisticated disambiguation that leverages a variety of clues in the text.
New Applications: Calais 4.0 and beyond will enable many emergent applications including:
- Publisher sites that dynamically mingle and deliver additional relevant content based on user
preferences, profiles, history, friends’ selections and breaking topics that are hot now.
- Media Monitoring tools that deliver slices of relevant information, e.g. content from all sites
and blogs discussing natural disasters occurring near iron mines in Southeast Asia.
- Plug-ins that integrate social networking / community / blogging, and bypass search.
- Semantic ad networks and servers that go beyond keywords to inform ad placement with
context, e.g. preventing airline ads from appearing next to news of air accidents.
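The last item lends itself to a small sketch: once an article carries topic tags, an ad server can filter candidates by context rather than by keyword. The rule table, topic labels and ad records below are hypothetical and only illustrate the idea.

```python
# Hypothetical rules: ad category -> article topics it should not appear beside.
AD_EXCLUSIONS = {
    "airline": {"air accident"},
}


def eligible_ads(article_topics, ads):
    """Return the ads whose category is not excluded by any article topic."""
    topics = set(article_topics)
    return [
        ad for ad in ads
        if not (AD_EXCLUSIONS.get(ad["category"], set()) & topics)
    ]


ads = [
    {"id": "ad-1", "category": "airline"},
    {"id": "ad-2", "category": "bookstore"},
]

crash_story_ads = eligible_ads(["air accident", "aviation"], ads)
travel_story_ads = eligible_ads(["tourism"], ads)
print([ad["id"] for ad in crash_story_ads])
print([ad["id"] for ad in travel_story_ads])
```

The filter works on topic tags assigned to the article, so it does not depend on a particular word such as "crash" appearing in the text.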
Conclusion: Armed with this powerful new tool, publishers are automating content operations,
increasing productivity and cutting costs. They are enhancing the value of their content,
improving their user experience and preparing to lead in the linked content economy.
No one can predict precisely what kinds of creative and potentially game-changing applications
will emerge. With more than nine thousand users in the OpenCalais.com community, Thomson
Reuters expects to see hyper-evolution in many arenas.