This document summarizes a presentation on integrating Drupal with the Calais API for semantic web publishing. It discusses how Calais extracts metadata such as entities, categories and relationships from content and maps them to URIs. It shows how that metadata can be displayed in Drupal using modules such as Calais Collection, More Like This and Topic Hubs, and how it can be exposed as linked open data in RDF and connected to external datasets such as DBPedia through the Calais URIs. The presentation also introduces OpenPublish, an open source CMS platform intended to help publishers leverage semantic web technologies.
Phase2 OpenPublish Presentation SF SemWeb Meetup, April 28, 2009
1. Drupal, Calais & the Semantic Web
Prepared by Frank Febbraro, CTO & Presented by Jeff Walpole, CEO
2. Introductions (and sizing each other up)
Raise your hand if you are a…
Technologist?
Journalist?
SemHead?
Raise your hand if you use or have used Drupal?
Calais API?
Let's play word association…
Linked data
RDF
SPARQL
GRDDL
3. Publishing tech Phase2 is working on
CMS frameworks
Drupal & Java Development
Taxonomy solutions
Geo-tagging & Mapping
Charting & Graphing Data
Semantic Web integration
Open Data/APIs
Topic Hubs
Publishing workflow
Feed Syndication
Buzz and topic trend monitoring
Community collaboration sites
Multi-site & virtual site CMS architecture
An open source CMS installation specifically for publishers – called OpenPublish
5. Why We use Drupal for CMS
Performance/Reliability: Dozens of major publishers and tens of thousands of high-traffic sites turn to Drupal because it is an enterprise-class platform.
Ease/Expense of Implementation: As one of the leading shops developing for this platform, we can be as efficient as anyone and this platform is our preferred technology.
Evolving Technology Extensibility: You need something modular/extensible that allows you to add new features easily and we know this is possible with Drupal.
Easier Modular Enhancements: Drupal's architecture is modular and integrates well without requiring customization to core components that would make them difficult to maintain.
P2 Expertise: Our entire development staff of 12+ developers can support you on Drupal and we are known as one of the top firms in the country.
Large Community Support: You need a community that is active, robust, responsive and growing. We are involved in the Drupal community and have an ear to the ground on features and changes that would affect your site.
Easy Staff Training: The Drupal CMS is intuitive and we are well versed in training others to use it. To support training, there are numerous videos, online tutorials, local classes and even books on how it works.
Decreased Support Costs: Publishers find they can do a lot more themselves and when they do need help, the time is a fraction of what a proprietary CMS would cost for similar changes.
8. How does Calais work?
1. Categorizes and metatags the people, places, companies, facts and events in your content to make it ‘machine-readable,’ and returns that metadata to you.
2. Makes connections between the entities in your content and related data in Wikipedia, GeoNames, the IMDB, Shopping.com and more.
3. Empowers you to share your metadata with search engines, news aggregators, ‘related stories’ applications and others in the content ecosystem.
9. <Topic>M&A</Topic>
<Acquisition offset="494" length="130">
<Company_Acquirer>Reuters</Company_Acquirer>
<Company_Acquired>ClearForest Ltd.</Company_Acquired>
<Status>Planned</Status>
</Acquisition>
<Company>Reuters</Company>
<Company>ClearForest Ltd.</Company>
Reuters Announced the Acquisition of ClearForest
New York - April 30, 2007
Reuters, the global information company, has entered
into an agreement to acquire all of the outstanding
shares of ClearForest Ltd., a privately held provider of
Text Analytics solutions, whose tagging platform and
analytical products allow clients to derive precise
business information from huge amounts of textual
content.
ClearForest has received sufficient shareholder
approval to complete the transaction, which is
expected to close in approximately 30 days, subject to
customary closing conditions. The financial terms were
not disclosed. Reuters plans to retain and continue to
work with the existing management team and their
highly skilled workforces in the US and Israel. It also
plans to continue to support existing products and
customers.
Reuters believes that search will be a pivotal element to
the future of how financial information is sourced and
consumed. As part of its drive into this space, Reuters
has created a new strategic group and appointed
Gerry Campbell, who will oversee the integration of
ClearForest and drive this innovation.
<Product>Text Analytic Solution </Product>
<Company>ClearForest Ltd.</Company>
<Company>Reuters</Company>
<Country>United States</Country>
<Country>Israel</Country>
<Company>Reuters</Company>
<Person>Gerry Campbell</Person>
<ManagementChange offset="2789" length="92">
<Person>Gerry Campbell</Person>
<Company>Reuters</Company>
<Action>Enters</Action>
</ManagementChange>
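The markup on this slide is a simplified illustration of the kind of metadata Calais returns, not the actual response format. As a minimal sketch, assuming such a fragment is wrapped in a hypothetical `<Doc>` root so that it parses as one XML document, the entities and the acquisition event could be pulled out in Python like this (the `summarize` function and its output shape are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Fragment based on the slide, wrapped in a hypothetical <Doc> root element.
# The ampersand in "M&A" is escaped so the sample is well-formed XML.
SAMPLE = """
<Doc>
  <Topic>M&amp;A</Topic>
  <Acquisition offset="494" length="130">
    <Company_Acquirer>Reuters</Company_Acquirer>
    <Company_Acquired>ClearForest Ltd.</Company_Acquired>
    <Status>Planned</Status>
  </Acquisition>
  <Company>Reuters</Company>
  <Company>ClearForest Ltd.</Company>
  <Person>Gerry Campbell</Person>
</Doc>
"""


def summarize(xml_text):
    """Collect unique top-level entities by type and flatten each Acquisition."""
    root = ET.fromstring(xml_text)

    entities = {}
    for tag in ("Company", "Person", "Country"):
        # findall() with a bare tag name only looks at direct children of root.
        names = sorted({el.text.strip() for el in root.findall(tag)})
        if names:
            entities[tag] = names

    events = []
    for acq in root.findall("Acquisition"):
        events.append({
            "acquirer": acq.findtext("Company_Acquirer"),
            "acquired": acq.findtext("Company_Acquired"),
            "status": acq.findtext("Status"),
            # offset/length locate the matching passage in the source text
            "offset": int(acq.get("offset")),
            "length": int(acq.get("length")),
        })
    return entities, events


entities, events = summarize(SAMPLE)
print(entities)
print(events)
```

The point is only that the result is structured: entities grouped by type, and events with named roles plus a pointer back into the text.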
What Would that Look Like (in code)?
32. Linked Data
it’s all about the URIs
Drupal: http://dbpedia.org/resource/Drupal
Washington DC: http://d.opencalais.com/er/geo/city/ralg-geo1/f497898f-2b9b-7cda-ec7b-85d896acbe3e
Calais linked data for humans
44. Drupal
Calais for Drupal
Linked Data
More Like This, Topic Hubs, Geo
Marmoset
Marmoset: microformats for search agents
45. The Big Picture – OpenPublish
Drupal
Calais for Drupal
Linked Data
More Like This, Topic Hubs, Geo
Marmoset
Developing quite a few great SemWeb modules too. Arto is a maniac
Calais provides the Semantic Engine for OpenPublish. It gives us the context to the world outside of our site. So let's talk about how Calais and Drupal work together.
Has anyone used Calais? This represents the core of our discussion. The Calais module sits at the epicenter of this collection of modules. It is an API and an integration with nodes. It provides auto-tagging of your nodes, and the other modules we developed sit on top of the Calais data to drive the power of the metadata into your site and to your users.
As I said, Calais is an auto-tagger. It’s really just a taxonomy integration. Calais Terms are like the maternal twin of Taxonomy. We wanted to make use of taxonomy for the added benefits.
How is it configured? Calais is configured per content type.
Saving is where the magic happens.
Use the relevance threshold to limit the amount of noise. You can also blacklist terms, substitute them, hook into the process, etc.
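To make the threshold, blacklist and substitution ideas concrete, here is a minimal Python sketch. It is not the Calais module's code (that is a Drupal module); the function name, parameters and the relevance scores below are all made up for illustration.

```python
def filter_terms(suggested, threshold=0.5, blacklist=(), substitutions=None):
    """Keep suggested (term, relevance) pairs that survive the filters.

    All names and defaults here are hypothetical, for illustration only.
    """
    substitutions = substitutions or {}
    blocked = {name.lower() for name in blacklist}

    kept = {}
    for name, score in suggested:
        if score < threshold:
            continue  # below the relevance threshold: treat as noise
        if name.lower() in blocked:
            continue  # explicitly blacklisted term
        name = substitutions.get(name, name)  # swap in the preferred term
        # If a substitution merges two terms, keep the higher score.
        kept[name] = max(score, kept.get(name, 0.0))

    # Most relevant first, ties broken alphabetically.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Made-up suggestions and scores for the Reuters/ClearForest article.
suggested = [
    ("Reuters", 0.86),
    ("ClearForest Ltd.", 0.71),
    ("New York", 0.12),
    ("Text Analytics", 0.40),
    ("Reuters Group", 0.55),
]

result = filter_terms(
    suggested,
    threshold=0.3,
    blacklist=["Text Analytics"],
    substitutions={"Reuters Group": "Reuters"},
)
print(result)
```

With these inputs, "New York" falls below the threshold, "Text Analytics" is blacklisted, and "Reuters Group" is folded into "Reuters", leaving two terms.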
Autodiscovery links allow bots, browsers, readers, etc. to find content in other formats related to the current page. Seen here are a few other related content formats; the application/rdf+xml link is the related Calais RDF document in XML form.
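Here is a rough sketch of what a consumer of those autodiscovery links might do, using only Python's standard library. The sample page and its URLs are invented for the example; the only thing taken from the slide is the application/rdf+xml media type.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collects (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.links.append((attr.get("type"), attr["href"]))


def find_alternates(html):
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.links


def rdf_link(html):
    """Return the first alternate link advertised as RDF/XML, if any."""
    for media_type, href in find_alternates(html):
        if media_type == "application/rdf+xml":
            return href
    return None


# Hypothetical page head; the paths are placeholders, not real module output.
PAGE = """
<html><head>
  <title>Example article</title>
  <link rel="stylesheet" type="text/css" href="/style.css" />
  <link rel="alternate" type="application/rss+xml" href="http://example.com/rss.xml" />
  <link rel="alternate" type="application/rdf+xml" href="http://example.com/node/1/calais.rdf" />
</head><body></body></html>
"""

print(find_alternates(PAGE))
print(rdf_link(PAGE))
```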
RDF is great for representing data, but awful for your eyes. That is why semwebbers all wear glasses. This is the #1 comment I have received. RDFa is a method for embedding RDF data into XHTML documents. GRDDL can be used to transform it into RDF. We did not tackle RDFa YET!!! in Calais b/c this is an area that is being worked on and integrated into D7 (at the theme layer) and has already begun. Might be a nice back-port though.
RDF can turn you into stone.
More Like This is a collection of modules built around a core “framework” module that provides a plugin architecture, allowing other modules to provide related content, on or off site.
Start with a More Like This Thumbprint (Terms). This is the thumbprint of a node: the terms that you feel most accurately represent the essence of your node content. Here you select or enter terms, or have Calais prefill them. Calais returns a relevancy score, and we can use that to prefill these automatically.
Configure the relevance score that a term must have to be automagically applied.
When viewing a node, it now provides other relevant on site nodes matched based on taxonomy.
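One simple way to picture on-site matching is to rank other nodes by how many thumbprint terms they share with the current one. The sketch below does exactly that in Python. Counting shared terms is an assumption made for illustration, not a description of how the More Like This module actually scores matches, and the node ids and terms are made up.

```python
def more_like_this(current_id, thumbprints, limit=3):
    """Rank other nodes by the number of thumbprint terms they share.

    thumbprints maps node id -> set of taxonomy terms (hypothetical data).
    """
    current = thumbprints[current_id]
    scored = []
    for node_id, terms in thumbprints.items():
        if node_id == current_id:
            continue
        shared = current & terms  # set intersection
        if shared:
            scored.append((len(shared), node_id, sorted(shared)))

    # Most shared terms first; ties broken by node id for stable output.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [(node_id, shared) for _, node_id, shared in scored[:limit]]


thumbprints = {
    1: {"Toyota", "Japan", "Automotive"},
    2: {"Toyota", "Automotive", "Recall"},
    3: {"Japan", "Tourism"},
    4: {"Drupal", "Calais"},
}

related = more_like_this(1, thumbprints)
print(related)
```

Node 4 shares nothing with node 1, so it never shows up as related content.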
It also does off site searching, seen here using Yahoo’s BOSS, Build your Own Search Service.
Topic Hubs are site pages that aggregate content based on inclusion in taxonomy expressions.
Here is where you can build your expressions. You can broaden or narrow the scope based on the expression you create. But simply put, all nodes, comments, etc. that match this expression will be present in your Topic Hub.
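The talk does not spell out the expression syntax, so the following Python sketch uses a hypothetical nested-tuple form with "and", "or" and "not" just to show how an expression can broaden or narrow what lands in a hub. The node ids and terms are made up.

```python
def matches(expr, terms):
    """Evaluate a taxonomy expression against one node's set of terms.

    An expression is either a term name (str) or a tuple whose first item
    is "and", "or" or "not". This representation is hypothetical.
    """
    if isinstance(expr, str):
        return expr in terms
    op, *args = expr
    if op == "and":
        return all(matches(arg, terms) for arg in args)
    if op == "or":
        return any(matches(arg, terms) for arg in args)
    if op == "not":
        (arg,) = args  # "not" takes exactly one operand
        return not matches(arg, terms)
    raise ValueError(f"unknown operator: {op!r}")


def hub_members(expr, nodes):
    """Return the ids of all nodes whose terms satisfy the expression."""
    return sorted(node_id for node_id, terms in nodes.items() if matches(expr, terms))


nodes = {
    1: {"Toyota", "Japan", "Automotive"},
    2: {"Toyota", "Automotive", "Recall"},
    3: {"Japan", "Tourism"},
}

broad = ("or", "Toyota", "Japan")
narrow = ("and", "Toyota", ("not", "Recall"))

print(hub_members(broad, nodes))
print(hub_members(narrow, nodes))
```

The broad expression pulls in every node, while the narrow one keeps only the Toyota content that is not about recalls.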
There are a variety of plugins, or you can define your own.
This shows how the various plugins present the content on your site that matches your contextual expression.
The map provides some nice features, showing your content based on geographical terms: cities, states or countries.
They are just panels so add whatever you want. Node content, views, blocks, define your own. What makes the TopicHub plugins unique is that they respond to the context of your Hub, using the expression.
Linked Data refers to the linking of RDF datasets across the Semantic Web. Sony referenced over here, is the same Sony talked about over there. This has been a huge goal of the semantic web for quite some time and it is finally alive.
The diagram shows the Linked Data world. New datasets are being released all the time, and this diagram is already obsolete, as the Calais Linked Data is not in it.
Again, RDF is ugly.
DBPedia human-readable data.
Calais has disambiguated these geographical terms and provided lat/lon for us.
But the Calais Linked Data URIs allow much more.
Here we are showing additional data retrieved from DBPedia
Article about Toyota having a rough go at it. Who would have thought a car company would be in financial trouble in this day and age!?!?!
This grabs the most relevant company from Calais and if it is disambiguated, looks up data on DBPedia.
This is a view of the Taxonomy Term edit screen. The Calais Term for Toyota has the following Linked Data URI.
With that URI, we grab the RDF from Calais for the disambiguated company. That RDF doc returned has a link to the DBPedia resource that is “the same as” this resource.
With that Resource URI, we create a SPARQL query to get data from DBPedia via its SPARQL endpoint. (An endpoint is just a fancy name for a web service that responds to SPARQL queries.)
We then render the resultant data into HTML. Easy as Pie.
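The steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the module's code: the sample RDF is a made-up stand-in for what Calais returns (including the placeholder Calais URI), the query asks only for an English label, and nothing is actually sent over the network. The one real address is http://dbpedia.org/sparql, DBPedia's public SPARQL endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"

# Made-up stand-in for the RDF document behind a Calais Linked Data URI.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://d.opencalais.com/er/company/example">
    <owl:sameAs rdf:resource="http://example.org/other/Toyota"/>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/Toyota"/>
  </rdf:Description>
</rdf:RDF>
"""


def same_as_links(rdf_xml):
    """Return every owl:sameAs target in the document, in document order."""
    root = ET.fromstring(rdf_xml)
    return [el.get(f"{{{RDF_NS}}}resource") for el in root.iter(f"{{{OWL_NS}}}sameAs")]


def dbpedia_uri(rdf_xml):
    """Pick the sameAs link that points at a DBPedia resource, if any."""
    for uri in same_as_links(rdf_xml):
        if uri and uri.startswith("http://dbpedia.org/resource/"):
            return uri
    return None


def build_query(resource_uri):
    """Build a small SPARQL query asking for the resource's English label."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label .\n"
        '  FILTER (lang(?label) = "en")\n'
        "}"
    )


def endpoint_url(query, endpoint="http://dbpedia.org/sparql"):
    """Encode the query as a GET request URL (not sent in this sketch)."""
    return endpoint + "?" + urlencode({"query": query})


uri = dbpedia_uri(SAMPLE_RDF)
query = build_query(uri)
print(uri)
print(query)
print(endpoint_url(query))
```

Rendering is then a matter of reading the query results and dropping the values into a template.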
Recognizes search bots (configurable) and sends your page to Calais and injects microformats into the body of your page that crawlers such as Yahoo SearchMonkey can comprehend. So what does this pyramid bring us to?
OpenPublish is a Drupal semantic publishing platform. It consists of Drupal, an install profile, and a number of modules that we have combined to provide a great starting point for publishers, filled with best practices from our experience. There is nothing you could not build yourself, but we have combined the things you would likely want, to save you a few (or few hundred) hours. Save a newspaper.
Go and download it, install it, kick the tires. Let us know your thoughts. We love feedback.
We will be showing people how to install and configure OpenPublish and the Calais Collection modules. Work through issues, give feedback, provide ideas.