1. The role of Thesauri
and Standard Vocabularies
in linking data
Dr. Johannes Keizer
FAO of the United Nations
Office of Knowledge Exchange, Research and Extension
Knowledge and Capacity for Development
2. dr johannes keizer - FAO of the United Nations - knowledge and capacity for development
ThesaurusWorkshop–CASBeijing,2010-10-22
The Development of the Internet
3. Open World Mindset
“Closed” (“normal”) IT environments
• Data sources carefully controlled.
• Data formats “custom-defined” for an application.
Linked data based on an “open world mindset”
• Integrating data from the open Web
• Systems designed to incorporate new information incrementally
• By design, tolerance of incomplete information
4. The Linked Data Universe:
http://www.linkeddata.org (July 2009)
5. The Linked Data Universe:
http://www.linkeddata.org (July 2010)
6. Example: BBC Wildlife Finder
7. Humboldt Squid page, pulled together from a diversity of Linked Data sources
Animal Diversity Web:
Nocturnal way of life
BBC TV Documentary
BBC News item
Wikipedia
8. RDF: a grammar for the language of data
[Diagram: two example statements between resources]
ResourceA relatedTo ResourceB
ResourceA describedBy “Some text”
1. Describe resources using interrelated “statements” (“triples”).
2. Use URIs (unique, globally managed identifiers) as the “words” of statements.
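The two statements on this slide can be sketched in a few lines of Python. This is only an illustration of the triple idea, not an RDF library: the `example.org` URIs and the helper `objects` are assumptions made for the example.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # either a URI or a plain literal such as "Some text"

# Hypothetical namespace, used only for illustration.
EX = "http://example.org/"

triples = [
    Triple(EX + "ResourceA", EX + "relatedTo", EX + "ResourceB"),
    Triple(EX + "ResourceA", EX + "describedBy", "Some text"),
]

def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]

related = objects(triples, EX + "ResourceA", EX + "relatedTo")
```

Because subjects and predicates are URIs, anyone can add further statements about `ResourceA` without coordinating on a database schema.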
9. RDF as a common format for merging data
• http://www.w3.org/2007/Talks/0221-Bangalore-IH/
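Why triples merge so easily can be shown with a small sketch: if two sources use the same URI for the same thing, merging is just a set union. The data below (book URI, property names, values) is invented for illustration and does not come from the talk.

```python
# Two hypothetical sources describing the same resource with triples.
BOOK = "http://example.org/book/1"

dataset_a = {
    (BOOK, "http://example.org/title", "Maize pests"),
}
dataset_b = {
    (BOOK, "http://example.org/title", "Maize pests"),   # same statement as in dataset_a
    (BOOK, "http://example.org/author", "A. Author"),
}

# Merging needs no schema mapping: identical statements collapse, new ones are added.
merged = dataset_a | dataset_b

def describe(graph, subject):
    """Collect all property/value pairs known about one subject."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}

description = describe(merged, BOOK)
```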
10. Finding things related to “genes” across databases
Source: Joanne Luciano, Mitre, and the W3C HCLS IG
11. Role of thesauri/concept schemes
• Born as tools to assure consistency in the indexing of library collections
• Thesauri were based on “terms”, but terms already represented concepts in a non-explicit way
• Hierarchical and associative relationships represented generic ontological domain knowledge
• Candidate building blocks for the semantic web
12. …from thesaurus to ontologies…
13. AGROVOC today
• Around 30,000 concepts
• 600,000 labels in around 20 languages
• One-stop shop for terminological knowledge related to agriculture in general
• A knowledge base of related concepts organized in ontological relationships (hierarchical, associative, equivalence)
• A concept/term/string-based system
• Concepts may be organized in multiple categories
14. Semantic Relationships
Concept to Concept: isA (hierarchy), isPestOf, hasPest
Concept to Term: has_lexicalization (links concepts to their lexical realizations)
Term to Term: isSynonymOf, isTranslationOf, hasAcronym, hasAbbreviation
Term to String: hasSpellingVariant, hasSingular
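The three levels above (concept, term, string) can be sketched as a small data model. This is an illustrative assumption, not the AGROVOC implementation: the class names, the `synonyms` helper, the parent concept URI and the sample labels are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Term:
    label: str
    lang: str
    # Term-to-String level, e.g. hasSpellingVariant
    spelling_variants: List[str] = field(default_factory=list)

@dataclass
class Concept:
    uri: str
    # Concept-to-Term level: has_lexicalization
    lexicalizations: List[Term] = field(default_factory=list)
    # Concept-to-Concept level: isA (hierarchy)
    broader: List["Concept"] = field(default_factory=list)

def synonyms(concept, label, lang):
    """Terms of the same concept and language are treated as synonyms of each other."""
    return [t.label for t in concept.lexicalizations
            if t.lang == lang and t.label != label]

# Hypothetical parent concept and illustrative labels.
cereals = Concept("http://example.org/concept/cereals")
maize = Concept("http://aims.fao.org/aos/agrovoc/c_12332", broader=[cereals])
maize.lexicalizations += [Term("maize", "en"), Term("corn", "en"), Term("maïs", "fr")]
```

Keeping relationships at the concept level means a translation or synonym only has to be attached once, to the concept, rather than to every related term.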
15. The AGROVOC SKOS-XL Model
[Diagram: AGROVOC concepts (8171, 1474, 12332, 6211) have rdf:type SKOS Concept, are connected to each other by skos:broader, and belong to the Agrovoc Concept Scheme (rdf:type SKOS Concept Scheme) via skos:inScheme and skos:topConceptOf. The labels :foo and :bar have rdf:type SKOS Label, are attached to a concept via skosxl:prefLabel and skosxl:altLabel, and carry the skosxl:literalForm values “maize” and “corn”.]
16. http://www.w3.org/2004/02/skos/
17. SKOS-XL output
<rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/agrovocScheme">
  <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme"/>
</rdf:Description>
<rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/c_330829">
  <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
  <skos:inScheme rdf:resource="http://aims.fao.org/aos/agrovoc/agrovocScheme"/>
  <skos:topConceptOf rdf:resource="http://aims.fao.org/aos/agrovoc/agrovocScheme"/>
</rdf:Description>
<rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/xl_en_1278479064610">
  <literalForm xmlns="http://www.w3.org/2008/05/skos-xl#" xml:lang="en">subjects</literalForm>
  <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
</rdf:Description>
URI of AGROVOC concept
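To see that this output is nothing more than a list of triples, it can be read with Python's standard XML parser. The sketch below wraps the slide's fragment in an `rdf:RDF` root element with namespace declarations, which the slide does not show and which is assumed here; `to_triples` is a hypothetical helper that handles only this flat `rdf:Description` style, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
AGROVOC = "http://aims.fao.org/aos/agrovoc/"

# The slide's fragment, wrapped in an assumed rdf:RDF root with namespace declarations.
fragment = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/agrovocScheme">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/c_330829">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:inScheme rdf:resource="http://aims.fao.org/aos/agrovoc/agrovocScheme"/>
    <skos:topConceptOf rdf:resource="http://aims.fao.org/aos/agrovoc/agrovocScheme"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://aims.fao.org/aos/agrovoc/xl_en_1278479064610">
    <literalForm xmlns="http://www.w3.org/2008/05/skos-xl#" xml:lang="en">subjects</literalForm>
    <rdf:type rdf:resource="http://www.w3.org/2008/05/skos-xl#Label"/>
  </rdf:Description>
</rdf:RDF>
"""

def to_triples(xml_text):
    """Turn flat rdf:Description blocks into (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            # ElementTree stores tags as "{namespace}local"; join both parts into one URI.
            predicate = prop.tag[1:].replace("}", "")
            # The object is either a linked resource or the element's literal text.
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "")
            result.append((subject, predicate, obj))
    return result

parsed = to_triples(fragment)
```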
18. Linking vocabularies
Columns: AGROVOC, EUROVOC, UNBIS, Relationship
• AGROVOC: http://aims.fao.org/aos/agrovoc/c_207; EUROVOC: http://eurovoc.europa.eu/219055; UNBIS: agroforestry; Relationship: skos:exactMatch / owl:sameAs
• AGROVOC: http://aims.fao.org/aos/agrovoc/c_4826; EUROVOC: http://eurovoc.europa.eu/220018; UNBIS: MILK; Relationship: skos:exactMatch / owl:sameAs
• AGROVOC: http://aims.fao.org/aos/agrovoc/c_12332; EUROVOC: http://eurovoc.europa.eu/219871; UNBIS: MAIZE; Relationship: skos:exactMatch / owl:sameAs
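How such equivalence links let a machine collect material indexed with different vocabularies can be sketched as follows. The three URI pairs come from the table; the document URIs and the helper names (`build_index`, `equivalents`, `documents_about`) are hypothetical and only serve the illustration.

```python
# Equivalence links between AGROVOC and EUROVOC concepts, taken from the table above.
SAME_AS = [
    ("http://aims.fao.org/aos/agrovoc/c_207", "http://eurovoc.europa.eu/219055"),
    ("http://aims.fao.org/aos/agrovoc/c_4826", "http://eurovoc.europa.eu/220018"),
    ("http://aims.fao.org/aos/agrovoc/c_12332", "http://eurovoc.europa.eu/219871"),
]

def build_index(pairs):
    """Map every URI to the set of URIs declared equivalent to it (itself included)."""
    index = {}
    for a, b in pairs:
        group = index.get(a, {a}) | index.get(b, {b})
        for uri in group:
            index[uri] = group
    return index

def equivalents(index, uri):
    """Return the equivalence group of a URI, or just the URI if nothing is linked."""
    return index.get(uri, {uri})

# Hypothetical documents, each indexed with one concept URI from one vocabulary.
INDEXED = {
    "http://example.org/doc/agris-record": "http://aims.fao.org/aos/agrovoc/c_12332",
    "http://example.org/doc/eurlex-text": "http://eurovoc.europa.eu/219871",
    "http://example.org/doc/other": "http://aims.fao.org/aos/agrovoc/c_4826",
}

def documents_about(index, indexed, concept_uri):
    """Find documents indexed with the concept or with any equivalent concept."""
    wanted = equivalents(index, concept_uri)
    return sorted(doc for doc, concept in indexed.items() if concept in wanted)

index = build_index(SAME_AS)
maize_docs = documents_about(index, INDEXED, "http://aims.fao.org/aos/agrovoc/c_12332")
```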
19. http://agris.fao.org/agris-search/search/display.do?f=2004/ZA/ZA04002.xml;ZA2004000049
20.
http://aims.fao.org/aos/agrovoc/c_7825
http://eurovoc.europa.eu/218754
21. Linking data through common URIs
[Diagram: the AGROVOC, Eurovoc and UNBIS entries for maize each have the skosxl:literalForm “Maize” and are connected by owl:sameAs/exactMatch. Each entry points to a resource indexed with it.]
http://aims.fao.org/aos/agrovoc/c_12332 owl:sameAs http://eurovoc.europa.eu/219871
AGROVOC (http://aims.fao.org/aos/agrovoc/c_12332): http://agris.fao.org/agris-search/search/display.do?f=1996/TR/TR96001.xml;TR9600026
Eurovoc (http://eurovoc.europa.eu/219871): http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:202:0011:0015:EN:PDF
UNBIS: http://unbisnet.un.org:8080/ipac20/ipac.jsp?session=128F308557F34.283092&profile=bib&uri=full=3100001~!685149~!1&ri=1&aspect=subtab124&menu=search&source=~!horizon
22. What are we doing with unstructured data?
• We have enormous amounts of unstructured material
• Most of the documents that we are producing are still semantically unstructured
• Human cataloguing and indexing is becoming ever rarer
• We need machines to do automatic semantic markup of text
• If machines are trained on and based on concept schemes, they are able to do so
23.
24. AgroTagger
• Identifies concepts in unstructured texts
• Uses AGROVOC as a controlled vocabulary
• Prototype under testing with excellent results (entire repository of ICARDA indexed)
• In future, will produce structured RDF files that can be used to link data, like “Open Calais”
25.
26.
27.
28. Live demo: semantic markup
http://viewer.opencalais.com/
http://agropedialabs.iitk.ac.in/Tagger/Agrotagger_text.php
29. The concept scheme workbench
30. The workbench
• A web-based working environment for managing the AGROVOC Concept Server
• Facilitates the collaborative editing of multilingual terminology and semantic concept information
• Includes administration and group management features
• Includes workflows for maintenance, validation and quality assurance of the data pool
• The CS is freely accessible to everybody, to facilitate collaborative editing
31. Group/Action/Status
GROUP
Non registered users
Term editors
Ontology editors
Validators
Publishers
Administrators
ACTION
concept-create
concept-delete
concept-edit
term-create
term-edit
term-delete
..........
STATUS
Proposed by guest
Proposed
Revised by guest
Revised
Validated
Published
Proposed deprecated
Deprecated
32. Concept Life Cycle
GUEST <concept-create>: Proposed by guest
VALIDATOR <validates>: Validated
PUBLISHER <publishes>: Published
TERM EDITOR <concept-edit>: Revised
ADMINISTRATOR <validates>: Published
ONTOLOGY EDITOR <concept-delete>: Proposed deprecated
PUBLISHER <validates>: Deprecated
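The life cycle can be read as a lookup table from (group, action) to the resulting status. The sketch below encodes exactly the seven transitions listed on the slide; the function name `next_status` and the choice to raise `PermissionError` for unlisted combinations are assumptions for illustration, and the workbench's real rules (for example, which status a concept must have before an action is allowed) are not shown on the slide.

```python
# (group, action) -> resulting status, as listed on the slide.
LIFE_CYCLE = {
    ("GUEST", "concept-create"): "Proposed by guest",
    ("VALIDATOR", "validates"): "Validated",
    ("PUBLISHER", "publishes"): "Published",
    ("TERM EDITOR", "concept-edit"): "Revised",
    ("ADMINISTRATOR", "validates"): "Published",
    ("ONTOLOGY EDITOR", "concept-delete"): "Proposed deprecated",
    ("PUBLISHER", "validates"): "Deprecated",
}

def next_status(group, action):
    """Return the status a concept receives when a member of `group` performs `action`."""
    try:
        return LIFE_CYCLE[(group, action)]
    except KeyError:
        # Assumption: combinations not shown on the slide are treated as not permitted.
        raise PermissionError(f"{group} may not perform {action}") from None

status_after_guest_proposal = next_status("GUEST", "concept-create")
```

Note that a deletion never removes a concept directly: it only yields “Proposed deprecated”, and a second role has to confirm it.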
33. Modules
• Home
• Search
• Concept/Term Management
• Relationship Management
• Classification Scheme Management
• Validation
• Consistency Check
• Import/Export
• User/Group Management
• Statistics/Preferences
34. Search
• by string: the user can specify whether the system should search by exact match, beginning with, contains or fuzzy
• by URI or term code, or by range of term codes (e.g. between 123 and 9876)
• by classification schemes
• by creation or modification date
• by specific relationships (e.g. search all concepts using the “has_pest” relationship)
• by status, language
• by notes/attributes
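The four string search modes and the term code range can be sketched with the Python standard library. This is not the workbench's code: the function names, the sample labels, and the use of `difflib` with a 0.8 similarity cutoff for “fuzzy” are assumptions chosen for the illustration.

```python
import difflib

def search_labels(labels, query, mode="exact"):
    """Search labels case-insensitively using one of four modes."""
    q = query.lower()
    if mode == "exact":
        return [label for label in labels if label.lower() == q]
    if mode == "starts_with":
        return [label for label in labels if label.lower().startswith(q)]
    if mode == "contains":
        return [label for label in labels if q in label.lower()]
    if mode == "fuzzy":
        # Assumed similarity measure: difflib's ratio with a cutoff of 0.8.
        by_lower = {label.lower(): label for label in labels}
        close = difflib.get_close_matches(q, list(by_lower), n=5, cutoff=0.8)
        return [by_lower[match] for match in close]
    raise ValueError(f"unknown search mode: {mode}")

def codes_in_range(term_codes, low, high):
    """Return the term codes between low and high, both inclusive."""
    return [code for code in term_codes if low <= code <= high]

# Illustrative labels only.
labels = ["maize", "maize flour", "milk", "agroforestry"]
fuzzy_hits = search_labels(labels, "maiz", mode="fuzzy")
```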
35. Graph Visualization
• Java applet based on TouchGraph
• Visualizes concepts and their relationships with other concepts in a graphical view
36. Web services
[Diagram: the AGROVOC CS WORKBENCH maintains the SKOS Triple Store; Other Applications use web services, which access the triple store and return its response.]
37. AGROVOC Web Services
38. Architecture of the System
39. System Overview
Front end: Google Web Toolkit (GWT), GWT Incubator, Graph Visualization, Web services
Middleware: Hibernate Layer, Protégé OWL API, Gilead Intermediate Layer
Back end: Administrative Database (MySQL), Protégé Triple Store (MySQL)
40. Giving it a try…
A demo version of the AWB, with all functionalities, available to users for testing purposes:
http://202.73.13.50:55234/agrovocdevv10d/
Latest stable release, version 1.0 (read/write):
http://202.73.13.50:55381/agrovocv10i/
Latest stable release, version 1.0 (read only; visitors have view privileges only):
http://202.73.13.50:55481/agrovocv10i/
41. …and more: http://aims.fao.org
42. Thank You!
Editor's Notes
This graph, elaborated by Nova Spivack from Radar Networks, is popular at the moment. The Y-axis is for the increase of information connections. The X-axis is for the increase of social connections. Whereas the Web Operating System in 2030 is still a brilliant guess at the future, the development of the Semantic Web, or Web 3.0, has now gained considerable momentum.
One of the key developments in the semantic web is “Linked Open Data”. The Linked Open Data paradigm claims that existing structured data need to be released from the proprietary silos in which they are at the moment. With the existence of RDF (Resource Description Framework) there are the semantic tools to do so. There is also technology to use RDF. More on this later.
This is a snapshot one year later. The growth is enormous. A central point is DBpedia, “triplified” information from Wikipedia. The different colours represent the different information types, with “life sciences” and “publications” being the most populated areas, but with the area “government” strongly growing. Interesting newcomers in the last months are the two VIVO datasets from the United States describing expertise in science. VIVO is actually a project that started at the agricultural library of Cornell University.
What does this mean in practice? I will show this with an example from the BBC. The biggest consumers (and producers) of LOD are, as far as I know, the BBC and the New York Times (but now also the US government).
During the Web 1.0 phase, web pages were composed by humans. Today most web pages are driven by databases that can be dynamically queried. Through RSS feeds they also contain data from other websites. This BBC web page is a big jump further. It has not been composed by humans and it is not generated from one database. It is generated from different data sources that were present as linked open data, linked only through common URIs.
The “technology” that makes linked open data possible is RDF. Everything in RDF is made of “triples”. A triple means a statement with “Subject-Predicate-Object”, as shown in this example. Ideally, all elements of a triple are represented by a URI, an unambiguous definition of a concept, which is machine readable, but triples can also be built from simple letter strings.
What is now the role of thesauri, and specifically the role of our thesauri, in this setup?
In our team we had very early the idea that thesauri would become important in the development of Web information management. Within the AOS (Agricultural Ontology Service) initiative we have gone a long and winding road. The Google search shows our 2003 paper in JODI. But now AGROVOC has become a showcase for the use of thesauri to build concept schemes.
Some self-appreciation
This is the AGROVOC SKOS model that was developed and decided in April 2010 in active collaboration with Tom Baker, who was a member of the W3C SKOS working group.
SKOS-XL was published as a W3C standard one year ago. The initial versions of SKOS were not sufficient to express the complexities of multilingual thesauri. Margherita Sini from FAO was a member of the SKOS working group, and we are very satisfied that in the end a standard emerged that caters for our needs.
You can see here the AGROVOC encoding in SKOS.
The table shows 3 descriptors that are in AGROVOC, EUROVOC and UNBIS. In AGROVOC and EUROVOC they are already encoded as URIs. We could easily establish relationships like owl:sameAs between the concepts or skos:exactMatch between labels.
In a bibliographical record there is much more hidden information than is displayed with the metadata. Many of the highly structured data are linking to other information on the web. In AGRIS we have now introduced something that we call “naive linking”. An AGRIS record links automatically to Google Maps for the location of the center, and to Google to retrieve the full text of the resource, citation lists or other publications from the authors. This often works, but clearly not always, as it is not controlled by semantics, but only through identity of strings. For an uneducated machine, unfortunately, COW and C.O.W. are the same, whereas peanuts and groundnuts are something different.
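The difference between linking by string identity and linking by concept can be sketched in a few lines. The normalization rule, the concept URIs and the function names below are assumptions invented for the illustration; they do not describe how AGRIS actually matches strings.

```python
import re

def normalize(text):
    """Reduce a string to lowercase letters and digits, as a naive matcher might."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def naive_match(a, b):
    """String identity after normalization: no knowledge of meaning."""
    return normalize(a) == normalize(b)

# Hypothetical concept URIs, for illustration only.
LABEL_TO_CONCEPT = {
    "COW": "http://example.org/concept/cow-animal",
    "C.O.W.": "http://example.org/concept/cow-acronym",
    "peanuts": "http://example.org/concept/groundnut",
    "groundnuts": "http://example.org/concept/groundnut",
}

def concept_match(a, b):
    """Two labels match only if a vocabulary maps both to the same concept URI."""
    concept_a = LABEL_TO_CONCEPT.get(a)
    concept_b = LABEL_TO_CONCEPT.get(b)
    return concept_a is not None and concept_a == concept_b

# Naive linking merges what should stay apart and separates what belongs together.
wrongly_merged = naive_match("COW", "C.O.W.")
wrongly_split = not naive_match("peanuts", "groundnuts")
```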
If resources are marked up with semantically defined and machine-readable concepts, they can be linked and mashed up precisely, as we have seen in the example from the BBC. In this example we start with an AGRIS record on hazardous waste, which is indexed with AGROVOC. Already now we can easily link to material indexed with Eurovoc, here an example from EUR-Lex. If the UNBIS thesaurus were restructured into a concept scheme and published as LOD, related UN documents could be attached automatically by the machine.
How does this work? A resource is connected with each concept URI on the web. The concepts in the three vocabularies have the same literal and are connected with an owl:sameAs/exactMatch relationship. As we are speaking about thesauri and not ontologies, we purposely kept the choice of relation vague. The concepts could be matched with owl:sameAs or the terms could be matched with skos:exactMatch. A lot of discussion on this is ongoing.
One of the groundbreaking enterprises in this area is Thomson Reuters’ “Open Calais”. This is a web service that provides semantic markup for any unstructured text that you feed into their service. The service is free of charge. Why? I will show you later.
My team, in collaboration with the Indian Institute of Technology in Kanpur, is developing a similar service for our subject area.
We have here a text from 1964 about a plant protection issue, without a bibliographic record at hand.
Open Calais is very good in those areas in which they have their own elaborated concept scheme against which the texts are analyzed: “Places”, “Persons”, “Business Processes”, “Industry Terms”, but it is weak in the specific topic analysis, what they call “social tags”.
AgroTagger still lacks many of the sophisticated features of “Open Calais”, but is much, much better in the subject analysis of the text.
We will now try a live demo.
During the discussions on the AGROVOC model, we also did some software engineering. The result is the concept scheme workbench. Already now not only AGROVOC is on the workbench, but also the FAO Open Archive authority data. We can host any concept scheme.