My presentation from Drupalaton 2013 - http://drupalaton.hu/schedule#speaker-30
This session will focus on the implementation of semantic services (automatic content enhancement, autotagging, content recommendation, reasoning) based on linked data datasets using the integration of Drupal with Apache Stanbol.
During the presentation the audience will find out about:
main features of Apache Stanbol and its integration with Drupal
how to discover and use custom/domain specific Linked Data datasets with Apache Stanbol/Drupal
how to build an advanced semantic processing chain in Apache Stanbol that will automatically annotate Drupal entities
how to implement a content recommendation/reasoning feature for Drupal based on Apache Stanbol services.
Apache Stanbol is an open source software stack designed to provide a powerful semantic engine via RESTful services returning results as RDF (Resource Description Framework) and JSON. Unlike existing proprietary, commercially oriented solutions such as OpenCalais, Apache Stanbol is highly customizable and may be trained to provide semantic services for virtually any language.
Linked data based semantic annotation using Drupal and Apache Stanbol
1. Drupal and Apache Stanbol
LINKED DATA BASED SEMANTIC ANNOTATION
Gabriel Dragomir
Sunday, August 18, 13
2. The Semantic Web
Tim Berners-Lee:
‘‘The first step is putting data on the Web in a form
that machines can naturally understand, or
converting it to that form. This creates what I call a
Semantic Web – a Web of data that can be
processed directly or indirectly by machines.’’
3. What’s the hype?
Most organizations need to organize/analyze/relate
huge amounts of textual, unstructured, dissipated data
Examples:
keyword extraction from content: annotate abstracts
text categorization: organize big volumes of text based
on a thesaurus
media monitoring of tags: occurrences of a specific
keyword on social media channels
5. Linked data
Project started in 2007
Aimed at building the Web of Data by:
identifying open access data sets
converting them into RDF vocabularies
publish them as open access data sets
6. Linked data ecosystem
Linked Open Vocabularies (LOV): http://lov.okfn.org/dataset/lov/
Provides a conceptual map of the vocabularies
Various providers: libraries, governmental actors, NGOs
7. Linked data ecosystem
Where to find other data sets?
http://www.w3.org/2001/sw/wiki/SKOS/Datasets
Swoogle: http://swoogle.umbc.edu/
PoolParty: http://vocabulary.semantic-web.at
9. Semantic annotation
Creates specific metadata that enable new ways to
retrieve and aggregate information
Annotations are done based on a conceptual scheme,
an ontology (e.g. FOAF, Dublin Core)
For more on ontologies see: http://www.w3.org/wiki/Good_Ontologies
The annotations build semantic relationships: e.g.
rdf:type, owl:sameAs
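To make the two relationships named above concrete, here is a minimal sketch of an annotation expressed as RDF triples and printed as N-Triples. The local term IRI (http://example.org/taxonomy/term/42) is a hypothetical placeholder, and the choice of a FOAF person linked to DBpedia is only an illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Hypothetical IRI for a locally stored term (illustration only).
local_term = "http://example.org/taxonomy/term/42"
dbpedia_entity = "http://dbpedia.org/resource/Tim_Berners-Lee"

# Each annotation is one (subject, predicate, object) triple.
triples = [
    (local_term, RDF_TYPE, FOAF_PERSON),        # "this term is a person"
    (local_term, OWL_SAME_AS, dbpedia_entity),  # "this term is the same thing as the DBpedia entity"
]


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(triples))
```

The rdf:type triple classifies the term against an ontology, while the owl:sameAs triple is what ties local content to a Linked Data dataset.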
10. Semantic annotation
Most common uses:
Named Entity Linking: limited to recognizing entities of
type person, organization, place (e.g. OpenCalais)
Entityhub Linking: annotation based on vocabularies
with no limitations of entity types. Requires more
natural language processing prior to annotation.
11. Apache Stanbol on the fly
Here comes Apache Stanbol
A new approach:
modular semantic analysis of documents
processing components can be built for virtually any
language
flexible workflows via semantic annotation chains
any vocabulary (Linked Data, custom) can be used
12. From IKS to Apache Stanbol
IKS - Interactive Knowledge Stack for small to medium
CMS providers - EU funded consortium
An open source software stack written in Java
Goal: extract and process semantic data from
documents
Project undergoing incubation at Apache Foundation
http://stanbol.apache.org
13. Service oriented architecture
Stanbol is designed to offer service oriented integration
RESTful web services API returning RDF or JSON/
JSON-LD
Each component exposes an endpoint independently
Open Services Gateway initiative compliant (OSGi) via
Apache Felix and Apache Sling
Remote component management
14. Implementation
OSGi layer: Apache Felix and Apache Sling
Build environment: Apache Maven
RDF framework: Apache Clerezza
Triple store, reasoning engine: Apache Jena
Indexing and semantic search: Apache Solr
Content analysis/metadata extraction: Apache Tika
Natural language processing: Apache OpenNLP
16. Components
Semantic layer:
Enhancer, EntityHub, ContentHub
Enhancement engines: internal, 3rd party
User interfaces
Knowledge integration (rule sets, reasoners)
Storage integration
17. Content enhancement
Examples:
retrieve additional metadata for a piece of content
identify the language of a text
extract entities (persons, places, organizations)
create annotations to external sources
use 3rd party services for named entities recognition
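As a rough illustration of what consuming such enhancements could look like on the client side, the sketch below filters entity suggestions by confidence. The sample payload is hand-written and simplified, not captured from a real Stanbol instance; the property names (enhancer:entity-label, enhancer:entity-reference, enhancer:confidence) are modeled on Stanbol's enhancement vocabulary, but the exact JSON-LD shape is an assumption and should be checked against an actual response.

```python
import json

# Hand-written, simplified sample; NOT real Stanbol output.
SAMPLE = json.loads("""
{
  "@graph": [
    {"@id": "urn:enhancement-1",
     "enhancer:entity-label": "Paris",
     "enhancer:entity-reference": "http://dbpedia.org/resource/Paris",
     "enhancer:confidence": 0.92},
    {"@id": "urn:enhancement-2",
     "enhancer:entity-label": "Paris Hilton",
     "enhancer:entity-reference": "http://dbpedia.org/resource/Paris_Hilton",
     "enhancer:confidence": 0.31},
    {"@id": "urn:enhancement-3", "dc:language": "en"}
  ]
}
""")


def entity_suggestions(doc, min_confidence=0.5):
    """Return (label, IRI, confidence) tuples at or above the threshold, best first."""
    results = []
    for node in doc.get("@graph", []):
        ref = node.get("enhancer:entity-reference")
        if ref is None:
            continue  # not an entity annotation (e.g. the language annotation)
        confidence = float(node.get("enhancer:confidence", 0.0))
        if confidence >= min_confidence:
            results.append((node.get("enhancer:entity-label"), ref, confidence))
    return sorted(results, key=lambda r: r[2], reverse=True)


print(entity_suggestions(SAMPLE))
```

A confidence threshold is one simple way to decide which suggestions become tags automatically and which are only offered to an editor.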
18. Drupal meets Stanbol
Several modules implement RDF support allowing data
transport to Stanbol semantic annotations
Taxonomy system allows for complex annotation
Fieldable taxonomy terms allow for storage of complex
semantic data
19. User scenarios
Semantic indexing via Stanbol (SOLR yard)
Content enrichment with semantically related
information (documents, factual data, images etc.)
Tag as you type: dynamic annotation of text in editors
20. How it works
POST request sends content via REST API
content is processed by an enhancement chain
Returns JSON-LD, RDF/XML, RDF/JSON etc
JSON-LD - JavaScript Object Notation for Linked Data
a human readable and simple linked data transport
format
for best results an enhancement chain should do
language detection, tokenization, POS Tagging prior to
performing semantic annotation
http://drupalaton.jelastic.dogado.eu/stanbol/enhancer
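The request flow described on this slide can be sketched with Python's standard library. The base URL (http://localhost:8080/enhancer) assumes a locally running Stanbol launcher, and the chain name "semweb" in the usage note is hypothetical; the sketch only builds the request, so it runs without a Stanbol server.

```python
import urllib.request

# Assumed base URL of a local Stanbol launcher; adjust to your installation.
STANBOL_ENHANCER = "http://localhost:8080/enhancer"


def build_enhancer_request(text, chain=None, accept="application/json"):
    """Build a POST request that submits plain text for enhancement.

    chain:  optional name of an enhancement chain; when omitted, the
            default chain configured on the server is used.
    accept: desired response format, e.g. "application/json" (JSON-LD)
            or "application/rdf+xml".
    """
    url = STANBOL_ENHANCER
    if chain:
        url = f"{url}/chain/{chain}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8", "Accept": accept},
        method="POST",
    )


req = build_enhancer_request("Tim Berners-Lee invented the World Wide Web.")
print(req.get_method(), req.full_url)

# To actually send it (requires a reachable Stanbol instance):
# with urllib.request.urlopen(req) as resp:
#     body = resp.read().decode("utf-8")
```

Calling build_enhancer_request(text, chain="semweb") would target a named chain instead of the default one, which is how a custom chain with language detection, tokenization and POS tagging would be selected.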
22. Drupal distribution: IKS CE
IKS CE distribution - Wolfgang Ziegler (fago),
Stéphane Corlosquet (scor)
Components:
Search API Stanbol
VIE.js - semantic annotation UI
https://drupal.org/project/iksce
http://drupal.org/project/vie
http://drupal.org/project/search_api_stanbol
23. Search API Stanbol
enables the indexing of Drupal entities such as nodes,
users, taxonomy terms, files, etc. in Stanbol EntityHub.
data sent as RDF
data can be mashed up with data from other sources
(Managed Sites, Remote Sites)
24. VIE.js
“Vienna IKS Editables”
JavaScript library for implementing decoupled Content
Management Systems and semantic interaction in web
applications.
25. Monolithic vs Decoupled Content Management
Monolithic vs Decoupled Content Management Systems
source: Henri Bergius - http://bergie.iki.fi
26. Demo setup
we store Drupal entities in a SOLR index
annotations are to be made based on:
DBpedia - bundled with Apache Stanbol
a custom vocabulary of terms related to semantic
web - Social Semantic Web Thesaurus
SemWeb is imported as a SOLR index into Apache
Stanbol
27. Custom vocabularies
Social Semantic Web Thesaurus
1959 concepts related to semantic web
Author: Andreas Blumauer
http://vocabulary.semantic-web.at/semweb.html
http://vocabulary.semantic-web.at/semweb/8.visual
28. Demo
index Drupal entities in Apache Stanbol
retrieve annotated entities via REST API
annotate entities using DBpedia and SemWeb indexes
edit Drupal entities and annotate on the fly
retrieve linked data tag recommendations
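For the last demo step, one possible way to fetch tag recommendations is to query an Entityhub site by label prefix. The sketch below builds such a request without sending it. It assumes a local Stanbol at http://localhost:8080, a referenced site named "dbpedia", and that the site's find service accepts form-encoded name and limit parameters with wildcard matching; verify these against your Stanbol version before relying on them.

```python
import urllib.parse
import urllib.request

# Assumed base URL of the Entityhub on a local Stanbol launcher.
ENTITYHUB = "http://localhost:8080/entityhub"


def build_find_request(prefix, site="dbpedia", limit=5):
    """Build a label-prefix lookup against one Entityhub site.

    prefix: the characters typed so far; a trailing "*" wildcard is appended.
    site:   identifier of the site (index) to search; "dbpedia" is assumed here.
    limit:  maximum number of suggestions to request.
    """
    form = urllib.parse.urlencode({"name": prefix + "*", "limit": limit})
    return urllib.request.Request(
        f"{ENTITYHUB}/site/{site}/find",
        data=form.encode("ascii"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


req = build_find_request("sem")
print(req.full_url, req.data)
```

Pointing the site argument at a custom index (for example one built from the Social Semantic Web Thesaurus) would return recommendations from that vocabulary instead of DBpedia.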