Talk at Insiders Technologies, 21.01.2010. It covers publishing RDF data with D2R Server, linking the data to obtain Linked Data, querying the data with SPARQL via SQUIN, and finally annotating text with this data using RDFa in Epiphany.
1. Using the Web of Data for Information Extraction
Insiders, January 2010
Keywords: scoobie, sparql, rdfa, D2R server, rdf, squin, epiphany, Linked Data, OBIE
Benjamin Adrian
http://www.dfki.uni-kl.de/~adrian
2. Are you still surfing ...
4. A simple question ...
What are the cities of the universities in Rhineland-Palatinate, and
what is the unemployment rate of these cities?
5. A simple question ...
What are the cities of the universities in Rhineland Palatinate and
what is the unemployment rate of these cities?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX eurostat: <http://www4.wiwiss.fu-berlin.de/eurostat/resource/eurostat/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX dbpedia_cat: <http://dbpedia.org/resource/Category:>
SELECT ?dbpcity ?cityName ?ur WHERE {
  ?uni skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
       dbpedia:city ?dbpcity .
  ?dbpcity owl:sameAs ?statcity .
  ?statcity rdfs:label ?cityName ;
            eurostat:unemployment_rate_total ?ur
}
http://www.w3.org/TR/rdf-sparql-query/
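A prefixed name in a SPARQL query is resolved by concatenating the namespace from its PREFIX declaration with the local part. The following minimal Python sketch illustrates that expansion for some of the prefixes used in the query. The `expand` helper is hypothetical and only covers the simple `prefix:local` form, not the full SPARQL grammar.

```python
# Namespace table copied from the PREFIX declarations of the query.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "eurostat": "http://www4.wiwiss.fu-berlin.de/eurostat/resource/eurostat/",
    "dbpedia": "http://dbpedia.org/ontology/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full IRI by plain concatenation.

    Hypothetical helper for illustration; escapes and other corner
    cases of the SPARQL grammar are ignored.
    """
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


if __name__ == "__main__":
    for name in ("owl:sameAs", "dbpedia:city", "eurostat:unemployment_rate_total"):
        print(name, "->", expand(name))
```

Because expansion is pure concatenation, a namespace that stops one character too early or too late silently yields a different IRI, and the query then matches nothing.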
6. … and its answer.
dbpcity                              cityName  ur
http://dbpedia.org/resource/Koblenz  Koblenz   8.8
http://dbpedia.org/resource/Trier    Trier     7.3
Data Sources:
http://epp.eurostat.ec.europa.eu http://wiki.dbpedia.org
http://www4.wiwiss.fu-berlin.de/eurostat/
Query Engine: SQUIN - Query the Web of Linked Data
http://squin.sourceforge.net/
7. So much data out there, too much?
8. What data do you have?
9. Are you still surfing ...
10. Agenda
To use the Web of Data for information
extraction, you have to understand its basics.
● RDF on one slide
● Publish data in RDF with D2R Server
● Publish RDF as Linked Data
● Query Linked Data with SPARQL and Squin
● Use RDF for information extraction
● Bring Linked Data to text via RDFa
11. Wouldn't this be nice.
[Diagram: Data]
12. Wouldn't this be nice.
[Diagram: Data, Text, User-defined Filter, Extraction Pipeline, Extraction Results; edge label: enrich]
13. Wouldn't this be nice.
[Diagram: Data, Text, annotated text, User-defined Filter, Extraction Pipeline, Extraction Results; edge labels: annotate, enrich]
14. Wouldn't this be nice.
[Diagram: Data, Text, annotated text, User-defined Filter, Extraction Pipeline, Extraction Results; edge labels: annotate, populate, enrich]
21. RDF data is graph data.
22. Publishing relational data in RDF
23. Publishing relational data in RDF
D2R Server - Publishing Relational Databases on
the Semantic Web
http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/
Two small command line calls: first generate the mapping, then start the server with it.
./generate-mapping \
    -o mydatabase.n3 \
    -b http://projects.dfki.uni-kl.de/mydatabase/ \
    jdbc:mysql://localhost:3306/mydatabase
./d2r-server \
    -p 80 \
    -b http://projects.dfki.uni-kl.de/mydatabase/ \
    mydatabase.n3
24. Linked Data: Linking RDF data from different sources
[Diagram: Customer DB, Employees DB, Project DB, DBpedia]
How to interlink these datasets?
25. Linked Data: Linking RDF data from different sources
Linked Data Principles (TimBL, 2006)
1. Use URIs as names for things
(e.g., http://dbpedia.org/resource/Berlin)
2. Use HTTP-URIs so that people can look up those names
3. Provide useful information in RDF when someone looks up a URI
4. Include links to other URIs to enable discovery of more information
Example:
<http://dbpedia.org/resource/Berlin>
    owl:sameAs opencyc:en/CityOfBerlinGermany ;
    owl:sameAs opencyc:en/Berlin_StateGermany ;
    owl:sameAs <http://sws.geonames.org/2950159/> ;
    owl:sameAs <http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin> ;
    owl:sameAs freebase:http://dbpedia.org/resource/Berlin .
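Principles 2 and 3 mean that a client can send a plain HTTP GET for a URI and ask for RDF through content negotiation. The sketch below shows one way to do this with Python's standard library. The helper names are hypothetical, and the `application/rdf+xml` media type is an assumption about what a given server offers.

```python
import urllib.request


def rdf_request(uri, accept="application/rdf+xml"):
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    return urllib.request.Request(uri, headers={"Accept": accept})


def fetch_rdf(uri, timeout=10):
    """Dereference the URI and return the response body as text.

    Needs network access; urllib follows HTTP redirects on its own.
    """
    with urllib.request.urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the request, so this runs without a network connection.
    request = rdf_request("http://dbpedia.org/resource/Berlin")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling `fetch_rdf` on one of the `owl:sameAs` targets would then be the "discovery of more information" step from principle 4.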
26. SPARQL: Querying RDF data
SPARQL - the RDF query language.
In contrast to SQL, its data model is not set-oriented but graph-oriented.
Some Examples:
Resulting in tuples:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?interest ?friend WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest . }
Resulting in a graph:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { ?friend foaf:interest ?interest } WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest . }
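The difference between the two result forms can be sketched in Python over a toy in-memory graph. This is not a SPARQL engine: the matcher is hard-wired to the two triple patterns above, and all `ex:` names are made-up sample data.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
TIMBL = "http://www.w3.org/People/Berners-Lee/card#i"

# Toy graph as a set of (subject, predicate, object) triples; ex: names are invented.
GRAPH = {
    (TIMBL, FOAF + "knows", "ex:alice"),
    (TIMBL, FOAF + "knows", "ex:bob"),
    ("ex:alice", FOAF + "interest", "ex:rdf"),
    ("ex:bob", FOAF + "interest", "ex:sparql"),
    ("ex:carol", FOAF + "interest", "ex:sql"),
}


def match(graph):
    """Yield one variable binding per match of the two triple patterns."""
    for s, p, friend in graph:
        if s == TIMBL and p == FOAF + "knows":
            for s2, p2, interest in graph:
                if s2 == friend and p2 == FOAF + "interest":
                    yield {"friend": friend, "interest": interest}


def select(graph):
    """SELECT ?interest ?friend: the result is a table of tuples."""
    return sorted((b["interest"], b["friend"]) for b in match(graph))


def construct(graph):
    """CONSTRUCT: the result is a new graph built from the template triple."""
    return {(b["friend"], FOAF + "interest", b["interest"]) for b in match(graph)}


if __name__ == "__main__":
    print(select(GRAPH))
    print(construct(GRAPH))
```

Both functions share the same pattern matching; only the shape of the result differs, which mirrors how SELECT and CONSTRUCT share a WHERE clause.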
27. SPARQL: Query Linked Data from different sources
[Diagram: Customer DB, Employees DB, Project DB, DBpedia]
How to access these datasets with a single SPARQL query?
28. SPARQL: Query Linked Data from different sources
[Diagram: Customer DB, Employees DB, Project DB, DBpedia, four D2R Server boxes, SQUIN]
SQUIN: Query the Web of Linked Data
http://squin.sourceforge.net/
SQUIN follows a link traversal approach over HTTP URIs.
Remember:
SELECT DISTINCT ?c ?cityName ?ur
WHERE {
  ?u skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
     dbpedia:city ?c .
  ?c owl:sameAs [ rdfs:label ?cityName ;
                  eurostat:unemployment_rate_total ?ur ]
}
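The idea of link traversal can be sketched as a breadth-first walk: look up a URI, keep the triples that come back, and queue every new URI they mention. The Python sketch below is a simplification and not SQUIN's actual algorithm. The `WEB` dictionary is a hypothetical stand-in for HTTP lookups, and the shortened `dbpedia:Trier` and `eurostat:Trier` names are illustrative.

```python
from collections import deque

# Hypothetical stand-in for the Web: URI -> triples served when it is looked up.
WEB = {
    "dbpedia:Trier": [("dbpedia:Trier", "owl:sameAs", "eurostat:Trier")],
    "eurostat:Trier": [
        ("eurostat:Trier", "rdfs:label", "Trier"),
        ("eurostat:Trier", "eurostat:unemployment_rate_total", "7.3"),
    ],
}


def is_uri(term):
    """Toy convention: URIs are prefixed names, literals contain no colon."""
    return ":" in term


def lookup(uri):
    """Simulate an HTTP lookup; unknown URIs simply yield no data."""
    return WEB.get(uri, [])


def traverse(seed_uris, max_lookups=50):
    """Look up URIs breadth-first, collect triples, and follow newly seen URIs."""
    queue = deque(seed_uris)
    seen = set(seed_uris)
    triples = set()
    lookups = 0
    while queue and lookups < max_lookups:
        uri = queue.popleft()
        lookups += 1
        for triple in lookup(uri):
            triples.add(triple)
            # Follow subject and object links only.
            for term in (triple[0], triple[2]):
                if is_uri(term) and term not in seen:
                    seen.add(term)
                    queue.append(term)
    return triples


if __name__ == "__main__":
    for triple in sorted(traverse(["dbpedia:Trier"])):
        print(triple)
```

Starting from the DBpedia URI alone, the walk reaches the statistics data through the `owl:sameAs` link, which is what lets one query span several sources.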
29. Using RDF and Linked Data for Information Extraction
[Diagram: User, Linked Data Query, Text, Extraction Pipeline, Result Graph; edge labels: asks question, about, to, answers]
30. Using RDF and Linked Data for Information Extraction
What data do we have?
Example RDF data
<http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
    rdf:type foaf:Document ;
    dc:creator dblp_author:Markus_Ebbecke ;
    dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .
Classes: foaf:Document, foaf:Person
Instances: .../SchulzEGAAD09, .../Markus_Ebbecke
Datatype Properties: dc:title, foaf:name, foaf:firstName, foaf:surName
Object Properties: dc:creator, foaf:knows
Literals: "Markus", "Ebbecke", "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"
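Sorting the terms of an RDF data set into these five groups can be sketched in Python as follows. The rule "a property with a literal value is a datatype property" is a simplifying assumption, since real vocabularies usually declare this explicitly. The `Literal` class and the shortened `pub:` name are hypothetical.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class Literal:
    """Marks a plain literal value, as opposed to a resource name."""
    value: str


# The example triples, with the long publication URI shortened to pub:SchulzEGAAD09.
TRIPLES = [
    ("pub:SchulzEGAAD09", RDF_TYPE, "foaf:Document"),
    ("pub:SchulzEGAAD09", "dc:creator", "dblp_author:Markus_Ebbecke"),
    ("pub:SchulzEGAAD09", "dc:title",
     Literal("Seizing the Treasure: Transferring Knowledge in Invoice Analysis")),
]


def vocabulary(triples):
    """Group the terms of a triple list by the role they play."""
    vocab = {
        "classes": set(),
        "instances": set(),
        "datatype_properties": set(),
        "object_properties": set(),
        "literals": set(),
    }
    for subject, predicate, obj in triples:
        vocab["instances"].add(subject)
        if predicate == RDF_TYPE:
            vocab["classes"].add(obj)
        elif isinstance(obj, Literal):
            vocab["datatype_properties"].add(predicate)
            vocab["literals"].add(obj.value)
        else:
            vocab["object_properties"].add(predicate)
            vocab["instances"].add(obj)
    return vocab


if __name__ == "__main__":
    for role, terms in vocabulary(TRIPLES).items():
        print(role, sorted(terms))
```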
31. SCOOBIE: Domain Adaption
[Diagram: Structured Data, Text Corpus Data, Vocabulary Data, Instance Data, Patterns and Gazetteers Data; phases: Data Preprocessing & Learning (offline), Information Extraction (online)]
32. SCOOBIE: Eco System
[Diagram: Index, Domain Knowledge, Models, Text Corpus, Training Corpus, Session Data, Instances, Ontology, Models, Patterns + Gazetteers; tasks: Preprocess, Train, Extract; API; labels I, O, I]
33. SCOOBIE: OBIE Pipeline
● Normalization: Text Extraction, Language Detection
● Segmentation: Tokenization, Sentence Extraction, POS-Tagging
● Symbolization: Named Entity Recognition, Structured Entity Recognition, Noun Phrase Chunking, Symbol Recognition
● Instantiation: Instance Recognition, Instance Disambiguation, Chunk Classification
● Contextualization: Fact Extraction, Fact Selection
● Population: Query Answering
34. Used Machine Learning Models
Semi-Supervised Learning
● CRF-based Noun Phrase Chunker
Supervised Learning
● Gazetteer matching statistics (Named Entity Recognition)
● Regex matching statistics (Structured Entity Recognition)
Unsupervised or Instance-based Learning
● TF/IDF-based instance re-ranking (Instance Disambiguation)
● K-Nearest-Neighbor chunk classifier (Chunk Classification)
● Spreading Activation-based fact ranking (Fact Selection)
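To give an impression of TF/IDF-based re-ranking for instance disambiguation, the Python sketch below ranks candidate instances by the cosine similarity between their descriptions and the surrounding text. It illustrates the general technique only and makes no claim about how SCOOBIE implements it. The candidate URIs, descriptions, and function names are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: tf * idf} dict per tokenized, non-empty document."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rerank(context, candidates):
    """Order candidate URIs by similarity of their description to the context."""
    uris = list(candidates)
    docs = [candidates[u].lower().split() for u in uris] + [context.lower().split()]
    vectors = tfidf_vectors(docs)
    scores = {u: cosine(vec, vectors[-1]) for u, vec in zip(uris, vectors)}
    return sorted(uris, key=lambda u: scores[u], reverse=True)


if __name__ == "__main__":
    candidates = {
        "ex:Paris_France": "Paris capital city of France",
        "ex:Paris_Texas": "Paris city in Texas United States",
    }
    print(rerank("Paris is the capital of France", candidates))
```

The ambiguous word itself occurs in every document, so its IDF weight is zero and the ranking is decided by the remaining context words.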
35. Used Machine Learning: Conditional Random Field
CRFs are sequence taggers:
Train it with:
  Bill     CAPITALIZED  noun
  slept    LOWERCASE    non-noun
  here     LOWERCASE    non-noun
Test it with:
  He       CAPITALIZED
  visited  LOWERCASE
  London   CAPITALIZED
CRF results:
  noun
  non-noun
  non-noun
MALLET - MAchine Learning for LanguagE Toolkit
http://mallet.cs.umass.edu/
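The training and test lines above pair each token with an orthographic feature and, for training, a label. A small Python sketch of how such lines could be produced is shown below. It only prepares the input and does not train a CRF; the function names are hypothetical and the one-feature scheme is taken from the example.

```python
def token_feature(token):
    """Return the orthographic feature used in the example."""
    return "CAPITALIZED" if token[:1].isupper() else "LOWERCASE"


def training_lines(tokens, labels):
    """One 'token FEATURE label' line per token of a labeled sequence."""
    return [f"{tok} {token_feature(tok)} {lab}" for tok, lab in zip(tokens, labels)]


def unlabeled_lines(tokens):
    """One 'token FEATURE' line per token of a sequence to be tagged."""
    return [f"{tok} {token_feature(tok)}" for tok in tokens]


if __name__ == "__main__":
    print("\n".join(training_lines(["Bill", "slept", "here"],
                                   ["noun", "non-noun", "non-noun"])))
    print()
    print("\n".join(unlabeled_lines(["He", "visited", "London"])))
```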
36. Bringing Linked Data to Text
Annotate plain text or HTML with RDF data.
I'm working at DFKI.
RDFa offers an HTML extension:
I'm working at
<span about="dbpedia:DFKI" property="rdfs:label">
DFKI</span>
Now let's generate RDFa automatically ...
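As a rough idea of what automatic generation involves, the Python sketch below wraps every whole-word occurrence of a known label in an RDFa span like the one shown above. It assumes plain-text input and a single label-to-resource pair; the `annotate` function is hypothetical and is not Epiphany's implementation.

```python
import html
import re


def annotate(text, label, resource, prop="rdfs:label"):
    """Wrap each whole-word occurrence of label in an RDFa span."""
    pattern = re.compile(r"\b" + re.escape(label) + r"\b")
    parts = []
    last = 0
    for m in pattern.finditer(text):
        # Escape the untouched text between matches for safe HTML output.
        parts.append(html.escape(text[last:m.start()], quote=False))
        parts.append(
            f'<span about="{html.escape(resource)}" property="{html.escape(prop)}">'
            f"{html.escape(m.group(), quote=False)}</span>"
        )
        last = m.end()
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)


if __name__ == "__main__":
    print(annotate("I'm working at DFKI.", "DFKI", "dbpedia:DFKI"))
```

A real pipeline would first have to decide which resource a text span refers to; this sketch covers only the final markup step.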
37. Do you remember?
[Diagram: Data, Text, annotated text, User-defined Filter, Extraction Pipeline, Extraction Results; edge labels: annotate, populate, enrich]
38. RDF Epiphany
Epiphany takes the original webpage …
39. RDF Epiphany
Epiphany takes the original webpage …
and SCOOBIE initialized with an RDF data set …
40. RDF Epiphany
Epiphany takes the original webpage …
and SCOOBIE initialized with an RDF data set …
It extracts RDF information from text and annotates it as RDFa …
41. RDF Epiphany
Epiphany takes the original webpage …
and SCOOBIE initialized with an RDF Linked Data set …
It extracts RDF information from text and annotates it as RDFa …
Clicking on RDFa annotations opens further information from the Linked Data set.
42. RDF Epiphany
At a glance
● Epiphany is a free web service.
● Epiphany uses SCOOBIE.
● Epiphany can be initialized with any RDF Linked Data set.
● Epiphany generates an RDF document about a web page.
● Epiphany annotates RDF as RDFa in the web page.
http://projects.dfki.uni-kl.de/epiphany/
43. Summary
[Diagram: Customer DB, Employees DB, Project DB, DBpedia, four D2R Server boxes, SQUIN, Text, annotated text, User-defined Filter, Extraction Pipeline, Extraction Results; edge labels: annotate, populate, enrich]
44. Outlook
[Diagram: Customer DB, Employees DB, Project DB, DBpedia, four D2R Server boxes, SQUIN, E-Mail, annotated E-Mail, User-defined Filter, Extraction Pipeline, Extraction Results; edge labels: annotate, populate, enrich]
45. Thank you!