Linked Data (LD) datasets (e.g., DBpedia, Freebase) are used in many knowledge extraction tasks due to the wide variety of domains they cover. Unfortunately, many of these datasets do not provide a description for their properties and classes, which limits users’ ability to understand, reuse, or enrich them. This work attempts to fill part of this gap by presenting an unsupervised approach to discover syntactic patterns in the properties used in LD datasets. This approach produces a content patterns database generated from the textual data (content) of properties, which describes the syntactic structures that each property has. Our analysis enables (i) a human understanding of syntactic patterns for properties in an LD dataset, and (ii) a structural description of properties that facilitates their reuse or extension. Results over the DBpedia dataset also show that our approach enables (iii) the detection of data inconsistencies, and (iv) the validation and suggestion of new values for a property. We also outline how the resulting database can be exploited in several information extraction use cases.
1. Emir Muñoz
Fujitsu (Ireland) Limited
National University of Ireland Galway
LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014
http://bit.ly/1xYTR6Z
(@emir_munoz)
4. Let’s run the following SPARQL query over the endpoint…
select distinct ?obj where
{?sub <http://dbpedia.org/property/isbn> ?obj}
The endpoint response is a table with the values for the isbn property:
0
71090
6176526
2
2.7073
140043853
1107020697
2940013968264
0978-02-02+02:00
http://dbpedia.org/resource/N/a
"?"@en
"ISBN 0-312-85182-0"@en
"See text"@en
"various"@en
"ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
"ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
"The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
"-2.0"^^<http://dbpedia.org/datatype/second>
"TBA"@en
"not available"@en
"[[#Bibliography"@en
And some more ...
So, what is the correct range for this property?
5. LOV Statistics (as of July 7th, 2014):
446 vocabularies
10 classes and 20 properties on average
range of isbn is http://schema.org/Text
6. …but still, is this what I’m looking for? What is the syntax?
7. Property: apoapsis
Etymology
apo- + apsis
Noun
apoapsis (plural apoapsides)
(astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.
[http://en.wiktionary.org/wiki/apoapsis]
[Figure labels: Earth, Satellite]
dbr:17049_Miron dbo:apoapsis 4.01288e+11
9. <subject, predicate, object>
Property: <http://dbpedia.org/property/date>
Lerman et al. (JAIR 2003)
First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
1488-07-28+02:00
1982-05-23+02:00
2007-04-11+02:00
Second column: [ALPHA<space>NUM] (plain literal + lang)
"September 2012"@en
"August 2012"@en
"July 2009"@en
Third column: [--NUM-NUM+NUM:NUM] (typed literal)
"--08-26+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
"--01-24+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
"--06-11+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
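The bracketed patterns for the date values can be approximated with a very small tokenizer. The following is a minimal sketch, not the authors' implementation: the function name `coarse_pattern` and the regular expression are assumptions made for illustration, and the code looks only at the lexical form of a value (language tags and datatypes are ignored).

```python
import re

# Digit runs, letter runs, whitespace runs, or any other single character.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|.")

def coarse_pattern(value):
    """Map a lexical value to a coarse pattern such as [NUM-NUM-NUM+NUM:NUM]."""
    parts = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            parts.append("NUM")
        elif token.isalpha():
            parts.append("ALPHA")
        elif token.isspace():
            parts.append("<space>")
        else:
            parts.append(token)  # punctuation is kept as a literal
    return "[" + "".join(parts) + "]"

for value in ["1488-07-28+02:00", "September 2012", "--08-26+02:00"]:
    print(value, "->", coarse_pattern(value))
```

Each of the three sample values is taken from one of the columns shown on this slide.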
10. Consider the set of content patterns.
Lerman et al. (JAIR 2003)
More specific categories
For the input set:
That generates the following patterns:
Values are decomposed into tokens, and each token is represented by a syntactic class.
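A rough sketch of how tokens could be mapped to classes at several levels of specificity is shown below. The class names are the ones that appear in the patterns on these slides, but the numeric cut-offs (`SMALL_MAX`, `MEDIUM_MAX`), the ordering of the classes, and the helper names `token_classes` and `pattern` are assumptions for illustration; the slides do not state the exact definitions.

```python
import re

NUMBER = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
# Numbers (integer, decimal, scientific), alphanumeric runs, or single symbols.
TOKEN_RE = re.compile(NUMBER + r"|[^\W_]+|[^\w\s]")

SMALL_MAX = 99      # assumed upper bound for SMALL_NUMBER
MEDIUM_MAX = 9999   # assumed upper bound for MEDIUM_NUMBER

def token_classes(token):
    """Return the classes of a token, most specific first."""
    if re.fullmatch(r"\d+", token):
        n = int(token)
        if n <= SMALL_MAX:
            return ["SMALL_NUMBER", "NUMBER"]
        if n <= MEDIUM_MAX:
            return ["MEDIUM_NUMBER", "NUMBER"]
        return ["LARGE/FLOAT_NUMBER", "NUMBER"]
    if re.fullmatch(NUMBER, token):
        return ["LARGE/FLOAT_NUMBER", "NUMBER"]
    if token.isalpha():
        if token.islower():
            return ["ALL_LOWERCASE", "ALPHA", "ALPHANUMERIC"]
        if token.isupper():
            return ["ALL_UPPERCASE", "ALPHA", "ALPHANUMERIC"]
        return ["ALPHA", "ALPHANUMERIC"]
    if token.isalnum():
        return ["ALPHANUMERIC"]
    return [token]  # punctuation stays literal

def pattern(value, level=0):
    """Pattern of a value; level 0 picks the most specific class per token."""
    classes = []
    for token in TOKEN_RE.findall(value):
        options = token_classes(token)
        classes.append(options[min(level, len(options) - 1)])
    return " ".join(classes)

print(pattern("1986-02-14"))           # most specific level
print(pattern("1986-02-14", level=1))  # one step more general
```

Raising `level` generalizes every token by one step, which is one simple way to obtain several patterns of different specificity for the same value.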
11. DBpedia version 3.9
2.4 billion RDF triples
53,230 properties
Split
Method
19.25% plain literals
18.02% typed literals
62.73% without lang or datatype (xsd:string)
12. For the apoapsis example, we extracted one pattern:
http://dbpedia.org/ontology/apoapsis LARGE/FLOAT_NUMBER 1.0
And we also found some other related properties:
http://dbpedia.org/ontology/Planet/apoapsis LARGE/FLOAT_NUMBER 1.0
http://dbpedia.org/ontology/Spacecraft/apoapsis LARGE/FLOAT_NUMBER 1.0
http://dbpedia.org/property/apoapsis NUMBER 0.9230769230769231
http://dbpedia.org/property/apoapsis LARGE/FLOAT_NUMBER 0.75213675
For the date example, we extracted 7 patterns:
http://dbpedia.org/property/date -- SMALL_NUMBER - SMALL_NUMBER 0.2
http://dbpedia.org/property/date ALPHANUMERIC MEDIUM_NUMBER 0.166
http://dbpedia.org/property/date ALPHANUMERIC 2012 0.032
http://dbpedia.org/property/date ALPHANUMERIC.ALPHANUMERIC 0.012
And more …
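The slides do not define how the number attached to each pattern is computed. One plausible reading, assumed here, is the fraction of a property's values that match the pattern; since a value can match patterns at several levels of generality, the scores of one property need not sum to 1. The sketch below illustrates that reading with two hypothetical levels (`specific`, `generic`) and an assumed digit-count cut-off.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")

def generic(value):
    """General level: digit runs become NUMBER, letter runs ALPHANUMERIC."""
    return " ".join(
        "NUMBER" if t.isdigit() else "ALPHANUMERIC" if t.isalpha() else t
        for t in TOKEN_RE.findall(value)
    )

def specific(value):
    """More specific level; the digit-count cut-off is an assumption."""
    out = []
    for t in TOKEN_RE.findall(value):
        if t.isdigit():
            out.append("SMALL_NUMBER" if len(t) <= 2 else "MEDIUM_NUMBER")
        elif t.isalpha():
            out.append("ALPHANUMERIC")
        else:
            out.append(t)
    return " ".join(out)

def pattern_scores(values, levels=(specific, generic)):
    """Share of values matching each pattern, counted once per value."""
    counts = Counter()
    for v in values:
        for p in {level(v) for level in levels}:
            counts[p] += 1
    return {p: n / len(values) for p, n in counts.items()}

# Lexical forms of five dbp:date values shown earlier in the slides.
values = ["1488-07-28+02:00", "1982-05-23+02:00",
          "September 2012", "August 2012", "July 2009"]
scores = pattern_scores(values)
for p, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{s:.2f}  {p}")
```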
13. The user has this value: “2014-10-20”.
Which property can they use?
dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.
What is the property dbp:admCtrOf used for?
"town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
"town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
"town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)
It is used to declare Administrative Control Of
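The property search for a given value could look roughly like the sketch below: the value is tokenized and compared against every stored pattern, and matching properties are ranked by score. The rows in `PATTERN_DB` are copied from the patterns shown on these slides (with prefixes abbreviated), so only a tiny subset of the real database is covered; the class definitions in `token_matches` and the function names are assumptions.

```python
import re

NUMBER = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
TOKEN_RE = re.compile(NUMBER + r"|[^\W\d_]+|[^\w\s]")

# (property, pattern, score) rows copied from the slides.
PATTERN_DB = [
    ("vcard:bday", "NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("vcard:bday", "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("dbp:date", "ALPHANUMERIC MEDIUM_NUMBER", 0.166),
    ("dbo:apoapsis", "LARGE/FLOAT_NUMBER", 1.0),
]

def token_matches(token, cls):
    """Check one token against one class; digit-count bounds are assumed."""
    is_int = re.fullmatch(r"\d+", token) is not None
    if cls == "NUMBER":
        return is_int
    if cls == "SMALL_NUMBER":
        return is_int and len(token) <= 2
    if cls == "MEDIUM_NUMBER":
        return is_int and 3 <= len(token) <= 4
    if cls == "LARGE/FLOAT_NUMBER":
        is_number = re.fullmatch(NUMBER, token) is not None
        return is_number and not (is_int and len(token) <= 4)
    if cls == "ALPHANUMERIC":
        return token.isalnum()
    return token == cls  # anything else is a literal token

def find_properties(value, db):
    """Return (property, best score) pairs whose patterns match the value."""
    tokens = TOKEN_RE.findall(value)
    best = {}
    for prop, pat, score in db:
        classes = pat.split()
        if len(classes) == len(tokens) and all(map(token_matches, tokens, classes)):
            best[prop] = max(score, best.get(prop, 0.0))
    return sorted(best.items(), key=lambda kv: -kv[1])

print(find_properties("2014-10-20", PATTERN_DB))
```

With the full database, the same lookup would be expected to return the date-like properties listed on this slide.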
14. Check for atypical values (outliers)
Close look into the most (in)frequent patterns
Possible errors during automatic extraction
For the dbp:isbn property we can find the following values:
"summer or autumn 380"@en
"Late November"@en
"Fall 1040"@en
680
"December, 67 BC"@en
"April-July 1799"@en
http://dbpedia.org/resource/New_Year's_Day
http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
"New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en
Are they … or … values?
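Checking for atypical values can be sketched as a frequency test over patterns: values whose pattern covers only a small share of a property's values are flagged for review. The threshold `min_share`, the helper names, and the toy sample (nine date values from an earlier slide plus two values from this one) are assumptions made for illustration.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|.")

def coarse_pattern(value):
    """Coarse pattern: NUM, ALPHA, <space>, punctuation kept literal."""
    parts = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            parts.append("NUM")
        elif token.isalpha():
            parts.append("ALPHA")
        elif token.isspace():
            parts.append("<space>")
        else:
            parts.append(token)
    return "[" + "".join(parts) + "]"

def rare_values(values, min_share=0.2):
    """Return values whose pattern covers less than min_share of all values."""
    patterns = [coarse_pattern(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total < min_share]

sample = [
    "1488-07-28+02:00", "September 2012", "--08-26+02:00",
    "1982-05-23+02:00", "August 2012", "--01-24+02:00",
    "2007-04-11+02:00", "July 2009", "--06-11+02:00",
    "680", "Late November",
]
outliers = rare_values(sample)
print(outliers)
```

A flagged value is only a candidate: it may be an extraction error or a legitimate but unusual way of writing the value, so a close look at the infrequent patterns is still needed.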
15. A vCard may be annotated with the hCard microformat
LD4IE Challenge 2014
E-mail: user1@domain.com
Given name: John
Surname: Snow
Birthday: 1986-02-14
We can use our database to extract and validate the email:
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE 0.82
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com 0.69
vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE 0.54
vcard:email mailto : ALPHA @ ALPHANUMERIC . com 0.46
vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE 0.36
…and also the birthday:
vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
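One way to use such patterns for extraction and validation is to compile each pattern into a regular expression. The sketch below does this for the two vcard:bday patterns shown on this slide; the mapping from class names to regular expressions (`CLASS_REGEX`) and the function names are assumptions, since the slides do not define the classes precisely.

```python
import re

# Assumed translation of token classes into regular expressions.
CLASS_REGEX = {
    "NUMBER": r"\d+",
    "SMALL_NUMBER": r"\d{1,2}",
    "MEDIUM_NUMBER": r"\d{3,4}",
}

# The two vcard:bday patterns and scores from the slide.
BDAY_PATTERNS = [
    ("NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
]

def compile_pattern(pattern):
    """Turn a space-separated content pattern into a compiled regex."""
    parts = [CLASS_REGEX.get(tok, re.escape(tok)) for tok in pattern.split()]
    # Lookarounds stop a match from starting or ending inside a longer number.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")

def extract(text, patterns):
    """Return (matched text, pattern, score) for every match in the text."""
    hits = []
    for pat, score in patterns:
        for m in compile_pattern(pat).finditer(text):
            hits.append((m.group(0), pat, score))
    return hits

def is_valid(value, patterns):
    """A value is accepted if at least one pattern matches it entirely."""
    return any(compile_pattern(p).fullmatch(value) for p, _ in patterns)

CARD_TEXT = "Given name: John\nSurname: Snow\nBirthday: 1986-02-14"
hits = extract(CARD_TEXT, BDAY_PATTERNS)
print(hits)
print(is_valid("1986-02-14", BDAY_PATTERNS))
```

The same approach would apply to the vcard:email patterns once classes such as ALPHA and PUNCTUATION are given concrete definitions.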
16. Extraction of lexico-syntactic patterns from LD datasets
Different use cases:
Search for properties
Validation of values
Information extraction based on patterns
Future work:
Consistency analysis of knowledge bases
Extension of patterns to cover other knowledge bases
Among others
500,000 content patterns