WikiTables DERI Talk
1. Extending DBpedia (LOD) using WikiTables
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org
2. Linked Open Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
October 12, 2012 -- E. Muñoz
4. Linked Open Data
• DBpedia, an export of Wikipedia’s structured data
DBpedia provides an RDF version of all of Wikipedia's structured data (infoboxes),
but not yet of the regular Wikipedia tables (wikitables)
5. Tables as a source of LOD
• Tables are inherently concise as well as information rich
• Infoboxes (attr-value)
• Column header represents types of information
• The values represent instances of those types
• Caption as another row
http://en.wikipedia.org/wiki/Dublin
http://en.wikipedia.org/wiki/Galway
6. Reasoning over Wikipedia Tables
Recovering Table Semantics …
Dublin is twinned with the following places:
http://en.wikipedia.org/wiki/Dublin
7. Reasoning over Wikipedia Tables
Entity annotation for cells, mappings to DBpedia resources
http://en.wikipedia.org/wiki/Dublin
Column properties:
dbpedia.org/property/city                    dbpedia.org/property/nation
dbpedia.org/property/since (xsd:integer)
Cell annotations (city, nation):
dbpedia.org/resource/San_Jose,_California    dbpedia.org/resource/United_States
dbpedia.org/resource/Liverpool               dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Matsue,_Shimane         dbpedia.org/resource/Japan
dbpedia.org/resource/Barcelona               dbpedia.org/resource/Spain
dbpedia.org/resource/Beijing                 dbpedia.org/resource/People's_Republic_of_China
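The cell-to-resource mapping can be illustrated with a minimal sketch. It assumes that a cell's wiki link target maps directly to a DBpedia resource name by replacing spaces with underscores; the helper name `to_dbpedia_resource` and the set of characters left unescaped are assumptions for illustration, not something the slides specify.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def to_dbpedia_resource(wiki_title: str) -> str:
    """Map a Wikipedia page title to a candidate DBpedia resource URI.

    Hypothetical helper: a real pipeline would also have to handle redirects.
    """
    # Wikipedia titles use underscores where the displayed title has spaces.
    name = wiki_title.strip().replace(" ", "_")
    # Assumption: commas, parentheses and apostrophes stay unescaped.
    return DBPEDIA_RESOURCE + quote(name, safe="_,()'")


print(to_dbpedia_resource("San Jose, California"))
print(to_dbpedia_resource("People's Republic of China"))
```

The result is only a candidate URI: the title may be a redirect, which is why redirects need to be resolved before querying DBpedia.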
8. Reasoning over Wikipedia Tables
Extracting relations
http://en.wikipedia.org/wiki/Dublin
Column properties:
dbpedia.org/property/city                    dbpedia.org/property/nation
dbpedia.org/property/since (xsd:integer)
Cell annotations (city, nation):
dbpedia.org/resource/San_Jose,_California    dbpedia.org/resource/United_States
dbpedia.org/resource/Liverpool               dbpedia.org/resource/United_Kingdom
dbpedia.org/resource/Matsue,_Shimane         dbpedia.org/resource/Japan
dbpedia.org/resource/Barcelona               dbpedia.org/resource/Spain
dbpedia.org/resource/Beijing                 dbpedia.org/resource/People's_Republic_of_China
Relations found between the cells of a row:
• dbpedia.org/ontology/country
• dbpedia.org/property/subdivisionName
Read as: (nation) is dbpedia.org/ontology/country of (city)
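A hedged sketch of how the relations between two annotated cells of a row might be looked up: the function below only builds a SPARQL query string that asks for every predicate linking two resources in either direction. The function name `relation_query` is hypothetical, and the slides do not say how the lookup is actually done (live endpoint or local dump).

```python
def relation_query(first_uri: str, second_uri: str) -> str:
    """Build a SPARQL query listing predicates that link two resources.

    Both directions are covered, because the table does not say which cell
    is the subject and which is the object.
    """
    return (
        "SELECT DISTINCT ?p WHERE {\n"
        f"  {{ <{first_uri}> ?p <{second_uri}> }}\n"
        "  UNION\n"
        f"  {{ <{second_uri}> ?p <{first_uri}> }}\n"
        "}"
    )


query = relation_query(
    "http://dbpedia.org/resource/Liverpool",
    "http://dbpedia.org/resource/United_Kingdom",
)
print(query)
```

The string could be sent to any SPARQL endpoint that holds the DBpedia data; no network call is made here.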
11. Reasoning over Wikipedia Tables
• Let’s analyze these cases …
• Liverpool
• Matsue
• Beijing
12. Not that simple…
• Web tables usually don’t have explicit semantics
by themselves.
• Main issues:
– Complex tables with spans
– Captions inside the table as another row
– Not well-formed tables (i.e., not a matrix)
– We need filters (e.g., min 2 columns, 2 rows)
• We are extracting relations at row level and
between the main entity and the table resources
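The filters listed above (well-formed matrix, at least 2 columns and 2 rows) can be sketched as follows. The table representation (a list of rows, each a list of cell strings) and the function names are assumptions for illustration.

```python
from typing import List

Table = List[List[str]]


def is_matrix(table: Table) -> bool:
    """A table is well formed here if every row has the same number of cells."""
    return len(table) > 0 and all(len(row) == len(table[0]) for row in table)


def passes_filter(table: Table, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Keep only well-formed tables with at least min_rows x min_cols cells."""
    return (
        is_matrix(table)
        and len(table) >= min_rows
        and len(table[0]) >= min_cols
    )


good = [["City", "Nation"], ["Liverpool", "United Kingdom"]]
ragged = [["City", "Nation"], ["Liverpool"]]
print(passes_filter(good), passes_filter(ragged))
```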
13. Parsing: Extracting Tables
First step: parsing Wiki format
http://en.wikipedia.org/wiki/People%27s_Republic_of_China
• Caption as another row
• Rowspans
• Table split with pictures
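One common way to deal with spans, shown here as a sketch rather than as the parser used in this work, is to copy a spanning cell into every position it covers so that the table becomes a plain matrix. The input format (each cell as a `(text, rowspan, colspan)` tuple) and the name `expand_spans` are assumptions.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan)


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Copy spanning cells into every grid position they cover."""
    grid: List[List[str]] = []
    # column index -> (text, number of following rows still to fill)
    pending: Dict[int, Tuple[str, int]] = {}
    for row in rows:
        out: List[str] = []
        cells = iter(row)
        col = 0
        while True:
            if col in pending:
                # This position is covered by a rowspan from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, rowspan, colspan = cell
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


rows = [
    [("Region", 2, 1), ("Beijing", 1, 1)],
    [("Tianjin", 1, 1)],
]
print(expand_spans(rows))
```

After expansion, every row has the same length, so the matrix check and the row-level relation extraction can be applied unchanged.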
14. Parsing: Extracting Tables
• Problems with parsing the cell’s content
http://en.wikipedia.org/wiki/Danny_Kaye
16. Parsing: Extracting Tables
• Same page link
• Many different formats
• Anchor text vs. content text
http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s
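The distinction between link target, anchor text, and same-page links can be sketched with a small parser for wiki links of the form `[[Target|anchor]]`. The regular expression covers only this basic syntax and the function name is hypothetical; templates, files, and nested markup are out of scope.

```python
import re
from typing import List, Tuple

# [[Target]] or [[Target|anchor text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def parse_links(cell: str) -> List[Tuple[str, str, bool]]:
    """Return (target, anchor, is_same_page) for each wiki link in a cell."""
    links = []
    for match in WIKILINK.finditer(cell):
        target = match.group(1).strip()
        # Without an explicit anchor, the target is what the reader sees.
        anchor = (match.group(2) or target).strip()
        # Links that start with '#' point to a section of the same page.
        links.append((target, anchor, target.startswith("#")))
    return links


print(parse_links("[[The Simpsons|Simpsons]] and [[#Notes|notes]]"))
```

For entity annotation the target is the useful part, since it names the page; the anchor text is only what the cell displays.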
17. Extracting Relations
http://en.wikipedia.org/wiki/AFC_Ajax
A table containing tables
18. Extracting Relations
• Also relations between the main entity and the entities in the table
http://en.wikipedia.org/wiki/AFC_Ajax
Relations between the 16 players in the table and dbpedia.org/resource/AFC_Ajax:
freq  relationship
14    dbpedia.org/ontology/team
14    dbpedia.org/property/clubs
11    dbpedia.org/property/currentclub
3     dbpedia.org/property/youthclubs
In his DBpedia page there is no mention of AFC Ajax
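Counting how often each relation links the table's entities to the main entity can be sketched as below. The input (one collection of predicates per row) and the toy rows are hypothetical; they are not the AFC Ajax data.

```python
from collections import Counter
from typing import Iterable, List


def count_relations(relations_per_row: Iterable[List[str]]) -> Counter:
    """Count, per predicate, the number of rows in which it was found."""
    counts: Counter = Counter()
    for predicates in relations_per_row:
        # set(): a predicate counts once per row, even if found twice.
        counts.update(set(predicates))
    return counts


# Toy input: predicates found between each row's entity and the main entity.
rows = [
    ["dbpedia.org/ontology/team", "dbpedia.org/property/clubs"],
    ["dbpedia.org/ontology/team"],
    [],  # a row whose entity has no relation to the main entity
]
print(count_relations(rows).most_common())
```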
20. Our Dataset
• enwiki dump from 2012-09-03 02:17:37
• 8.6 GB of Wikipedia pages that comprise
– 10,531,986 documents (HTML pages)
– Only 413,256 HTML pages contain tables
– 2,989,098 tables
– 905,929 tables after the filter
• 27.7% of all tables
– 0.46 tables per page (or 2.15 discarding pages without tables)
22. Ranking of Relationships
• The current ranking function is naïve:
$\mathit{score} = \dfrac{f_{rel}}{n_{rows}}$
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
freq  relationship                        score
14    dbpedia.org/ontology/team           0.875
14    dbpedia.org/property/clubs          0.875
11    dbpedia.org/property/currentclub    0.6875
3     dbpedia.org/property/youthclubs     0.1875
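The naïve ranking divides the frequency of each relation by the number of rows. A minimal sketch using the frequencies from the AFC Ajax table (16 rows); the function name `rank_relations` is an assumption.

```python
from typing import Dict, List, Tuple


def rank_relations(freqs: Dict[str, int], n_rows: int) -> List[Tuple[str, float]]:
    """Score each relation as frequency / number of rows, highest first."""
    scored = [(rel, freq / n_rows) for rel, freq in freqs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ajax_freqs = {
    "dbpedia.org/ontology/team": 14,
    "dbpedia.org/property/clubs": 14,
    "dbpedia.org/property/currentclub": 11,
    "dbpedia.org/property/youthclubs": 3,
}
for relation, score in rank_relations(ajax_freqs, n_rows=16):
    print(relation, score)
```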
23. Ranking of Relationships
• For cases like this one the function does not work well, and $\mathit{score} \notin [0,1]$
http://en.wikipedia.org/wiki/Danny_Kaye
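One plausible way for the score to leave [0,1], stated here as an assumption and not confirmed by the slides, is a cell that holds several entities: if every matching entity is counted, the frequency can exceed the number of rows. The sketch below uses toy data and contrasts this with counting each row at most once, which is one possible fix and not necessarily the one adopted.

```python
from typing import List


def per_entity_score(rows: List[List[str]], predicate: str) -> float:
    """Count every entity-level match; can exceed 1 with multi-entity cells."""
    freq = sum(found.count(predicate) for found in rows)
    return freq / len(rows)


def per_row_score(rows: List[List[str]], predicate: str) -> float:
    """Count each row at most once; always stays within [0, 1]."""
    freq = sum(1 for found in rows if predicate in found)
    return freq / len(rows)


# Toy data: predicates found per row; the first row has three matching entities.
rows = [["starring", "starring", "starring"], ["starring"]]
print(per_entity_score(rows, "starring"))
print(per_row_score(rows, "starring"))
```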
24. Ongoing Work and Challenges
• Improve the ranking function for relations.
• Store the 5.5M DBpedia (transitive) redirects
locally (optimizing time).
• Statistical analysis of Wikipedia tables
– Number of columns, rows
– Headers, Captions
– External and internal links
• The next big challenge is the evaluation.
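Storing the transitive redirects locally amounts to following a redirect chain to its end without querying DBpedia. A minimal sketch, assuming the redirects are loaded into an in-memory dictionary; the names in the example mapping are placeholders, not real DBpedia redirects.

```python
from typing import Dict


def resolve_redirect(uri: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow a redirect chain to its final target, guarding against cycles."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            # End of the chain, or a cycle: stop at the current resource.
            return uri
        seen.add(target)
        uri = target
    return uri


# Placeholder mapping: Old_Name -> Moved_Name -> Final_Name
redirects = {"ex:Old_Name": "ex:Moved_Name", "ex:Moved_Name": "ex:Final_Name"}
print(resolve_redirect("ex:Old_Name", redirects))
```

Precomputing the final target for every key once would turn each later lookup into a single dictionary access.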
25. What’s next?
• Some ideas in mind:
– Use the extracted relations to classify WikiTables
– Define a similarity function for WikiTables
(Figure panels: English, Italian)
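The slides leave the similarity function open. As one possible starting point, and purely as an assumption, two WikiTables could be compared by the Jaccard overlap of the sets of relations extracted from them:

```python
from typing import Iterable


def jaccard(relations_a: Iterable[str], relations_b: Iterable[str]) -> float:
    """Jaccard similarity between two sets of extracted relations."""
    a, b = set(relations_a), set(relations_b)
    if not a and not b:
        # Two tables without any extracted relation: treat as not similar.
        return 0.0
    return len(a & b) / len(a | b)


table_1 = {"dbpedia.org/ontology/team", "dbpedia.org/property/clubs"}
table_2 = {"dbpedia.org/ontology/team", "dbpedia.org/property/currentclub"}
print(jaccard(table_1, table_2))
```

Because relations are DBpedia URIs rather than strings in a particular language, such a measure could in principle compare tables across language editions.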
27. What’s next?
http://dbpedia.org/page/Chlorous_acid
http://en.wikipedia.org/wiki/Electronegativity
Chlorous acid is a chlorite
http://en.wikipedia.org/wiki/Chlorine
28. Open problems
• Handle multiple entities in the same cell
• Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome
Thanks! Q&A
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org