A talk about the use of Linked Data in historical research on census data. It also includes some slides about TabLinker (http://github.com/Data2Semantics/TabLinker). Part of the Data2Semantics project (http://data2semantics.org).
Linked Census Data to RDF
1. Linked Census Data
Rinke Hoekstra
CEDAR Kickoff, 26 January 2012
Thursday 26 January 2012
2. Overview
“Can Linked Data make a difference for historical analysis?”
Problem
Procedure (as I understand it)
Step-by-step
Vocabularies, tools
Conclusion
3. Problem
~519 Excel spreadsheets (more?... I heard 1200)
Want to do analysis over time and space, but...
Structure: Excel sheets cannot be readily imported into a database
Contents: Excel sheets are not normalised (age) nor harmonised (occupations/places), and contain errors (both original and data-entry)
Want to preserve all stages of data cleansing/harmonisation
4. Procedure
Archiving: Verbatim import of sheets to database/triple store
Correcting/Documenting: Add missing information (headers)
Interpreting: Add corrected information (data)
Normalising: Interpret and correct objective information
Harmonising: Link information across sheets; link information to other datasets (e.g. locations)
Visualising: Build (generic) visualisations of results
5. ... a bit about Linked Data
“Just another Data Model”
RDF ≠ Ontology (OWL)
RDF ≠ Taxonomy (RDFS/SKOS)
Globally Unique Identifiers (URI) for all entities
Dereferenceable on the Web (URI = URL)
HTTP-accessible databases (triple stores, SPARQL)
Triples all the way: <subject, predicate, object>
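To make the triple model concrete, here is a minimal sketch in plain Python (no RDF library). It treats a graph as a set of (subject, predicate, object) tuples and a query as a pattern with wildcards. The `example.org` URIs and the `match` helper are illustrative assumptions, not something taken from the talk.

```python
# A graph is a set of (subject, predicate, object) triples.
# All example.org URIs below are hypothetical, for illustration only.
EX = "http://example.org/census/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

graph = {
    (EX + "cell/A1", RDF_TYPE, EX + "Cell"),
    (EX + "cell/A1", EX + "value", "12"),
    (EX + "cell/A2", RDF_TYPE, EX + "Cell"),
    (EX + "cell/A2", EX + "value", "6"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Which subjects are cells?" expressed as a triple pattern.
cells = [s for s, _, _ in match(graph, p=RDF_TYPE, o=EX + "Cell")]
print(cells)
```

A real triple store answers the same kind of pattern through SPARQL over HTTP; the sketch only shows that nothing beyond triples is needed to hold the data.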
6. Spreadsheet ≠ Database
Primary Keys are entities
Column names are attributes
Cell values are attribute values
Secondary keys are relations to other entities
9. Spreadsheet ≠ Database
No Primary Keys!
Anything can be an entity
Column headers are “types”
Row headers are “types”
Hierarchies!
Cell values are entity “values”
No relations to other entities
10. Anatomy of a Spreadsheet
[Figure: a Workbook contains Sheets, and each Sheet contains a grid of Cells.]
11. Anatomy of a Spreadsheet
[Figure: Workbook1.xls contains Sheet1 and Sheet2. Each cell is identified by its sheet and position: Sheet1:A1, Sheet1:B1, Sheet1:C1, Sheet1:A2, ... and Sheet2:A1, Sheet2:B1, Sheet2:C1, Sheet2:A2, ...]
13. Anatomy of a Spreadsheet
[Figure: Workbook1.xls with cell contents. Sheet1: workers; agriculture, 12; industry, 6; ... Sheet2: diamond cutters; A, 34; B, 67; ...]
NB: all URIs scoped to sheet!
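One way to read "scoped to sheet" is that the workbook and sheet names become part of every URI, so the same cell position or label in two sheets never collides. The sketch below illustrates that idea only; the base URI and the `scoped_uri` helper are hypothetical and do not reflect how TabLinker actually mints URIs.

```python
from urllib.parse import quote

BASE = "http://example.org/census/"  # hypothetical base URI


def scoped_uri(workbook, sheet, local_name):
    """Mint a URI for a cell or header, scoped to its workbook and sheet."""
    parts = [quote(part, safe="") for part in (workbook, sheet, local_name)]
    return BASE + "/".join(parts)


a = scoped_uri("Workbook1.xls", "Sheet1", "A1")
b = scoped_uri("Workbook1.xls", "Sheet2", "A1")
c = scoped_uri("Workbook1.xls", "Sheet2", "diamond cutters")
print(a)
print(b)
print(c)
```

Because the sheet name is part of the identifier, "A1" in Sheet1 and "A1" in Sheet2 are different entities until someone explicitly links them.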
14. Data Cube
How to best represent numeric data, in a flexible way?
SDMX (Eurostat, World Bank, CBS, etc.)
Every data item is an observation
Every observation has a value
Every observation has one or more dimensions
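The three statements above can be sketched as a small function that turns one spreadsheet cell into triples. The `qb:Observation` class and the `sdmx-measure:obsValue` property come from the RDF Data Cube and SDMX-RDF vocabularies; the `example.org` URIs, the dimension names `category` and `sector`, and the observation identifier are assumptions made up for this example.

```python
QB = "http://purl.org/linked-data/cube#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/census/"  # hypothetical namespace


def observation_triples(obs_id, value, dimensions):
    """Describe one data item as an observation with a value and dimensions."""
    obs = EX + "observation/" + obs_id
    triples = [
        (obs, RDF_TYPE, QB + "Observation"),
        (obs, SDMX_MEASURE + "obsValue", value),
    ]
    # One triple per dimension; the dimension properties are hypothetical.
    for name, dim_value in sorted(dimensions.items()):
        triples.append((obs, EX + "dimension/" + name, dim_value))
    return triples


triples = observation_triples(
    "Sheet1-C1", 12, {"category": "workers", "sector": "agriculture"}
)
for t in triples:
    print(t)
```

Adding a dimension later (for example a census year) only adds one more triple per observation, which is the flexibility the slide asks for.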
16. Data Cube
[Figure: an example observation from a census table. Values shown: 12, 1878, M, O, I, E, pannenbakkers, D, 1. Dimension labels shown: leeftijd (age), geboortejaar (year of birth), geslacht (sex), huwelijkse staat (marital status), beroep (occupation), positie (position), nummer der beroepsklasse (occupational class number), letter der beroepsklasse (occupational class letter).]
17. Data Cube
[Figure: the same observation, with question marks added next to several of the dimensions.]
18. Anatomy of a Spreadsheet
[Figure: a sheet divided into four regions. Top row: Properties, Headers. Bottom row: RowHeaders, Data.]
http://github.com/Data2Semantics/TabLinker
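As a rough illustration of the four regions named on the slide (Properties, Headers, RowHeaders, Data), the sketch below classifies a cell purely by its position. It assumes a simple rectangular layout with the header rows at the top and the header columns on the left; that layout and the `region` helper are assumptions for this example, not a description of how TabLinker marks up a sheet.

```python
def region(row, col, header_rows, header_cols):
    """Classify a zero-based cell position into one of the four regions."""
    if row < header_rows and col < header_cols:
        return "Properties"   # top-left corner: names of the row-header columns
    if row < header_rows:
        return "Headers"      # column headers above the data
    if col < header_cols:
        return "RowHeaders"   # row headers to the left of the data
    return "Data"


# A 3x4 sheet with one header row and two row-header columns.
grid = [
    [region(r, c, header_rows=1, header_cols=2) for c in range(4)]
    for r in range(3)
]
for line in grid:
    print(line)
```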
23. What TabLinker can’t do
Annotations (“footnote”-style, on a separate sheet)
Interpret functions (e.g. automatic sums)
Integrate/harmonise across sheets/files
Additional useful functionality:
“checksum” functionality
Export to database tables
27. Harmonising
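[Figure: a concept hierarchy built with skos:broader. Top concept: I. Intermediate concepts: D, E, A. Leaf concepts: Fabricage van kalk (manufacture of lime); Fabricage van steen (molensteen, steenbakkers, tegelbakkers) (manufacture of stone: millstone, brickmakers, tile makers); Fabricage van dakpannen (pannenbakkers) (manufacture of roof tiles: roof-tile makers); Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.) (manufacture of earthenware, incl. porcelain, terracotta, stove makers, potters, etc.).]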
28. Harmonising
[Figure: the same skos:broader hierarchy (top concept I; intermediate concepts D, E, A; leaf concepts Fabricage van kalk, Fabricage van steen, Fabricage van dakpannen, Fabricage van aardewerk), with concepts linked to HISCO codes via skos:exactMatch, skos:broadMatch and skos:closeMatch. HISCO codes shown: HISCO:23810, HISCO:23811, HISCO:25281, HISCO:26340, HISCO:26345.]
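The point of mapping sheet-level concepts to HISCO is that concepts from different sheets become comparable once they share an external code. The sketch below groups concepts by mapping target. The specific concept-to-code pairings and the `Sheet1:`/`Sheet2:` names are invented for illustration; the slide text does not preserve which concept maps to which code.

```python
from collections import defaultdict

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical mapping triples: these pairings are assumptions, not the
# mappings shown in the original figure.
mappings = [
    ("Sheet1:pannenbakkers", SKOS + "exactMatch", "HISCO:25281"),
    ("Sheet2:pannenbakkers", SKOS + "exactMatch", "HISCO:25281"),
    ("Sheet1:steenbakkers", SKOS + "closeMatch", "HISCO:26345"),
]


def group_by_target(triples, predicates):
    """Group source concepts by the external concept they are mapped to."""
    groups = defaultdict(list)
    for s, p, o in triples:
        if p in predicates:
            groups[o].append(s)
    return {target: sorted(sources) for target, sources in groups.items()}


# Only trust exact matches when aggregating across sheets.
exact = group_by_target(mappings, {SKOS + "exactMatch"})
print(exact)
```

Restricting the predicate set is a design choice: an analysis can decide per question whether skos:closeMatch or skos:broadMatch links are good enough to aggregate over.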
29. Harmonising
[Figure: the skos:broader hierarchy shown twice: once with plain labels (I; D, E, A; Fabricage van kalk, Fabricage van steen, Fabricage van dakpannen, Fabricage van aardewerk) and once with every concept prefixed by its sheet of origin (Sheet1:I; Sheet1:D, Sheet1:E, Sheet1:A; Sheet1:Fabricage van kalk, Sheet1:Fabricage van steen, Sheet1:Fabricage van dakpannen, Sheet1:Fabricage van aardewerk).]
30. Harmonising
[Figure: two versions of the skos:broader hierarchy, one from 1889 and one from 1899, each with top concept I, intermediate concepts D, E, A, and the four "Fabricage van ..." leaf concepts. In 1899, Fabricage van steen lists only steenbakkers and tegelbakkers (no molensteen), and Fabricage van aardewerk no longer lists terracotta. Concepts are linked across the two years with skos:exactMatch, skos:closeMatch and skos:narrowMatch.]
31. Is SKOS sufficient?
[Figure: the 1889 and 1899 skos:broader hierarchies (top concept I; intermediate concepts D, E, A; four "Fabricage van ..." leaf concepts each), linked across the two years with skos:exactMatch, skos:closeMatch and skos:narrowMatch.]
NB: These are not strings, but globally unique URIs, scoped within their spreadsheet (graph!) of origin.
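The "graph!" remark can be illustrated with quads: each statement carries a fourth element naming the graph it belongs to, so the 1889 and 1899 hierarchies stay intact while the cross-year links live in a graph of their own. This is a minimal sketch; the graph names, the `1889:`/`1899:` prefixes, and the particular skos:exactMatch link are assumptions for illustration.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Quads: (subject, predicate, object, graph). All names are hypothetical.
quads = [
    ("1889:D", SKOS + "broader", "1889:I", "graph:1889"),
    ("1899:D", SKOS + "broader", "1899:I", "graph:1899"),
    ("1889:D", SKOS + "exactMatch", "1899:D", "graph:harmonisation"),
]


def in_graph(quads, graph):
    """Return the triples that belong to a single named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == graph]


original_1889 = in_graph(quads, "graph:1889")
links = in_graph(quads, "graph:harmonisation")
print(original_1889)
print(links)
```

Keeping the harmonisation links in a separate graph means the original sheet data can always be queried on its own, which fits the earlier requirement to preserve all stages of cleansing and harmonisation.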
34. Vocabularies, Tools
Vocabularies: Data Cube, SKOS, W3C Time, PROV-O
Excel + TabLinker: semi-automatic conversion of Excel sheets to RDF
ProvTracer: creates a PROV-O provenance trail for shell/Python scripts
Visualization prototype: Sgvizler (SPARQL + Google Graph API)
35. Discussion
Advantages of Linked Data approach
Straightforward transformation from spreadsheets
Seamless integration of original, corrected and harmonised data
Ingestion of external (linked) data
Powerful documentation (provenance)
Everything is transparently queryable (SPARQL) ... on the Web
36. Discussion
Disadvantages of Linked Data approach (subject to research)
Size? (300k triples per sheet × 519 sheets ≈ 156M triples)
Only rudimentary support for arithmetical operations in queries
No dynamic/conditional ‘view’-like graphs
37. SPARQL vs. SQL?
Middle ground?
Expose database through D2RQ