This presentation provides an outlook on what we anticipate for the Structured Data Hub: creating linkable datasets, enhancing the use of provenance, adding quality flags to data, answering new questions and, finally, borrowing from and contributing to public sources such as DBpedia.
2. Status quo
Many datasets currently live in isolation. They are stored on people’s computers and are not findable. Moreover, little effort is made to link such datasets. When data are linked, the datasets first need to be cleaned and harmonised, which is very time-intensive. More importantly, such linkage efforts are seldom shared, which literally results in ‘disposable research’.
3. What we envisage
We envisage selecting core micro, meso and macro datasets from the field of economic and social history and creating a structured data hub from those.
4. What we envisage
Structured Data Hub
Your data
Tooling
WWW
Next, we want to allow you to connect your data to the hub and to build such connections yourself, while we ensure that your data is findable and linkable to other datasets on the (semantic) World Wide Web.
5. The Structured Data Hub
A place to
store data
augment data
link data
find data
ask questions! (for data analysis and visualization)
So, the Structured Data Hub is a place to … Now let’s go into more detail on some of these aspects.
6. Data augmentation
A first feature of the Structured Data Hub is augmentation. By augmentation we mean the process of enhancing your data with core variables from the social, demographic and economic sciences.
7. For example, think of this dataset containing individual characteristics, including occupation and HISCO code. If we wanted to know whether these persons were incumbents of high or low occupations, we would need to add a stratification measure.
8. Here, we add the universal HISCAM scale, but any other HISCO-based stratification scale or class measure can be added.
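Adding a stratification measure in this way amounts to a lookup on the occupation code. A minimal sketch in Python, assuming a small in-memory lookup table: the codes and scores below are made-up placeholders (not real HISCO or HISCAM values), and `augment_with_scale` is a hypothetical helper name, not part of any existing tool.

```python
# Placeholder lookup: occupation code -> stratification score.
# These codes and scores are invented for illustration only.
SCALE_LOOKUP = {
    "code_a": 48.5,
    "code_b": 72.0,
}


def augment_with_scale(records, lookup, code_field="hisco", new_field="hiscam"):
    """Return copies of the records with a stratification score added.

    Records whose code is missing from the lookup get None, so that gaps
    stay visible instead of being silently dropped.
    """
    augmented = []
    for record in records:
        enriched = dict(record)  # copy, so the input data is left untouched
        enriched[new_field] = lookup.get(record.get(code_field))
        augmented.append(enriched)
    return augmented


people = [
    {"surname": "Fumes", "occupation": "cigar maker", "hisco": "code_a"},
    {"surname": "Bridges", "occupation": "civil engineer", "hisco": "code_b"},
    {"surname": "Moves", "occupation": "dancer", "hisco": "code_c"},
]

augmented_people = augment_with_scale(people, SCALE_LOOKUP)
```

Returning copies rather than changing the records in place keeps the original data intact, which matters once each change has to be traceable.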
9. We might also be interested in the area where people are working, here indicated by the place variable. If we wanted to map such values, or calculate distances between
these places, we would need information on the latitude and longitude.
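Once latitude and longitude are attached to a place, a distance follows from the haversine formula. A minimal sketch, assuming a hypothetical `PLACE_COORDINATES` table; the two cities and their approximate coordinates are illustrative choices, not values from the datasets discussed here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) in decimal degrees, for illustration only.
PLACE_COORDINATES = {
    "Amsterdam": (52.37, 4.90),
    "Rotterdam": (51.92, 4.48),
}


def haversine_km(origin, destination):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_between(place_a, place_b, coordinates=PLACE_COORDINATES):
    """Look up both places and return the distance between them in km."""
    return haversine_km(coordinates[place_a], coordinates[place_b])


distance = distance_between("Amsterdam", "Rotterdam")
```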
10. Another type of data augmentation concerns the application of basic calculations to derive new variables. Income, for example, is seldom analysed in its raw form and is often rescaled using a log transformation.
11. The Structured Data Hub facilitates the creation and documentation of such newly derived variables.
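Creating and documenting a derived variable can go hand in hand. A minimal sketch, assuming a hypothetical helper `derive_log_variable` that returns both the transformed records and a small description of the derivation; the income values are invented for illustration.

```python
import math


def derive_log_variable(records, source_field, new_field):
    """Add the natural log of source_field as new_field and document it.

    Missing or non-positive values yield None, because the logarithm is
    undefined for them.
    """
    derived = []
    for record in records:
        value = record.get(source_field)
        enriched = dict(record)  # copy, so the raw variable is preserved
        if value is not None and value > 0:
            enriched[new_field] = math.log(value)
        else:
            enriched[new_field] = None
        derived.append(enriched)
    documentation = {
        "variable": new_field,
        "derived_from": source_field,
        "transformation": "natural logarithm",
    }
    return derived, documentation


# Invented example values.
incomes = [
    {"surname": "Fumes", "income": 100.0},
    {"surname": "Bones", "income": 0.0},
]

rows, doc = derive_log_variable(incomes, "income", "log_income")
```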
12. Provenance tracking
A second feature of the Data Hub is traceable provenance. Currently, bigger datasets such as Clio-Infra consist of a core part derived from a bigger statistical agency, combined with many smaller datasets as well as ‘corrections’ of the data by the researcher. After a few iterations it is hard to track who contributed what, or which number was changed by whom and for what reason. We therefore present provenance tracking.
13. version 1 + activity = version 2
The basic formula for provenance we use is that one version leads to the next as the result of an activity.
14. activity
who
when
what
how
For proper provenance it is crucial to describe this activity, at least in terms of what the activity entailed, how it was performed, by whom and in which time period.
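The four parts of an activity description can be captured in a small record. A minimal sketch, assuming a hypothetical `Activity` class and completeness check; treating the source of a value as the ‘how’ is our own reading for this illustration, not a fixed rule.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Activity:
    who: str   # agent responsible, e.g. a researcher identifier
    when: str  # time or period in which the activity took place
    what: str  # what the activity entailed
    how: str   # how it was performed, e.g. the source or method used


def missing_parts(activity):
    """Return the names of the descriptive fields that were left empty."""
    return [f.name for f in fields(activity)
            if not getattr(activity, f.name).strip()]


complete = Activity(
    who="dai:richard.zijdeman",
    when="2015-12-09",
    what="added occupation for Bones",
    how="from Gravediggers Vol II",
)
incomplete = Activity(who="", when="2015-12-09",
                      what="corrected a value", how="")

gaps = missing_parts(incomplete)
```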
15. Version 1 (PID: ab.123)
surname    occupation
Fumes      cigar maker
Bridges    civil engineer
Moves      dancer
Bones
Activity:
- added occupation Bones
- from Gravediggers Vol II
- 2015-12-09A09:30:17
- dai:richard.zijdeman
Version 2 (New PID! PID: bc.789)
surname    occupation
Fumes      cigar maker
Bridges    civil engineer
Moves      dancer
Bones      undertaker
In this example, the occupation for ‘Bones’ is added, which leads to a new version of the data and hence a new PID. Moreover, the action of adding the value for occupation is itself recorded as provenance.
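The step from one version to the next can be sketched in a few lines. This is an illustration under stated assumptions: the PID here is simply a short hash of the content (`make_pid` is a hypothetical helper, and the hub’s actual PID scheme is not specified in this talk), and `apply_activity` is likewise a made-up name.

```python
import hashlib
import json


def make_pid(rows):
    """Hypothetical PID: a short hash of the data content."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()[:12]


def apply_activity(version, surname, occupation, activity):
    """Return a new version with one occupation filled in.

    The previous version is left unchanged; the new version records which
    version it was derived from and the activity that produced it.
    """
    new_rows = [
        {**row, "occupation": occupation} if row["surname"] == surname
        else dict(row)
        for row in version["rows"]
    ]
    return {
        "pid": make_pid(new_rows),
        "rows": new_rows,
        "derived_from": version["pid"],
        "activity": activity,
    }


rows_v1 = [
    {"surname": "Fumes", "occupation": "cigar maker"},
    {"surname": "Bridges", "occupation": "civil engineer"},
    {"surname": "Moves", "occupation": "dancer"},
    {"surname": "Bones", "occupation": None},
]
version_1 = {"pid": make_pid(rows_v1), "rows": rows_v1}

version_2 = apply_activity(
    version_1, "Bones", "undertaker",
    {"who": "dai:richard.zijdeman", "when": "2015-12-09",
     "what": "added occupation Bones", "how": "from Gravediggers Vol II"},
)
```

Because the new version points back to the PID it was derived from, a chain of such versions can be walked backwards to see who changed what and why.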
16. Quality flags
An important aspect to consider when combining data is that datasets vary in quality.
17. Quality flags
Allow for quality flags of content
e.g. created by scientists
e.g. peer reviewed (by scientist)
created by public and peer reviewed
We will design a system in which datasets are accompanied by a ‘quality flag’, an indicator of the trustworthiness of the dataset. This might involve simple reputation effects, but could also provide more advanced features, such as whether other data confirm the values in the dataset. Work together with sestet on this.
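One way to picture a quality flag combined with a confirmation check is sketched below. All names are hypothetical (`QualityFlag`, `agreement_share`); the flag labels follow the slide, and the comparison data are invented. It is an illustration of the idea, not the system we will design.

```python
from enum import Enum


class QualityFlag(Enum):
    CREATED_BY_SCIENTISTS = "created by scientists"
    PEER_REVIEWED = "peer reviewed (by scientist)"
    PUBLIC_PEER_REVIEWED = "created by public and peer reviewed"


def agreement_share(values, reference):
    """Share of keys present in both mappings whose values match.

    Returns None when the two sources have no keys in common, because
    nothing can then be confirmed or contradicted.
    """
    shared = values.keys() & reference.keys()
    if not shared:
        return None
    matches = sum(1 for key in shared if values[key] == reference[key])
    return matches / len(shared)


dataset = {
    "flag": QualityFlag.PUBLIC_PEER_REVIEWED,
    "values": {"Fumes": "cigar maker", "Bones": "undertaker",
               "Moves": "dancer"},
}
# Invented second source used to cross-check the values above.
other_source = {"Fumes": "cigar maker", "Bones": "gravedigger",
                "Bridges": "civil engineer"}

share = agreement_share(dataset["values"], other_source)
```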
18. Basic visualisation
Focus on visual exploration of data and results
‘Ask’ question and get visual output:
e.g. bar, line graph etc.
get output on map or even as ‘movie’
A final feature that we want to highlight here is to ask questions and receive a ‘visual’ answer. Data visualisations are increasingly present in all sorts of media and our hub
will allow for such visualisations to answer basic questions on historical patterns.
20. From Science to Society and back
Provide data to public: ‘enthusiasts’, journalists
Have enthusiasts add data to the hub (creating linked data): e.g. stucadoors dataset, harbour datasets, railway datasets, etc.
And back: link scientific data to crowd-projects like DBpedia: enhance occupations with descriptions
The last point we want to make about the Structured Data Hub is that it is not just for academics: we provide our tools for a broader audience too. This means that we assume a fairly low level of historical knowledge and technical skill. However, we also believe that ‘the public’ is making quite interesting datasets, from which we may borrow and to which we may give back by enriching them with scientific knowledge.