1. Retooling a Research Data Repository: data.depositar.io
“Technology - Building Useful Tools”
ECAI at 20 Workshop in Conjunction with PNC 2017
November 9, 2017
NCKU, Tainan, Taiwan
莊庭瑞 Tyng-Ruey Chuang
黃韋菁 Andrea Wei-Ching Huang
李承錱 Cheng-Jen Lee
許煌鑫 Huang-Sin Syu
Institute of Information Science
Academia Sinica, Taipei, Taiwan
3. Collaborative Research
● Collaboration is the process of two or more
people or organizations working together to
realize or achieve something successfully. –
Wikipedia
● To do collaborative research, we should make
– the research project and
– the research data
open to project members (or even everyone).
4. Openness
● Libre
– can be used by people
● Digital
– can be used by machines and put online
● Raw
– can be modified and re-purposed
● Common (format & vocabulary)
– can be exchanged and interlinked
● Transparent
– the process itself can be inspected and fixed (meta-level)
5. Openness Benefits Research
● Helps disseminate research findings.
● Helps reproduce and re-purpose research results.
● Helps encourage research collaborations.
8. A Web-based Research Data Repository
● Built with CKAN
– A free and open source data management
system
– For self-hosted publishing, storing, managing,
showing, and using data.
● Manage research datasets
9. Search and Discover Data
● With free-text
● With filters
● With a given spatio-temporal extent
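The three search modes above map naturally onto CKAN's Action API. The sketch below only builds a `package_search` URL and sends nothing over the network. `q`, `fq`, and `rows` are standard CKAN parameters, while `ext_bbox` assumes the ckanext-spatial extension is enabled on the site. The helper name `build_search_url`, the example filter, and the bounding box are illustrative assumptions, and a temporal filter is left out because its field name would be site-specific.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(base_url, text="", filters=None, bbox=None, rows=10):
    """Build a CKAN package_search URL (hypothetical helper; no request is sent)."""
    # "*:*" is the match-everything query used when no free text is given.
    params = {"q": text or "*:*", "rows": rows}
    if filters:
        # fq takes Solr-style field:"value" clauses; join them explicitly with AND.
        clauses = [f'{field}:"{value}"' for field, value in sorted(filters.items())]
        params["fq"] = " AND ".join(clauses)
    if bbox:
        # Assumption: ckanext-spatial is installed and reads ext_bbox
        # as "minx,miny,maxx,maxy" in longitude/latitude.
        params["ext_bbox"] = ",".join(str(c) for c in bbox)
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{urlencode(params)}"


url = build_search_url(
    "https://data.depositar.io",
    text="Tainan",
    filters={"res_format": "CSV"},
    bbox=(120.0, 22.9, 120.4, 23.1),  # illustrative box, not taken from the slides
)
query = parse_qs(urlparse(url).query)
print(query["q"], query["fq"], query["ext_bbox"])
```

Parsing the URL back with `parse_qs` is just a quick way to check that each search mode ended up in its own parameter.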
12. Example: Map Comparison
① Showing places extracted from a map of Tainan, Taiwan in 1924 (blue place marks).
② Overlaying the places of 1924 on the 1896 Rapid Survey Map of Tainan, Taiwan.
③ Learning that the Koxinga Temple (延平郡王祠) of 1896 had become the Koxinga Ancestral Shrine (開山神社) by 1924, as Taiwan was then under Japanese rule.
13. Retooling a Research Data Repository
● From Taijiang Research Data Repository (Since 2014)
– Taijiang.tw/en/
● To a general-purpose research data repository
– Data.depositar.io/en/
● Based on all the aforementioned functions
● With adjustments & enhancements
– Generalized and multilingual metadata
– Wikidata-powered keywords
– More fill-in snippets
– Latest CKAN goodies
14. Generalized and Multilingual Metadata
● One set of simpler metadata fields for all kinds
of datasets, with three categories:
– Basic Information: title, description, data type...
– Descriptive Information: language, temporal &
spatial information, keywords...
– Management Information: license, author, created
time, organization, maintainer…
● Result: ~35% fewer metadata fields than the previous version
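As a rough illustration of the three-category layout, the sketch below groups a flat metadata record by category. The field keys are assumptions derived from the names on the slide, and the lists are partial because the slide itself ends each category with an ellipsis. The sample record is made up for the example.

```python
# Partial schema: only the fields named on the slide, with assumed key names.
METADATA_SCHEMA = {
    "basic": ["title", "description", "data_type"],
    "descriptive": ["language", "temporal", "spatial", "keywords"],
    "management": ["license", "author", "created", "organization", "maintainer"],
}


def group_by_category(record):
    """Split a flat metadata record into the three categories."""
    grouped = {category: {} for category in METADATA_SCHEMA}
    for category, fields in METADATA_SCHEMA.items():
        for field in fields:
            if field in record:
                grouped[category][field] = record[field]
    return grouped


# Illustrative record, not an actual dataset from the repository.
record = {"title": "Places in Tainan, 1924", "language": "zh", "license": "CC BY 4.0"}
grouped = group_by_category(record)
print(grouped)
```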
16. Wikidata-powered Keywords
● Keywords: controlled vocabularies for tagging
datasets
● Adding keywords to a predefined list
– A never-ending process…
● Use Wikidata as data source
– 37M+ entries
– Multilingual
– Semantic relations enable data inference
● Ex. Tainan is part of Taiwan
– Placenames with coordinates and geonames.org
information
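To make the inference point concrete, here is a minimal sketch of how a "part of" relation lets a dataset tagged with Tainan match a search for Taiwan. The `PART_OF` table is a toy stand-in for relations that would come from Wikidata, and the function names are hypothetical. It does not reflect how the repository itself queries Wikidata.

```python
# Toy "part of" edges standing in for relations that would come from Wikidata.
PART_OF = {"Anping District": "Tainan", "Tainan": "Taiwan"}


def containing_places(place):
    """Follow part-of edges upward and return every enclosing place."""
    found = []
    while place in PART_OF:
        place = PART_OF[place]
        if place in found:  # guard against accidental cycles in the toy data
            break
        found.append(place)
    return found


def matches(keywords, query):
    """True if the query equals a keyword or encloses one of them."""
    return any(query == kw or query in containing_places(kw) for kw in keywords)


print(containing_places("Anping District"))
print(matches(["Tainan"], "Taiwan"))
```

The relation is directional: a dataset tagged only with Taiwan does not match a search for Tainan.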
17. Wikidata-powered Keywords
1. Search and select keywords when creating a
new dataset
2. Keywords (as Wikidata IDs) are stored. Viewed
in English
3. Viewed in Chinese and other languages too!
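Storing the Wikidata ID rather than a label is what makes the language switch possible. A minimal sketch, assuming a small hand-written label table in place of labels fetched from Wikidata (`LABELS` and `display_keyword` are hypothetical names, and the table is deliberately partial):

```python
# Partial, hand-written label table standing in for labels fetched from Wikidata.
LABELS = {"Q865": {"en": "Taiwan", "zh": "臺灣"}}


def display_keyword(wikidata_id, lang, fallback="en"):
    """Return the label in the requested language, else the fallback, else the raw ID."""
    labels = LABELS.get(wikidata_id, {})
    return labels.get(lang) or labels.get(fallback) or wikidata_id


stored_keywords = ["Q865"]  # what the dataset record keeps
print([display_keyword(k, "en") for k in stored_keywords])
print([display_keyword(k, "zh") for k in stored_keywords])
```

Falling back to another language, and finally to the bare ID, keeps a keyword visible even when no label is available in the viewer's language.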
18. More Fill-in Snippets
✔ A checkbox to open the dataset to organization
members only (default is to open to all).
✔ Auto-completion of maintainer information (with
name and email from logged-in account).
✔ Generate better dataset URLs from their titles
(e.g. titles in Chinese characters).
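The slides do not say how URLs are derived from titles, so the following is only one possible approach, not the repository's method. It relies on CKAN's rule that dataset names use lowercase ASCII letters, digits, `-` and `_` (2 to 100 characters), and it falls back to a short hash when a title, such as one written entirely in Chinese characters, leaves no ASCII text. `dataset_slug` and the `dataset-` prefix are hypothetical.

```python
import hashlib
import re
import unicodedata


def dataset_slug(title, max_len=100):
    """Derive a CKAN-compatible dataset name from a title (illustrative only)."""
    # Decompose accented letters, then drop anything that is not ASCII.
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    if len(slug) < 2:
        # Nothing usable survived (e.g. an all-Chinese title): use a stable hash.
        slug = "dataset-" + hashlib.sha1(title.encode("utf-8")).hexdigest()[:8]
    return slug[:max_len].strip("-")


print(dataset_slug("Tainan Map, 1924"))
print(dataset_slug("臺南地圖"))
```

A hash keeps the URL stable and valid but not readable. Transliterating the title would be friendlier, at the cost of an extra dependency.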
19. Latest CKAN Goodies
● Private datasets (which can only be seen by
organization members) are now included in the
dataset search results (for those who have
access).
● Separated site language translations from
CKAN core.
● Speed improvements for displaying a dataset.
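The visibility rule for private datasets in search results can be modeled in a few lines. This is a toy model of the rule as stated above, not CKAN's implementation. `private` and `owner_org` are real CKAN dataset fields, while the function name and sample records are made up.

```python
def visible_results(datasets, user_orgs=()):
    """Keep datasets that are public or owned by one of the user's organizations."""
    return [d for d in datasets if not d["private"] or d["owner_org"] in user_orgs]


# Illustrative records only.
DATASETS = [
    {"name": "tainan-map-1924", "private": False, "owner_org": "org-a"},
    {"name": "field-notes-draft", "private": True, "owner_org": "org-a"},
]

print([d["name"] for d in visible_results(DATASETS)])
print([d["name"] for d in visible_results(DATASETS, user_orgs=("org-a",))])
```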