An Ecosystem for Linked Humanities Data

  1. An Ecosystem for Linked Humanities Data
 Rinke Hoekstra
 Vrije Universiteit Amsterdam/University of Amsterdam
 rinke.hoekstra@vu.nl
 
 Albert Meroño-Peñuela, Kathrin Dentler, Auke Rijpma, Richard Zijdeman and Ivo Zandhuis
  2. The Promise of Digital Humanities
  4. http://schoolofherring.com
  5. http://science-all.com/fishing.html
  6. The Problem of Digital Humanities
 Image: Pacific barreleye, http://imgur.com/gallery/Mzyb5 (it can rotate its eyes forwards or upwards to look through its transparent head at prey above)
  7. http://www.asergeev.com/pictures/archives/compress/2012/1034/24.htm
  8. The Cost of Data Preparation
 Common Motifs in Scientific Workflows: An Empirical Analysis
 Daniel Garijo*, Pinar Alper†, Khalid Belhajjame†, Oscar Corcho*, Yolanda Gil‡, Carole Goble†
 *Ontology Engineering Group, Universidad Politécnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
 †School of Computer Science, University of Manchester. {alperp, khalidb, carole.goble}@cs.manchester.ac.uk
 ‡Information Sciences Institute, Department of Computer Science, University of Southern California. gil@isi.edu
 
 Abstract: While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.
 
 I. INTRODUCTION
 Scientific workflows have been increasingly used in the last decade as an instrument for data intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i.e. the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow.
 
 A variety of workflow systems are in use [10] [3] [7] [2] serving several scientific disciplines. A workflow is a software […] [14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re-use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments.
 
 Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objectives:
    1) To reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence.
    2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use.
    3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions.
 
 In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data-oriented motifs, and ii) a characterization of the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs […]
 
 Fig. 3. Distribution of Data-Oriented Motifs per domain
 Fig. 5. Data Preparation Motifs in the Genomics Workflows
  9. We do this repeatedly for the same datasets
  12. Top Down: Big Micro Data(sets)
 • North Atlantic Population Project (NAPP)
 • Integrated Public Use Microdata Series (IPUMS)
 • Mosaic
 
 • Only data slices can be downloaded
 • Standardisation leads to loss of detail
 • Results are not mutually compatible
 • Large scale efforts are very expensive
 … and they do not solve the problem!
  13. … the current workflow
  18. … the current workflow
 Do adverse conditions (Great Depression) around birth or early in life affect socioeconomic and health outcomes?
 • Thomasson and Fishback. 2014. “Hard Times in the Land of Plenty: The Effect on Income and Disability Later in Life for People Born during the Great Depression.” Explorations in Economic History 54: 64–78.
 • Dutch “Hunger-winter” studies (cf. Lindeboom)
 Does GDP per capita at birth year negatively affect occupational status in later life?
  21. … the current workflow
 
    bryr  AGE  OCCHISCO  hiscocode  hiscam  gdppc
    1870  21   98560     9-85.55    48.70   1694.525258
    1870  21   99120     9-99.10    47.88   1694.525258
    1873  18   53220     5-32.10    51.65   1841.878773
    1870  21   13210     1-30.00    77.29   1694.525258
    1873  18   54010     5-40.90    53.27   1841.878773
    1874  17   61110     6-11.10    52.61   1853.715852
 
 Link occupations in census micro data…
 … to standardised occupations …
 … to appropriate occupational status scores …
 … to country level GDP at birth year
 
    1. Gather and enter own data
    2. Find data on multiple repositories
    3. Download
    4. Clean and reshape
    5. Merge
    6. Clean and reshape…
    7. Analyse
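 The linking chain on this slide (census occupation code, to standardised HISCO code, to HISCAM status score, to GDP per capita at birth year) can be sketched as a series of lookups. This is only an illustration under stated assumptions: the lookup tables are filled from the sample rows on the slide, and the names `link_record`, `OCCHISCO_TO_HISCO`, `HISCO_TO_HISCAM` and `GDPPC_BY_YEAR` are hypothetical, not part of any tool mentioned in the talk.

```python
# Lookup tables filled from the sample rows on the slide.
# In the manual workflow each would come from a separately downloaded file.
OCCHISCO_TO_HISCO = {
    98560: "9-85.55",
    99120: "9-99.10",
    53220: "5-32.10",
    13210: "1-30.00",
    54010: "5-40.90",
    61110: "6-11.10",
}
HISCO_TO_HISCAM = {
    "9-85.55": 48.70,
    "9-99.10": 47.88,
    "5-32.10": 51.65,
    "1-30.00": 77.29,
    "5-40.90": 53.27,
    "6-11.10": 52.61,
}
GDPPC_BY_YEAR = {1870: 1694.525258, 1873: 1841.878773, 1874: 1853.715852}


def link_record(record):
    """Attach HISCO code, HISCAM score and GDP per capita at birth year.

    Values that cannot be linked are left as None, so gaps stay visible.
    """
    hisco = OCCHISCO_TO_HISCO.get(record["OCCHISCO"])
    return {
        **record,
        "hiscocode": hisco,
        "hiscam": HISCO_TO_HISCAM.get(hisco),
        "gdppc": GDPPC_BY_YEAR.get(record["bryr"]),
    }


census = [
    {"bryr": 1870, "AGE": 21, "OCCHISCO": 98560},
    {"bryr": 1873, "AGE": 18, "OCCHISCO": 53220},
    {"bryr": 1874, "AGE": 17, "OCCHISCO": 61110},
]
linked = [link_record(r) for r in census]
```

 Every lookup table here stands for one "find, download, clean, merge" round in the seven-step list, which is why the manual workflow becomes expensive as soon as more sources are involved.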
  22. … the current workflow
  28. … the current workflow
 Not a very complicated research question…
 … only one sample …
 What if we want to answer more involved questions?
  29. "Studies that have plotted data set size against the number of data sources reliably uncover a skewed distribution. Well-organized big science efforts featuring homogenous, well-organized data represent only a small proportion of the total data collected by scientists. A very large proportion of scientific data falls in the long-tail of the distribution, with numerous small independent research efforts yielding a rich variety of specialty research data sets. The extreme right portion of the long tail includes data that are unpublished; such as siloed databases, null findings, laboratory notes, animal care records, etc. These dark data hold a potential wealth of knowledge but are often inaccessible to the outside world."
  30. In the fast-moving data analysis industry, real-time traceability could help identify supply chain, brand and reputational risks
  31. Our Goals
 • Empower individual researchers to
    • Code and harmonize individual datasets according to the best practices of the community (e.g. HISCO, SDMX, World Bank) or against their colleagues' data
    • Share their own code lists with fellow researchers
    • Align code lists across datasets
    • Publish their standards-compliant datasets
    • Perform analyses across multiple datasets at the same time
 • While tracking provenance of both data and analyses
  32. A Linked Data Handbook for Historians? Nah…
  34. [Architecture diagram: Structured Data Hub. Labels: External Datasets; External (Meta) Data (includes both external Linked Data and standard vocabularies, e.g. World Bank); Existing Variables & Codes; Frequency Table; Variables (variable exists / variable does not yet exist); Mappings; Augment; Publish; Provenance tracking of all data; Linked Statistical Dimensions]
  35. Dedicated Pipelines NAPP
  41. Source tables:
 
    surname  age  occupation      sex
    Fumes    20   cigar maker     female
    Bridges  45   civil engineer  female
    Moves    17   dancer          male
 
    achternaam  leeftijd  beroep        geslacht
    Fumes       20        sigarenmaker  v
    Bridges     45        ingenieur     v
    Moves       17        danser        m
 
 With the sex column coded as sdmx:Sex:
 
    achternaam  leeftijd  beroep        sdmx:Sex
    Fumes       20        sigarenmaker  sdmx:F
    Bridges     45        ingenieur     sdmx:F
    Moves       17        danser        sdmx:M
 
    surname  age  occupation      sdmx:Sex
    Fumes    20   cigar maker     sdmx:F
    Bridges  45   civil engineer  sdmx:F
    Moves    17   dancer          sdmx:M
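 The harmonisation shown on this slide, where `sex` and `geslacht` both become `sdmx:Sex` with shared codes, can be sketched as a code-list lookup. This is a minimal illustration, not how QBer implements it; `SEX_COLUMNS`, `SEX_CODES` and `harmonise` are hypothetical names, and the codes `sdmx:F` and `sdmx:M` are copied from the slide as written.

```python
# Local column names that should be aligned to the shared dimension.
SEX_COLUMNS = {"sex", "geslacht"}

# Code list: local values mapped to the shared codes shown on the slide.
SEX_CODES = {
    "female": "sdmx:F",  # English dataset
    "male": "sdmx:M",
    "v": "sdmx:F",       # Dutch dataset (vrouw)
    "m": "sdmx:M",       # Dutch dataset (man)
}


def harmonise(row):
    """Return a copy of row with its sex column renamed and recoded.

    All other columns are passed through unchanged.
    """
    out = {}
    for column, value in row.items():
        if column in SEX_COLUMNS:
            out["sdmx:Sex"] = SEX_CODES[value]
        else:
            out[column] = value
    return out


english = {"surname": "Fumes", "age": 20, "occupation": "cigar maker", "sex": "female"}
dutch = {"achternaam": "Moves", "leeftijd": 17, "beroep": "danser", "geslacht": "m"}
harmonised = [harmonise(english), harmonise(dutch)]
```

 Only the coded column is aligned in this sketch; the remaining column names stay in their source language, as they do on the slide.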
  42. Utrecht 1829 Utrecht 1839
  44. An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment (things like air, water and mineral soil), interacting as a system. - Wikipedia
  46. … the current workflow
 Does GDP per capita at birth year negatively affect occupational status in later life?
 
 [Plots, Canada: log(hiscam) against age; log(hiscam) against log(gdppc)]
 
                  log(hiscam)  log(hiscam)
    (Intercept)   4.420***     3.616***
                  (0.039)      (0.134)
    log(gdppc)    -0.058***    0.036**
                  (0.005)      (0.018)
    I(age^2)                   -0.000***
                               (0.000)
    age                        0.007***
                               (0.000)
    R2            0.003        0.013
    Adj. R2       0.003        0.012
    Num. obs.     36201        36201
    RMSE          0.142        0.142
 
 Identify locally, extrapolate globally?
  51. … the new workflow
 Does GDP per capita at birth year negatively affect occupational status in later life?
 
    1. Discover data on datalegend
    2. Explore
    3. Build or reuse a query
    4. Analyse
 
 http://data.socialhistory.org/resource/napp/OCCHISCO/54020
 http://yasgui.org
 http://grlc.clariah-sdh.eculture.labs.vu.nl/clariah/wp4-queries/api-docs
  52. … the new workflow Does GDP per capita at birth year negatively affect occupational status in later life?
  54. … the new workflow
 Does GDP per capita at birth year negatively affect occupational status in later life?
 
                  canada      sweden
    (Intercept)   3.616***    4.430***
                  (0.134)     (0.033)
    log(gdppc)    0.036**     -0.070***
                  (0.018)     (0.004)
    I(age^2)      -0.000***   -0.000***
                  (0.000)     (0.000)
    age           0.007***    0.001***
                  (0.000)     (0.000)
    R2            0.013       0.021
    Adj. R2       0.012       0.021
    Num. obs.     36201       275127
    RMSE          0.142       0.102
 
 [Plots, Canada and Sweden: log(hiscam) against age; log(hiscam) against log(gdppc)]
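 Reading the coefficient rows on this slide, the model fitted per country appears to have the form below. The notation is an assumption and is not taken from the slides: the index i stands for an individual, the beta terms for the reported coefficients, and epsilon for the error term.

```latex
\[
\log(\mathit{hiscam}_i) = \beta_0
  + \beta_1 \log(\mathit{gdppc}_i)
  + \beta_2\, \mathit{age}_i^{2}
  + \beta_3\, \mathit{age}_i
  + \varepsilon_i
\]
```

 Under this reading, the estimate of the GDP coefficient is 0.036 for Canada and -0.070 for Sweden, so the sign of the effect differs between the two samples.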
  55. … the new workflow Does GDP per capita at birth year negatively affect occupational status in later life?
  59. Discussion
 • Data-driven research in the humanities is too expensive and confined to single datasets.
 • Linked Data can be a solution, but historians cannot be expected to change their current workflow or craft RDF by hand.
 • QBer allows historians to upload their data and connect it to earlier work by peers, while preserving the provenance of their steps.
 • The inspector view gives instant feedback on the impact on the network.
 • Standard SPARQL queries are converted to APIs through grlc.
 • Research questions can thus be shared, replicated and applied to new data.
 • This gives rise to different roles for researchers in our ecosystem.
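 Preserving the provenance of each step, as the discussion puts it, could in its simplest form mean logging what went into a transformation and what came out. The sketch below is an illustration under stated assumptions and not QBer's actual mechanism: `fingerprint` and `apply_with_provenance` are hypothetical helpers, and the field names `activity`, `used` and `generated` are only loosely modelled on W3C PROV terminology.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 digest of a list of JSON-serialisable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_with_provenance(rows, step_name, transform, log):
    """Apply transform to every row and append a provenance record to log."""
    result = [transform(row) for row in rows]
    log.append({
        "activity": step_name,            # what was done
        "used": fingerprint(rows),        # digest of the input data
        "generated": fingerprint(result)  # digest of the output data
    })
    return result


log = []
rows = [{"surname": "Fumes", "sex": "female"}]
coded = apply_with_provenance(
    rows,
    "code sex against shared code list",
    lambda r: {"surname": r["surname"], "sdmx:Sex": "sdmx:F"},
    log,
)
```

 Chaining several such calls yields a log in which the `generated` digest of one step matches the `used` digest of the next, so a result can be traced back to the original upload.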