Overview of the Research on Open Educational Resources for Development (ROER4D) Open Data initiative, highlighting data management principles, the five pillars of the ROER4D data publication approach and the project de-identification approach.
1. The ROER4D Open Data initiative
Michelle Willmers and Thomas King
January 2018
CC BY
2. Introduction to ROER4D
• Research on Open Educational Resources for Development project
– 18 sub-projects across 26 countries in the Global South, from Chile to
Mongolia, with 100 researchers, supported by a Network Hub team based at
the University of Cape Town and Wawasan Open University.
– Datasets in multiple languages (English, Spanish, Mongolian)
– Mostly mixed-methods data (mix of quantitative and qualitative)
• ROER4D Open Data initiative: supporting interested sub-projects in
sharing their data openly
3.
4. Unpacking the “ROER4D” project title…
Research
On Open Educational Resources (OER)
for Development
• Imperative to establish empirical baseline research on OER in Global South
• 86 researchers in 26 countries across 3 continents
• Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’
silos
• Open content (typically used in a teaching and learning
context) that can be reused, revised, remixed,
redistributed and retained
• Made possible by open licensing, although increasing
focus on differentiating implicit vs. explicit open
content
• Focus on role OER can play in improving access to quality education
• Focus on role project can play in building Global South Open Education
research capacity
• Strong advocacy and activism component (NGO, CBO sectors – not only
career researchers)
Focus on empirical baseline manifests in focus on curatorial and publishing capacity within the
research project. The project acts as publisher, providing greater agency and control (but
presenting some challenges in terms of accreditation/reward).
5. Curation & Dissemination strategy
• Provide a content management and publishing service to SP researchers and the
Network Hub team in order to advance research capacity development efforts and
increase visibility of outputs.
• Support Principal Investigators and SP researchers in editorial development of
ROER4D outputs.
• Address infrastructure deficits and provide content management solutions
(including content hosting) in a research community with uneven institutional
support and capacity challenges.
• Ensure that the ROER4D legacy is freely accessible for reuse in line with international
curatorial and publishing standards.
• Complement Network Hub Communications efforts in an integrated
communications/dissemination approach.
6. Data management principles
• Data sharing as component of generalised open content focus.
• Organising and profiling open content increases the potential for reuse and citation
(impact).
• Well-organised, strategic research management and content organisation promotes
rigour in the research process.
• Copyright vests with the author > data-sharing activity determined by their willingness
and capacity to engage.
• Format and platform/tool agnostic.
• Share openly by default on condition that it is valuable, legal and ethical.
7. Research Data Management
• Collect data: ethics clearance, methodology, instruments
• Organise data: formats, naming conventions
• Refine data: verification, validation
• Share data: de-identification, publishing, open data
• Document data: metadata, dataset description
• Store data: backup, archive, on-site storage, cloud storage
8. The two pillars of Open Data sharing
• Consensual, ethical, legal
• Comprehensible, coherent, valuable
Research Data Management & Open Data sharing
9. ROER4D project data flow
• Researcher
• Network Hub (Google, Vula)
• ROER4D archive (internal): Google, Vula, UCT eResearch Centre
• Project archive (external): Zenodo
• Publisher: DataFirst
• Internal sharing and collaboration
• External sharing and collaboration
10. Open Data terminology
• Open Data = Microdata
– Unit record data (survey data, census data)
– Interview and Focus Group transcripts
– i.e. the ‘raw material’ from which outputs, reports, publications etc. are
produced.
• Supportive documentation = Metadata
– Dataset descriptions
– Study descriptions (methods/methodology, data collection schedules)
– Data processing information (e.g. de-identification schema)
11. Terms and definitions
• Microdata (aka Unit Record Data): The information that underlies a research project’s analysis (i.e. the ‘thing’)
• Metadata: Data that describes a file or record on a database (for example, keywords, author fields, ISBNs, DOIs)
• Research Data Management (RDM): Overall term for how individuals/projects/institutions manage their data
• Data Management Plan (DMP): Outlines an individual or project’s strategy around all aspects of data management
• Curation: Organising, storing/archiving and describing data to ensure & control its long-term accessibility and usability. May include collating/concatenating from other sources
• De-identification: Removing, eliding or replacing pieces of information that reveal research participants’ (possibly also referents’) identity
• Anonymity: Personal details (identifiers) are not gathered
• Confidentiality: Personal details (identifiers) are not shared
• Curation platform: An on-premises or cloud-based storage space that contains metadata capabilities, Search Engine Optimisation, and backup capabilities
12. Why should researchers share data?
• ROER4D motivations:
– Build the empirical base for future research
– Coherent with our generally ‘open’ approach – publishing open
access outputs, actively communicating with audiences and
stakeholders, etc.
• Good practice – many research funders now require some sort of data-
sharing activity or plan
• Improve rigour
– Sharing data openly demands that the dataset is well described
and organised
– Increased scrutiny of the dataset often leads to more refined
analysis
16. Recruiting participants
• Emphasising social justice through sharing
– Sharing open data allows for latitudinal studies using data from multiple sites
• Emphasising personal reputation
– Sharing open data as a means of building one’s personal profile as a
researcher
• Emphasising rigour
– Sharing data openly enhances the quality of the research
17. Step 3: Source sub-project micro-data
• Check ethics approval and consent
• Ensure first-tier de-identification takes place prior to Network Hub transfer in
order to ensure research subject confidentiality
• ROER4D agnostic in its approach (in terms of scale, format and technical
sophistication)
• Challenges of varying researcher sophistication in terms of data collection and
presentation
• Challenges of varying researcher sophistication in terms of technology employed
to capture, present, and analyse data
18. Step 4: Network Hub curation and quality assurance
• Archive in LMS and secure institutional archive
• Network Hub C&D team audits researchers’ submitted dataset
> What is the dataset comprised of?
> Are all the pieces there?
> What were the data collection processes, and do we have all the instruments to
share?
> What languages are represented?
> Does something else like it exist?
> Who might it be of use to?
• Address file naming and format issues
• Articulate sub-project-specific data management plan
19. Step 5: Preparing data for publication
• Scope and conceptualise the dataset
> Which components of the project-generated micro-data are you ethically and
legally allowed to share?
> Which components of the project-generated micro-data will you invest
resources in curating and sharing?
> Which instruments will you include?
• Identify focus of data and points of sensitivity
• Define appropriate second-tier de-identification approach
20. ROER4D data interrogation process
• Read data (returned to between each of the activities below)
• Coherence: format & layout
• Editing: fix typos & identify anomalous data
• Validation: identify and account for missing data
• De-identifying: remove identifiers
21. The de-identification balancing act
First, do no harm
Remove as much as needed to ensure the
confidentiality or anonymity of the
research participants.
Ensure that all ethical and consent
processes have been adhered to.
Don’t go overboard
Remove as little as is ethical to ensure the
richness of the data.
Take the unit of analysis as the guide – de-
identify up to the Unit of Analysis.
E.g.: If Study X compares two universities,
you can safely remove all identifiers lower
than the university affiliation.
HOWEVER
Your data may be useful to others. The
purpose of de-identification is to preserve
confidentiality – don’t de-identify for the
sake of it
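The "de-identify up to the Unit of Analysis" rule can be illustrated for tabular data with a minimal Python sketch. The field names (`university`, `department`, `respondent_name`) and the sample records are hypothetical and are not taken from any ROER4D dataset; the sketch only assumes that identifying fields can be ordered from the broadest level to the most specific.

```python
# Hypothetical identifying fields, ordered from broadest to most specific.
LEVELS = ["university", "department", "respondent_name"]


def deidentify_to_unit(records, unit_of_analysis, levels=LEVELS):
    """Drop every identifying field more specific than the unit of analysis."""
    keep_until = levels.index(unit_of_analysis)
    to_drop = set(levels[keep_until + 1:])
    return [
        {key: value for key, value in record.items() if key not in to_drop}
        for record in records
    ]


# Invented sample records for a study that compares two universities.
records = [
    {"university": "University A", "department": "Physics",
     "respondent_name": "Respondent One", "uses_oer": "yes"},
    {"university": "University B", "department": "History",
     "respondent_name": "Respondent Two", "uses_oer": "no"},
]

shared = deidentify_to_unit(records, "university")
for row in shared:
    print(row)
```

Non-identifying variables such as `uses_oer` are left untouched, which reflects the "don't go overboard" side of the balancing act: only fields below the unit of analysis are removed.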
22. ROER4D de-identification process
1. First-level de-identification by researcher
– Removal of direct identifiers (names of people/institutions/companies, ID
numbers, etc.)
– Important to ensure that raw data is not shared
2. Second-level de-identification by C&D team to catch remaining direct
identifiers
3. In-depth sweep of the text to identify indirect identifiers
– Meticulous, thorough, repeated reading of the text (which ties back to
general data enhancement)
23. Qualitative de-identification
• De-identification located in the same ecosystem as data cleaning and data
validation – no clear line between data improvement and de-identification
– Cleaning up typos
– Standardising presentation and layout
– Identifying unanswered questions (or additional questions), mislabelled
responses, etc.
• Many of these also apply to quantitative data
• Articulation of principles in RDM and description of these processes included in
metadata
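Identifying unanswered questions is one of the cleaning tasks that can be partly automated for tabular responses. The following is a minimal sketch, assuming each response is a dictionary keyed by question ID; the names `respondent_id`, `q1` and `q2`, and the list of markers treated as "missing", are illustrative assumptions rather than ROER4D conventions.

```python
def find_unanswered(responses, question_ids, missing_markers=("", "n/a", "na")):
    """Map each respondent ID to the list of questions left unanswered."""
    report = {}
    for row in responses:
        gaps = []
        for question in question_ids:
            value = row.get(question)
            # Treat absent values and known placeholder strings as missing.
            text = "" if value is None else str(value).strip().lower()
            if text in missing_markers:
                gaps.append(question)
        if gaps:
            report[row["respondent_id"]] = gaps
    return report


# Invented sample responses.
responses = [
    {"respondent_id": "R01", "q1": "yes", "q2": ""},
    {"respondent_id": "R02", "q1": "no", "q2": "sometimes"},
    {"respondent_id": "R03", "q1": " N/A ", "q2": None},
]

gaps_by_respondent = find_unanswered(responses, ["q1", "q2"])
print(gaps_by_respondent)
```

A report like this only locates the gaps. Deciding how to account for them, and describing that decision in the metadata, remains a judgement for the researcher.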
24. Qualitative de-identification example
• Raw data
– Well my name is Susan Tsvangirai, and I’m the Head of the
Anthropology department at the University of Zimbabwe. I first
started getting involved in publishing my data – see I’m the only
person in the country who works on human ecologies, well it’s me
and Ishaan at Wits, but I’m the only one locally, and I started out
using the institutional repository but it didn’t really work. It kept
timing out when I tried to upload resources. So I switched to Zenodo
which was fine but it felt a little bit sterile…
• Cleaned/processed data
– Well my name is [redacted], and I’m the Head of [my] department at
the University of Zimbabwe. I first started getting involved in
publishing my data – see I’m the only person in the country who
works [in my area], well it’s me and [a colleague] at Wits, but I’m the
only one locally, and I started out using the institutional repository
but it didn’t really work. It kept timing out when I tried to upload
resources. So I switched to Zenodo which was fine but it felt a little
bit sterile…
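Once a reader has identified the disclosive strings in a transcript, applying the replacements consistently can be scripted. The sketch below is illustrative only: the replacement map is built by hand from the fictional excerpt above, and the function names are hypothetical. It does not find identifiers on its own, so indirect identifiers still have to be surfaced by repeated close reading.

```python
import re


def apply_redactions(text, replacements):
    """Replace each known disclosive string with its placeholder."""
    # Longest strings first, so a short term never splits a longer one.
    for original in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(re.escape(original), flags=re.IGNORECASE)
        text = pattern.sub(lambda match, o=original: replacements[o], text)
    return text


def remaining_identifiers(text, replacements):
    """List any disclosive strings that are still present after cleaning."""
    lowered = text.lower()
    return [term for term in replacements if term.lower() in lowered]


# Replacement map compiled by a human reader from the fictional excerpt.
replacements = {
    "Susan Tsvangirai": "[redacted]",
    "the Anthropology department": "[my] department",
    "on human ecologies": "[in my area]",
    "Ishaan": "[a colleague]",
}

raw = ("Well my name is Susan Tsvangirai, and I'm the Head of the "
       "Anthropology department at the University of Zimbabwe.")

cleaned = apply_redactions(raw, replacements)
print(cleaned)
print(remaining_identifiers(cleaned, replacements))
```

Keeping the replacement map alongside the dataset (but not publishing it) is one way to document the de-identification schema while leaving the raw data unshared.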
25. Step 6: Publish
• Generate metadata and dataset description (accompanying narrative)
• Submit content to publisher (in ROER4D instance, DataFirst)
• Link to published outputs
• Include description of process in research Methodology statements
• Profile in project communications activity
26. Challenges
• Data collected in multiple languages
– De-identification (particularly in qualitative data) far more difficult –
greater reliance on the researcher to identify disclosive information
• Post-hoc consent process
– Departments merge or close, participants retire or disappear
• Data collected by multiple researchers
– Different collection strategies, adherence to interview schedules, use/non-
use of clarifying questions, etc.
27. Ways forward: ‘Open by design’
• Help researchers write consent forms to facilitate ethical open
data sharing.
• ‘Red flag’ clauses abound in template consent forms,
including:
– “will be used for research purposes only”
– “data will be destroyed after use”
– “only researchers will have access to the data”
• More open consent forms allow for data sharing but do not
mandate it.
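A draft consent form can be screened for the 'red flag' clauses listed above before it goes to participants. This is a minimal sketch, assuming the form is available as plain text; the phrase list is shortened from the three examples on this slide and would need extending for real templates, and the function name is hypothetical.

```python
# Phrases adapted from the red-flag examples above (not an exhaustive list).
RED_FLAGS = [
    "research purposes only",
    "data will be destroyed",
    "only researchers will have access",
]


def find_red_flags(consent_text, phrases=RED_FLAGS):
    """Return the red-flag phrases that occur in a consent form's text."""
    # Lower-case and collapse whitespace so line breaks do not hide a phrase.
    normalised = " ".join(consent_text.lower().split())
    return [phrase for phrase in phrases if phrase in normalised]


# Invented sample consent text.
sample_form = """Your responses will be used for research purposes only.
All data will be
destroyed after use."""

flags = find_red_flags(sample_form)
print(flags)
```

A match is a prompt to reword the clause, not an automatic verdict: simple phrase matching will miss paraphrased restrictions, so the form still needs to be read in full.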
28. Lessons learned
1. Openness increases rigour. Preparing data for publication promotes professional approach to
research process.
2. Preparing data for publication exposes weaknesses in instrument design and research
process.
3. Introducing C&D and data-sharing focus midway through a project poses many challenges,
particularly in terms of ethical and consent components.
4. Data sharing drives focus on reproducibility, transforming traditional approach to crafting
methodology statements.
5. The data preparation process takes time (approx. one week of researchers’ time in ROER4D
context).
6. Obtaining balance between utility and adequate protection in de-identification of qualitative
data is a challenge.
7. Openness is threatening to researchers in terms of exposing weakness in processes and
perceived threat of losing publication advantage.
8. C&D and data sharing activity require support, capacity development and resourcing.
Editor's notes
The ROER4D project, conceived in 2012 and running from 2013 to the end of 2017, was explicitly scoped with an ambition to conduct Open Research inasmuch as that proved viable and valuable. An early ambition mentioned in the scoping document was the desire to share data openly, but this process was not begun until 2015 with the elevation of Curation and Dissemination as a core project objective and the subsequent launch of the Open Data Initiative.
The graphic above shows where the ROER4D sub-projects were located and where they conducted their research activities. The research participants included high-school (secondary) and university (tertiary) students, teachers in secondary and tertiary education, government officials, and members of NGOs.
In the networked model of the ROER4D project, the Curation and Dissemination team were not involved in the gathering and validation of data, but supported the sub-projects in processing and organising their data for long-term curation and storage, and in some cases for publication and sharing as Open Data. Due to contractual requirements, the Network Hub
There are two competing influences on Open Data sharing, namely the ethical imperative – the requirement to actively inform research participants of the Open Data process and protect them from potential negative consequences – and ensuring the integrity and value of the shared dataset by not removing so much content that the final product is incomprehensible or so sparse as to lack value.
The ROER4D Open Data initiative was scoped to serve internal curation purposes (professionalising data stewardship) as well as external, public sharing of micro-data (where ethically and legally possible). The internal curation component was crucial in terms of keeping track of and curating the large amount of data produced by the 17 sub-projects, particularly as relates to the project’s meta-synthesis activities.
Microdata is the raw material that underpins the analysis of a research project. It can consist of quantitative (large-scale datasets, often represented in tabular form) or qualitative data (personal observations, field notes, interview and focus group transcripts). Metadata is the data that supports and describes microdata, and can consist of some or all of the following: dataset descriptions, study methods or methodologies, production dates, data collection and processing schema, etc.
Difference between anonymity and confidentiality: an anonymous survey contains no questions about personal identifiers; a confidential survey does contain these questions, but will not share/publish them.
While there are potential and real benefits to civil society and government from sharing Open Data, there is also a case to be made for the individual benefits accruing to researchers from sharing their data. Open data sharing is increasingly being mandated by funder institutions, particularly large national and regional funders in the Global North, and so familiarising oneself with open data principles and sharing data openly is good practice for those interested in applying to these funders. Finally, and significantly, preparing one’s data for open sharing demands deep and thorough engagement with the data, which improves the microdata and/or metadata.
As the Open Data Initiative was a voluntary activity (not mandated in the original project scoping), participants had to be persuaded to participate. The three primary methods used to encourage participation were through:
1) An appeal to the project’s overall Open Research agenda, by emphasising the value of Open Data for future studies and potentially latitudinal research
2) An appeal to the benefits accruing to contributors’ personal reputation, through the production of a citable research object (an open dataset)
3) An emphasis on the rigour-enhancement inherent in preparing a dataset for open sharing.
As the project was conceived with an explicit open agenda, much of the first strategy was implicit in the project’s general Open Research orientation. The second emphasis (personal reputation) relied on the standard practice of measuring citations as a means of measuring an academic’s public profile. Finally, the third strategy highlighted the Open Data sharing process as serving the core academic principle of ethical, rigorous research practice.
The ROER4D Network Hub conceives of data publishing as a ‘data interrogation’ process that may result in published Open Data, but still provides value even if the decision is made not to publish. The data interrogation process relies on frequently returning to read the original data in between coherence checking, editorial work, validation and verification activity, and finally de-identification. This process helps surface issues, especially indirect identifiers, which are particularly prevalent in qualitative data.
While ethical considerations and the protection of research participants must come first, part of the value of Open Data lies in the ability of other researchers to mine datasets according to different conceptual and analytical frameworks. In such instances, a de-identification approach that only retains such content as supports the original study’s analysis limits the reusability, and thus the value, of the dataset.
In quantitative data, disclosive information is typically isolated to specific variables or data values that can be identified and removed automatically. In qualitative data, however, the interplay between otherwise nondisclosive information or insights may potentially be disclosive. Therefore, more attention must be paid to identifying and removing, eliding or obfuscating these indirect identifiers, which may only be recognisable after repeated passes through the data.
The above is an excerpt from a fictional qualitative dataset with the disclosive information indicated in red. The bold text indicates an indirect identifier that, in combination with the directly disclosive information, becomes disclosive itself. The second paragraph serves as an example of one way of de-identifying this excerpt.
As a networked project ROER4D covered a vast area with different linguistic and cultural norms, and contained sub-projects with different research methodologies. This introduced complexity into the data cleaning process, made even more complicated by the fact that the Open Data Initiative had not formed part of the original scoping and therefore in some cases research participants had to be recontacted in order to gain consent for their data to be shared.