Efficient & effective data management for research projects : ILRI's Data Management Platform
1. Efficient & effective data management for research projects: ILRI's Data Management Platform
Carlos Quiros
June, 2015
2. Contents
• Back in 2011
• Current status
• How we did it
• Example of a process
• CKAN
• Key decisions made
• Technology and skills required
3. Back in 2011
Survey design
• Too many
• No common indicators
• Different (<>) variables
• Different (<>) calculations
Survey implementation
• Too many tools
• No protocols
• Poor field data cleaning
• No standard process
Storage
• In files
• Too many formats
• Too many versions
• Messy data cleaning
• No accountability
Availability & accessibility
• Nothing
Now
Survey design
• Too many
• Common indicators
• Same (=) variables
• Same (=) calculations
Storage
• Server database
• No formats
• One version
• Central cleaning
• Accountability
Availability & accessibility
• CKAN
• OData
Survey implementation
• 2 tools (ODK, CSPro)
• Protocols
• Field data cleaning
• Standard process
• Standard tools
4. How we went about it
Storage
• Server database
• How to integrate ODK and CSPro?
• How to make it easy for scientists?
• How to manage user decentralization?
• Increase accountability?
Availability and accessibility
• What to use? CKAN, Dataverse, etc.
CKAN
• How to extend it to serve our purpose?
• How to integrate it with a server database?
• How to manage our metadata and vocabularies?
• How to do this?
• Data interoperability? RDF, OData, Gdata, etc?
OData
• How to do it?
Survey implementation
• Support only two tools
• Wrote protocols
• Wrote field data cleaning applications
• Wrote policies and implementation plans
• Wrote standard processes and tools for processing the data
Survey design (ongoing)
• Worked closely with teams
• Created a central place for all the surveys
• Separated surveys in modules
• Worked on common indicators
• Management supports this process
5. Example of a process
Detailed breakdown of ILRI’s RMD workflow with ODK
[Flowchart]
• Start
• Draft tool (.doc)
• Consultation
• Final tool (.doc)
• Coding (.doc to .xls)
• Testing & review (.xls)
• Uploaded to Formhub test account
• Testing & review (ODK Collect)
• Ok?
• Uploaded to Formhub project account
• Field deployment
• Data collection
• Upload data to Formhub
• End of data collection
• Create MySQL schema with ODKToMySQL (result: MySQL schema in server)
• Convert data to JSON with FormhubToJSON (result: data in JSON format)
• Upload JSON into MySQL schema with JSONToMySQL
• Initialize META in schema
• Data cleaning from server using MySQL for Excel
• Metadata for portal
• Sharing in data portal
Legend
• Who (codes): RMG Staff, Project Team Member
• S = Scientist input / usage
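The step that loads JSON submissions into a database schema can be illustrated with a minimal sketch. This is not the JSONToMySQL tool itself: it uses Python's built-in sqlite3 module as a stand-in for a MySQL server, and the table name `household`, its columns, and the sample records are all hypothetical.

```python
import json
import sqlite3

# Hypothetical submissions, shaped like flat JSON records exported from a form server.
SUBMISSIONS_JSON = """
[
  {"hhid": "HH001", "village": "A", "cattle": 4},
  {"hhid": "HH002", "village": "B", "cattle": 0}
]
"""


def load_submissions(conn, raw_json):
    """Insert each JSON record as one row; return the number of records loaded."""
    records = json.loads(raw_json)
    # Assumed schema: one table per survey, one column per question.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS household ("
        "hhid TEXT PRIMARY KEY, village TEXT, cattle INTEGER)"
    )
    # Named placeholders map JSON keys to table columns.
    conn.executemany(
        "INSERT INTO household (hhid, village, cattle) "
        "VALUES (:hhid, :village, :cattle)",
        records,
    )
    conn.commit()
    return len(records)


# sqlite3 in-memory database used only as a stand-in for the central server database.
conn = sqlite3.connect(":memory:")
loaded = load_submissions(conn, SUBMISSIONS_JSON)
total_cattle = conn.execute("SELECT SUM(cattle) FROM household").fetchone()[0]
print(loaded, total_cattle)
```

Once the records sit in one central table, cleaning and analysis can run as queries against a single version of the data rather than against many files.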
6. ILRI’s data portal (CKAN) – http://data.ilri.org/portal/
• CKAN?
• Developed by the Open Knowledge Foundation
• Most widely deployed data portal software
• USA data portal
• UK data portal
• EU data portal
• Open Africa
• What do you get out of the box?
• Create datasets with minimum metadata
• Name, Abstract, Author, Date
• Tags into controlled vocabulary
• Powerful search engine
• Public / private access to datasets
• Able to attach resources (files) to a dataset
• Data interoperability through powerful API and RDF
• Arrange datasets into organizations and topics
• What can you do by creating extensions?
• Add new vocabularies (e.g., Language, Countries, etc.)
• Add new metadata fields
• Visualize different kinds of data (e.g., maps)
• Change theme (colors, logos, fonts, etc.)
• Create data hubs by harvesting other CKANs
• Whatever else you want…
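As an illustration of the minimum metadata listed above, the sketch below builds a dataset description in the shape that CKAN's Action API accepts for `package_create` (posted to `/api/3/action/package_create` with an API key). The helper `build_dataset`, the sample values, and the choice to store the date as an `extras` entry are assumptions for illustration, not a description of how ILRI's portal is configured.

```python
import json


def build_dataset(name, abstract, author, date, tags, private=False):
    """Return a dataset dict using CKAN's core package fields."""
    return {
        "name": name,          # URL-friendly identifier of the dataset
        "notes": abstract,     # CKAN stores the abstract/description in "notes"
        "author": author,
        "private": private,    # public / private access to the dataset
        "tags": [{"name": t} for t in tags],
        # Assumption: the date is kept as a free-form extra field.
        "extras": [{"key": "date", "value": date}],
    }


# Hypothetical example values.
dataset = build_dataset(
    name="example-household-survey",
    abstract="Example abstract for a household survey dataset.",
    author="Example Author",
    date="2015-06-01",
    tags=["livestock", "household"],
)
payload = json.dumps(dataset)  # request body for the package_create action
print(payload)
```

No request is sent here; the payload is only constructed, so the sketch runs without access to a CKAN instance.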
7. Key decisions made
• Use open source for all RDM
Pros:
• Bigger pool of tools
• Flexible
• Innovation
Cons:
• Complex skill set
• Learning curve
• Relational Database Management System (RDBMS)
Pros:
• Central place
• Auditing
Cons:
• DB management skill set
• Scientists have no idea how to work with an RDBMS
• CKAN
Pros:
• There is nothing better out there
• Flexible and extensible
Cons:
• Programming in several languages is required
• Learning curve
8. Technology and skills required
• Server
• Linux (Ubuntu server) [Linux administration]
• http://www.ubuntu.com/download/server
• Database server
• MySQL – An open source database system [DB administration, SQL]
• http://www.mysql.com/
• Data processing software [Linux, C++, Python]
• ODK – A toolset for collecting data on mobile devices.
• https://opendatakit.org/
• CSPro – Software for creating data entry applications.
• https://www.census.gov/population/international/software/cspro/
• Formhub – A software tool that collects ODK data.
• https://github.com/SEL-Columbia/formhub
• ODK Tools – A toolbox for processing ODK survey data into MySQL databases.
• https://github.com/ilri/odktools
• META – A toolbox for managing research data in MySQL databases.
• https://github.com/ilri/meta
• CSProTools – A toolbox for processing CSPro survey data into MySQL databases.
• https://github.com/ilri/csprotools
• Data sharing and interoperability
• CKAN – The open source data portal software. [Linux, Python, WebDev]
• http://ckan.org/
• http://docs.ckan.org/en/latest/maintaining/installing/index.html
• http://docs.ckan.org/en/latest/extensions/index.html
• OData – Allows the creation and consumption of queryable and interoperable data resources in a simple and standard way. [Linux, Java, WebDev]
• http://www.odata.org/
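To show what "queryable" means for OData, here is a minimal sketch that builds a query URL with the standard system query options `$filter`, `$select`, and `$top`. The service root, the entity set name `Households`, and the field names are hypothetical; only the query options themselves come from the OData standard.

```python
from urllib.parse import quote, urlencode

# Hypothetical OData service root and entity set.
SERVICE_ROOT = "http://example.org/odata"
ENTITY_SET = "Households"


def build_query_url(filter_expr, fields, top):
    """Compose an OData query URL from a filter, a field list, and a row limit."""
    params = {
        "$filter": filter_expr,       # row selection, e.g. country eq 'Kenya'
        "$select": ",".join(fields),  # columns to return
        "$top": str(top),             # maximum number of rows
    }
    # Keep "$" and "," readable; percent-encode spaces and quotes.
    query = urlencode(params, safe="$,", quote_via=quote)
    return f"{SERVICE_ROOT}/{ENTITY_SET}?{query}"


url = build_query_url("country eq 'Kenya'", ["hhid", "country"], 10)
print(url)
```

Any client that speaks HTTP can issue such a request, which is what makes the data consumable from spreadsheets and statistical packages as well as from code.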