3. BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
➤ For (re-)using this data, we need to understand the
structure of datasets and the experimental conditions under
which they were produced
➤ We require an accurate, structured, and complete description of
the data, that is, metadata
➤ Good quality metadata is essential in finding, interpreting, and
reusing existing data beyond what the original investigators
envisioned
➤ Good metadata facilitates a data-driven approach: combining and
analyzing similar data to uncover novel insights or more subtle
trends in the data
4. BIOMEDICAL METADATA ON THE WEB - CHALLENGES
➤ Size and complexity
➤ Quality measures
➤ Time-consuming
➤ Costly, requires experts
6. CROWDSOURCING - WHAT & WHY?
TIME MONEY
➤ Highly parallelizable tasks
➤ Work is broken down into
smaller — ‘micro’ — pieces
that can be solved
independently
➤ Tasks based on human skills
not easily replicable by machines
➤ Non-expert workers can perform
the tasks with a minimal
payment
Consolidated answers solve scientific problems!
7. RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
➤ Improve automated mining of biomedical text for annotating
diseases [1]
➤ Curation of gene-mutation relations [2]
➤ Identifying relationships between drugs and side-effects [3],
drugs and their indications [4]
➤ Annotation of microRNA functions [5].
8. GENE EXPRESSION OMNIBUS
➤ Unstructured
➤ Spreadsheet submission
➤ No controlled vocabulary
➤ Heterogeneity of terms
➤ Size and complexity
➤ ~Billion records
9. META-ANALYSIS FROM GEO DATA
A common rejection module (CRM) for acute rejection across multiple
organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
Metadata issues:
• Missing
• Incomplete
• Inaccurate
11. GEO METADATA - QUALITY PROBLEMS FOR KEYS
➤ Minor spelling discrepancies
➤ genotype/varaiation, genotype/varat,
genotype/varation, genotype/variaion,
genotype/variataion, genotype/variation
➤ Different syntactic representations
➤ age (years), age(yrs) and age_year
➤ Different terms to denote one concept
➤ disease, illness, healthy control
➤ Two different key categories in one key name
➤ disease/cell type, tissue/cell line,
treatment age
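The first two problem types above (spelling discrepancies and differing syntactic representations) can be partly detected automatically. The following is a minimal sketch, not the method used in this work: it strips punctuation and case, then greedily groups keys by string similarity using Python's standard `difflib`. The function names and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_key(key):
    """Lowercase a key and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", key.lower())


def group_similar_keys(keys, threshold=0.8):
    """Greedily group keys whose normalized form is similar to the
    normalized form of the first key in an existing group.

    The threshold is an illustrative assumption, not a tuned value.
    """
    groups = []
    for key in keys:
        for group in groups:
            ratio = SequenceMatcher(
                None, normalize_key(key), normalize_key(group[0])
            ).ratio()
            if ratio >= threshold:
                group.append(key)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([key])
    return groups


example_keys = [
    "genotype/varaiation", "genotype/varat", "genotype/varation",
    "genotype/variaion", "genotype/variataion", "genotype/variation",
    "age (years)", "age(yrs)", "age_year",
]
example_groups = group_similar_keys(example_keys)
```

On the example keys from this slide, the sketch yields one group for the genotype/variation variants and one for the age variants. It does not address the other two problem types (different terms for one concept, or two categories in one key name), which need semantic knowledge rather than string similarity.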
14. MICRO TASKS — SETTINGS
• 3 workers per task
• ‘Dynamic Judgment’: up to 7 workers per task, until 0.8 confidence is reached
• No. of gold standard questions — 60
• Min. accuracy — 80%
• 5 cents per judgment
• 10 tasks per page
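The interplay of the worker settings above can be sketched as follows: start with 3 judgments and request more, up to 7, while the confidence in the leading answer stays below 0.8. This is a simplified illustration under stated assumptions. Here confidence is just the share of votes for the most common answer, whereas a crowdsourcing platform may weight votes by worker accuracy. The names `aggregate`, `collect_judgments`, and `ask_worker` are hypothetical.

```python
from collections import Counter


def aggregate(judgments):
    """Return (top_label, confidence), where confidence is the share of
    votes for the most common label (an assumed, unweighted definition)."""
    label, votes = Counter(judgments).most_common(1)[0]
    return label, votes / len(judgments)


def collect_judgments(ask_worker, min_judgments=3, max_judgments=7,
                      min_confidence=0.8):
    """Collect judgments for one task, requesting extra workers while
    confidence is below the threshold and the worker cap is not reached.

    `ask_worker` is a hypothetical callable returning one worker's label.
    """
    judgments = [ask_worker() for _ in range(min_judgments)]
    label, confidence = aggregate(judgments)
    while confidence < min_confidence and len(judgments) < max_judgments:
        judgments.append(ask_worker())
        label, confidence = aggregate(judgments)
    return label, confidence, len(judgments)


# Usage with a scripted sequence of answers standing in for real workers.
answers = iter(["tissue", "tissue", "age", "tissue", "tissue"])
result = collect_judgments(lambda: next(answers))
```

In the scripted example, the first three answers give 2/3 agreement, so two more workers are requested before agreement reaches 4/5 = 0.8 and collection stops.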
15. RESULTS OVERVIEW
No. of microtasks (keys): 1643
Total no. of workers: 145
Total no. of judgments: 7835
Overall accuracy: 0.934
No. of gold standard questions: 60
Accuracy on gold standard questions: 0.930
Total cost: $451
Total time: 1 hour
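A few derived figures help put these totals in perspective. The short calculation below uses only the numbers reported on this slide; the variable names are illustrative.

```python
microtasks = 1643       # keys submitted as microtasks
judgments = 7835        # total judgments collected
total_cost_usd = 451    # total cost reported on the slide

judgments_per_task = judgments / microtasks      # average workers per key
cost_per_judgment = total_cost_usd / judgments   # effective cost per judgment
cost_per_task = total_cost_usd / microtasks      # effective cost per key

print(f"Judgments per microtask: {judgments_per_task:.2f}")
print(f"Cost per judgment: ${cost_per_judgment:.3f}")
print(f"Cost per microtask: ${cost_per_task:.2f}")
```

This works out to roughly 4.77 judgments per key, about $0.058 per judgment, and about $0.27 per curated key.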
17. RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
Keys that workers classified incorrectly, by category:
• Cell line
• cell line initiation date, cell line source age
• Disease
• diseasestatus
• Gender
• cell sex
• Strain
• strain ID
• Tissue
• tissue & age, tissue/development stage
18. CONCLUSIONS & LIMITATIONS
• Crowdsourcing, i.e. non-expert workers, can be used to curate
large-scale digital gene expression metadata on the Web.
• Several keys did not achieve consensus among the
workers, due to:
• lack of semantically annotated values
• ambiguous nomenclature of keys as well as the values
• values indicating that keys belong to more than one
category
• inconsistent usage of the particular metadata key
19. CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
• Perform crowdsourcing on values and key:value pairs
• Implement a semi-automated approach to identify similar keys
using ontologies
• Design a pipeline that combines semi-automated methods,
crowdsourcing, and experts
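One possible shape for such a pipeline is a three-stage triage: resolve a key automatically when it matches a known term, accept the crowd's answer when it is confident enough, and otherwise escalate to an expert. The sketch below is purely illustrative of this idea and not a design from this work. The function `route_key`, the `known_terms` lookup (a stand-in for an ontology), and the reuse of 0.8 as the crowd confidence threshold are all assumptions.

```python
def route_key(key, known_terms, crowd_label=None, crowd_confidence=0.0,
              min_confidence=0.8):
    """Decide which stage resolves a metadata key.

    known_terms: hypothetical dict mapping normalized keys to categories,
                 standing in for an ontology lookup.
    crowd_label, crowd_confidence: aggregated crowd answer, if available.
    Returns a (stage, category) tuple.
    """
    normalized = key.strip().lower()
    if normalized in known_terms:
        # Stage 1: an exact match against known terms needs no human input.
        return ("automated", known_terms[normalized])
    if crowd_label is not None and crowd_confidence >= min_confidence:
        # Stage 2: accept the crowd's answer when agreement is high enough.
        return ("crowd", crowd_label)
    # Stage 3: everything else is escalated to an expert curator.
    return ("expert", None)


known_terms = {"tissue": "Tissue", "cell line": "Cell line"}
routes = [
    route_key("Tissue", known_terms),
    route_key("cell sex", known_terms, "Gender", 0.9),
    route_key("tissue & age", known_terms, "Tissue", 0.57),
]
```

The confidence values in the usage example are made up to exercise each branch; they are not results from the study.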
20. REFERENCES
[1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. in
Biocomputing 2015 282–293 (World Scientific, 2014).
[2] Burger, J. D. et al. Hybrid curation of gene–mutation relations
combining automated extraction and crowdsourcing. Database
2014, bau094 (2014).
[3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B.
Ranking adverse drug reactions with crowdsourcing. J. Med.
Internet Res. 17, e80 (2015).
[4] Khare, R. et al. Scaling drug indication curation through
crowdsourcing. Database 2015, bav016 (2015).
[5] Vergoulis, T. et al. mirPub: a database for searching microRNA
publications. Bioinformatics 31, 1502–1504 (2015).