MetaCrowd: Crowdsourcing
Gene Expression Metadata
Quality Assessment
Amrapali Zaveri and Michel Dumontier
@AmrapaliZ
amrapali.zaveri@maastrichtuniversity.nl
Bio-Ontologies 2017, July 24-25, 2017
BIOMEDICAL DATA ON THE WEB
BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
➤ For (re-)using this data, we need to understand the
structure of datasets and the experimental conditions under
which they were produced
➤ We require accurate, structured and complete description of
the data -- defined as metadata
➤ Good quality metadata is essential in finding, interpreting, and
reusing existing data beyond what the original investigators
envisioned
➤ Facilitates a data-driven approach by combining and analyzing
similar data to uncover novel insights or even more subtle
trends in the data
BIOMEDICAL METADATA ON THE WEB - CHALLENGES
➤ Size and complexity
➤ Quality measures
➤ Time-consuming
➤ Costly, requires experts
HYPOTHESIS
Crowdsourcing, i.e., the use of non-expert workers, can be
used to curate large-scale digital
biomedical metadata on the Web.
CROWDSOURCING - WHAT & WHY?
TIME MONEY
➤ Highly parallelizable tasks
➤ Work is broken down into
smaller — ‘micro’ — pieces
that can be solved
independently
➤ Tasks based on human skills
not easily replicable by machines
➤ Non-expert workers can perform
the tasks with a minimal
payment
Consolidated answers solve scientific problems!
RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
➤ Improve automated mining of biomedical text for annotating
diseases [1]
➤ Curation of gene-mutation relations [2]
➤ Identifying relationships between drugs and side-effects [3],
drugs and their indications [4]
➤ Annotation of microRNA functions [5].
GENE EXPRESSION OMNIBUS
➤ Unstructured
➤ Spreadsheet submission
➤ No controlled vocabulary
➤ Heterogeneity of terms
➤ Size complexity
➤ ~Billion records
Meta-analysis from GEO data
A common rejection module (CRM) for acute rejection across multiple
organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
Metadata issues:
• Missing
• Incomplete
• Inaccurate
GEO METADATA - EXAMPLE
44,000,000 key: value pairs
GEO METADATA - QUALITY PROBLEMS FOR KEYS
➤ Minor spelling discrepancies
➤ genotype/varaiation, genotype/varat,
genotype/varation, genotype/variaion,
genotype/variataion, genotype/variation
➤ Different syntactic representations
➤ age (years), age(yrs) and age_year
➤ Different terms to denote one concept
➤ disease, illness, healthy control
➤ Two different key categories in one key name
➤ disease/cell type, tissue/cell line,
treatment age
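The spelling and syntax discrepancies listed above can be partly detected automatically. The following is a minimal sketch, not the method used in MetaCrowd: it assumes that lowercasing, stripping punctuation, and a string-similarity cutoff are enough to surface candidate duplicates. The function names and the 0.8 threshold are hypothetical, and only Python's standard `re` and `difflib` modules are used.

```python
import re
from difflib import SequenceMatcher


def normalize_key(key: str) -> str:
    """Lowercase a key and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", key.lower()).strip()


def group_similar_keys(keys, threshold=0.8):
    """Greedily group keys whose normalized form is similar to a group's first member.

    The threshold is an assumed cutoff on difflib's similarity ratio (0..1).
    """
    groups = []
    for key in keys:
        norm = normalize_key(key)
        for group in groups:
            representative = normalize_key(group[0])
            if SequenceMatcher(None, norm, representative).ratio() >= threshold:
                group.append(key)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([key])
    return groups


if __name__ == "__main__":
    # Key names copied from the slide above.
    slide_keys = [
        "genotype/variation", "genotype/varation", "genotype/variaion",
        "age (years)", "age(yrs)", "age_year", "tissue/cell line",
    ]
    for group in group_similar_keys(slide_keys):
        print(group)
```

String similarity of this kind only helps with spelling and syntactic variants. Keys that use different terms for one concept (disease vs. illness) or mix two categories in one name still need human or ontology-based judgment.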
METACROWD METHODOLOGY
GEO Metadata
8 GEO keys, 5 values each:
• cell line
• disease
• gender/sex
• genotype
• strain
• time
• tissue
• treatment
Key definitions: Semanticscience Integrated Ontology (SIO)
MICRO TASKS — CROWDFLOWER
MICRO TASKS — SETTINGS
• 3 workers per task
• ‘Dynamic Judgments’ of up to 7 workers, with a confidence threshold of 0.8
• No. of gold standard questions — 60
• Min. accuracy — 80%
• 5 cents per judgment
• 10 tasks per page
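The interplay of these settings can be sketched as follows. This is an illustration under stated assumptions, not the platform's actual algorithm: it assumes that confidence is the share of worker trust behind the most popular answer, and that extra judgments are requested until the confidence threshold or the 7-worker cap is reached. The function names and the trust values are hypothetical; only the three constants come from the slide.

```python
from collections import defaultdict

MIN_JUDGMENTS = 3     # from the slide: 3 workers per task
MAX_JUDGMENTS = 7     # from the slide: dynamic judgments up to 7 workers
MIN_CONFIDENCE = 0.8  # from the slide: 0.8 confidence


def aggregate(judgments):
    """Return (top_answer, confidence) for a list of (answer, worker_trust) pairs.

    Confidence is assumed to be the fraction of total trust behind the top answer.
    Trust values are assumed to be positive.
    """
    totals = defaultdict(float)
    for answer, trust in judgments:
        totals[answer] += trust
    top = max(totals, key=totals.get)
    return top, totals[top] / sum(totals.values())


def needs_more_judgments(judgments):
    """Decide whether another worker should be asked (assumed stopping rule)."""
    if len(judgments) < MIN_JUDGMENTS:
        return True
    if len(judgments) >= MAX_JUDGMENTS:
        return False
    _, confidence = aggregate(judgments)
    return confidence < MIN_CONFIDENCE


if __name__ == "__main__":
    # Hypothetical judgments for one key, with equal trust for every worker.
    votes = [("tissue", 1.0), ("tissue", 1.0), ("cell line", 1.0)]
    print(aggregate(votes))
    print(needs_more_judgments(votes))
```

Under this rule a unanimous task stops after three workers, while a contested one keeps collecting judgments until seven workers have answered.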
RESULTS OVERVIEW
No. of microtasks (keys): 1643
Total no. of workers: 145
Total no. of judgments: 7835
Overall accuracy: 0.934
No. of gold standard questions: 60
Accuracy on gold standard questions: 0.930
Total cost: $451
Total time: 1 hour
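As a rough check on how the cost relates to the task settings, the per-judgment payment can be multiplied out. This is a sketch using only figures from the slides (7835 judgments, 5 cents per judgment, $451 total). Whether the judgment count includes gold standard questions, and what makes up the remainder, is not stated in the slides, so the remainder is reported without attributing it to anything.

```python
# Figures taken from the slides; amounts are kept in integer cents
# to avoid floating-point rounding.
JUDGMENTS = 7835
CENTS_PER_JUDGMENT = 5
TOTAL_COST_CENTS = 451 * 100

# Amount paid directly for judgments at the stated rate.
payments = JUDGMENTS * CENTS_PER_JUDGMENT

# Part of the reported total that the slides do not itemize.
remainder = TOTAL_COST_CENTS - payments

print(f"Judgment payments: ${payments / 100:.2f}")
print(f"Not itemized in the slides: ${remainder / 100:.2f}")
```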
RESULTS FOR EACH KEY CATEGORY
Key category | No. of keys | True positives | False positives | Accuracy
Cell line    | 109         | 711            | 21              | 0.955
Disease      | 85          | 412            | 10              | 0.937
Gender       | 72          | 645            | 23              | 0.902
Genotype     | 112         | 566            | 10              | 0.984
Strain       | 181         | 788            | 4               | 0.966
Time         | 698         | 2489           | 120             | 0.908
Tissue       | 145         | 567            | 6               | 0.947
Treatment    | 242         | 846            | 49              | 0.944
RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
Workers classified incorrectly for:
• Cell line: cell line initiation date, cell line source age
• Disease: diseasestatus
• Gender: cell sex
• Strain: strain ID
• Tissue: tissue & age, tissue/development stage
CONCLUSIONS & LIMITATIONS
• Crowdsourcing, i.e., the use of non-expert workers, can be used to curate
large-scale digital gene expression metadata on the Web.
• Several keys did not achieve consensus among the workers,
due to one of the following:
  • lack of semantically annotated values
  • ambiguous nomenclature of keys as well as values
  • values indicating that a key belongs to more than one
  category
  • inconsistent usage of the particular metadata key
CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
• Perform crowdsourcing on values and key: value pairs
• Implement a semi-automated approach to identify similar keys
using ontologies
• Design a pipeline that combines semi-automated methods,
crowdsourcing, and experts
REFERENCES
[1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. in
Biocomputing 2015, 282–293 (World Scientific, 2014).
[2] Burger, J. D. et al. Hybrid curation of gene–mutation relations
combining automated extraction and crowdsourcing. Database
2014, bau094 (2014).
[3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B.
Ranking adverse drug reactions with crowdsourcing. J. Med.
Internet Res. 17, e80 (2015).
[4] Khare, R. et al. Scaling drug indication curation through
crowdsourcing. Database 2015, bav016 (2015).
[5] Vergoulis, T. et al. mirPub: a database for searching microRNA
publications. Bioinformatics 31, 1502–1504 (2015).
THANK YOU!
QUESTIONS?
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment

  • 1. MetaCrowd: Crowdsourcing Gene Expression Metadata Quality Assessment. Amrapali Zaveri and Michel Dumontier. @AmrapaliZ, amrapali.zaveri@maastrichtuniversity.nl. Bio-ontologies 2017, July 24-25, 2017
  • 2. BIOMEDICAL DATA ON THE WEB
  • 3. BIOMEDICAL METADATA ON THE WEB - SIGNIFICANCE
      ➤ To use or reuse this data, we need to understand the structure of the datasets and the experimental conditions under which they were produced
      ➤ We require an accurate, structured and complete description of the data, which is what we define as metadata
      ➤ Good-quality metadata is essential for finding, interpreting, and reusing existing data beyond what the original investigators envisioned
      ➤ It facilitates a data-driven approach: combining and analyzing similar data to uncover novel insights or more subtle trends in the data
  • 4. BIOMEDICAL METADATA ON THE WEB - CHALLENGES: SIZE (complexity); QUALITY (measures); TIME-consuming; COSTLY (requires experts)
  • 5. HYPOTHESIS: Crowdsourcing (i.e. non-expert workers) can be used to curate large-scale digital biomedical metadata on the Web.
  • 6. CROWDSOURCING - WHAT & WHY? (TIME, MONEY)
      ➤ Highly parallelizable tasks
      ➤ Work is broken down into smaller (‘micro’) pieces that can be solved independently
      ➤ Tasks are based on human skills not easily replicable by machines
      ➤ Non-expert workers can perform the tasks for a minimal payment
      Consolidated answers solve scientific problems!
  • 7. RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
      ➤ Improving automated mining of biomedical text for annotating diseases [1]
      ➤ Curation of gene-mutation relations [2]
      ➤ Identifying relationships between drugs and side effects [3], and between drugs and their indications [4]
      ➤ Annotation of microRNA functions [5]
  • 8. GENE EXPRESSION OMNIBUS
      ➤ Unstructured
      ➤ Spreadsheet submission
      ➤ No controlled vocabulary
      ➤ Heterogeneity of terms
      ➤ Size complexity
      ➤ ~Billion records
  • 9. Meta-analysis from GEO data
      "A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation", Khatri et al., JEM 210 (11): 2205; DOI: 10.1084/jem.20122709
      Metadata issues: missing, incomplete, inaccurate
  • 10. GEO METADATA - EXAMPLE: 44,000,000 key: value pairs
  • 11. GEO METADATA - QUALITY PROBLEMS FOR KEYS
      ➤ Minor spelling discrepancies: genotype/varaiation, genotype/varat, genotype/varation, genotype/variaion, genotype/variataion, genotype/variation
      ➤ Different syntactic representations: age (years), age(yrs) and age_year
      ➤ Different terms to denote one concept: disease, illness, healthy control
      ➤ Two different key categories in one key name: disease/cell type, tissue/cell line, treatment age
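
The first two problem types on slide 11 (spelling discrepancies and differing syntactic representations) lend themselves to a simple automated pre-pass. The following is a minimal sketch, not the authors' method: it assumes a hypothetical list of canonical keys is available, and it uses Python's standard `difflib` for approximate string matching. The helper names `normalize_key` and `group_spelling_variants` and the `0.8` cutoff are illustrative choices.

```python
import difflib
import re


def normalize_key(key: str) -> str:
    """Lowercase a key and collapse whitespace, underscores and parentheses
    into single spaces, so 'age (years)' and 'age_year' become comparable."""
    key = key.lower()
    key = re.sub(r"[\s_()]+", " ", key)
    return key.strip()


def group_spelling_variants(keys, canonical, cutoff=0.8):
    """Map each raw key to its closest canonical key, or None if no
    canonical key is at least `cutoff` similar (difflib similarity ratio)."""
    mapping = {}
    for key in keys:
        matches = difflib.get_close_matches(
            normalize_key(key), canonical, n=1, cutoff=cutoff
        )
        mapping[key] = matches[0] if matches else None
    return mapping


# Spelling variants taken from the slide; the canonical list is an assumption.
variants = [
    "genotype/varaiation",
    "genotype/varat",
    "genotype/varation",
    "genotype/variaion",
    "genotype/variataion",
    "genotype/variation",
]
mapping = group_spelling_variants(variants, ["genotype/variation"])
unrelated = group_spelling_variants(["tissue"], ["genotype/variation"])
```

String similarity alone cannot resolve the other two problem types (synonyms such as disease/illness, or keys that mix two categories), which is where human judgment or an ontology lookup would be needed.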
  • 12. METACROWD METHODOLOGY
      GEO metadata: 8 GEO keys, 5 values each
      Keys: cell line, disease, gender/sex, genotype, strain, time, tissue, treatment
      Key definitions: Semanticscience Integrated Ontology (SIO)
  • 13. MICRO TASKS - CROWDFLOWER
  • 14. MICRO TASKS - SETTINGS
      - 3 workers per task
      - ‘Dynamic Judgment’ up to 7 workers, with 0.8 confidence
      - No. of gold standard questions: 60
      - Min. accuracy: 80%
      - 5 cents per judgment
      - 10 tasks per page
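
The interaction between the "3 workers per task" and "Dynamic Judgment" settings can be sketched as follows. This is a simplified illustration under stated assumptions: confidence is taken to be the unweighted share of workers agreeing with the most common answer, whereas the slide does not say how the platform computes it (it may, for example, weight workers by their accuracy on gold questions). The function name `aggregate` and the constants are hypothetical.

```python
from collections import Counter

MIN_JUDGMENTS = 3           # "3 workers per task"
MAX_JUDGMENTS = 7           # "Dynamic Judgment" upper bound
CONFIDENCE_THRESHOLD = 0.8  # stop early once agreement reaches this level


def aggregate(judgments):
    """Return (answer, confidence, n_used) for one microtask.

    `judgments` is the ordered list of worker answers. Answers are consumed
    one at a time; after the first MIN_JUDGMENTS, collection stops as soon as
    the majority share reaches CONFIDENCE_THRESHOLD, or at MAX_JUDGMENTS.
    Confidence here is plain (unweighted) agreement, which is an assumption.
    """
    used = []
    answer, confidence = None, 0.0
    for judgment in judgments[:MAX_JUDGMENTS]:
        used.append(judgment)
        if len(used) < MIN_JUDGMENTS:
            continue
        answer, count = Counter(used).most_common(1)[0]
        confidence = count / len(used)
        if confidence >= CONFIDENCE_THRESHOLD:
            break
    return answer, confidence, len(used)


unanimous = aggregate(["yes", "yes", "yes"])
extended = aggregate(["yes", "no", "yes", "yes", "yes", "yes", "yes"])
contested = aggregate(["no", "no", "yes", "yes", "yes", "yes", "yes"])
```

Under this scheme a unanimous task costs only three judgments, while a contested one uses all seven and may still end below the threshold; such tasks correspond to keys that do not reach consensus.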
  • 15. RESULTS OVERVIEW
      No. of microtasks (keys): 1643
      Total no. of workers: 145
      Total no. of judgments: 7835
      Overall accuracy: 0.934
      No. of gold standard questions: 60
      Accuracy on gold standard questions: 0.930
      Total cost: $451
      Total time: 1 hour
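
A few derived figures help put the overview numbers in context. The short calculation below uses only values reported on the slides (including the 5 cents per judgment from the settings slide); the variable names are illustrative. The slides do not itemize the total cost, so the gap between the total and the per-judgment payments is left unexplained here.

```python
n_microtasks = 1643        # keys assessed
n_workers = 145
n_judgments = 7835
price_per_judgment = 0.05  # USD, from the task settings
total_cost = 451.0         # USD, as reported

# Average number of judgments collected per key (between the 3 and 7 bounds).
judgments_per_task = n_judgments / n_microtasks

# Average number of judgments contributed by each worker.
judgments_per_worker = n_judgments / n_workers

# Payment implied by the per-judgment price alone.
base_payment = n_judgments * price_per_judgment

# Reported total cost spread over the assessed keys.
cost_per_task = total_cost / n_microtasks

print(f"judgments per task:   {judgments_per_task:.2f}")
print(f"judgments per worker: {judgments_per_worker:.2f}")
print(f"base worker payment:  ${base_payment:.2f}")
print(f"cost per key:         ${cost_per_task:.3f}")
```

An average between 3 and 7 judgments per key is what one would expect if some tasks were extended beyond the initial three workers.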
  • 16. RESULTS FOR EACH KEY CATEGORY
      Key category   No. of keys   True positive, false positive   Accuracy
      Cell line      109           711, 21                         0.955
      Disease        85            412, 10                         0.937
      Gender         72            645, 23                         0.902
      Genotype       112           566, 10                         0.984
      Strain         181           788, 4                          0.966
      Time           698           2489, 120                       0.908
      Tissue         145           567, 6                          0.947
      Treatment      242           846, 49                         0.944
  • 17. RESULTS FOR EACH KEY CATEGORY - EXAMPLES (1)
      Workers classified incorrectly for:
      - Cell line: cell line initiation date, cell line source age
      - Disease: diseasestatus
      - Gender: cell sex
      - Strain: strain ID
      - Tissue: tissue & age, tissue/development stage
  • 18. CONCLUSIONS & LIMITATIONS
      - Crowdsourcing (i.e. non-expert workers) can be used to curate large-scale digital gene expression metadata on the Web.
      - Several keys did not achieve consensus among the workers, due to one of the following:
          - lack of semantically annotated values
          - ambiguous nomenclature of keys as well as values
          - values indicating that a key belongs to more than one category
          - inconsistent usage of the particular metadata key
  • 19. CROWDSOURCING GEO METADATA QUALITY - FUTURE WORK
      - Perform crowdsourcing on values and key: value pairs
      - Implement a semi-automated approach to identify similar keys using ontologies
      - Design a pipeline that combines the semi-automated method, crowdsourcing, and experts
  • 20. REFERENCES
      [1] Benjamin, M. G., Max, N., Chunlei, W. U. & Andrew, I. S. in Biocomputing 2015 282–293. World Scientific (2014).
      [2] Burger, J. D. et al. Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing. Database 2014, bau094 (2014).
      [3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B. Ranking adverse drug reactions with crowdsourcing. J. Med. Internet Res. 17, e80 (2015).
      [4] Khare, R. et al. Scaling drug indication curation through crowdsourcing. Database 2015, bav016 (2015).
      [5] Vergoulis, T. et al. mirPub: a database for searching microRNA publications. Bioinformatics 31, 1502–1504 (2015).