This document discusses mining data availability statements from publications in Europe PMC to find statements about genomic data from genome-wide association studies (GWAS). It describes how the GWAS Catalog contains over 4,000 publications and 7,600 studies linking genetic variants to traits. Machine learning has improved the efficiency of identifying relevant publications for the catalog compared to manual searching. Data availability statements commonly mention making data publicly available in repositories like dbGaP and EGA which are cited in millions of publications.
2. GWAS and the GWAS Catalog
• GWAS analyse variants across the genome to identify loci associated with a disease or phenotype
• GWAS Catalog data
- Study metadata, including trait and sample information
- Publication information
- Results: lead associations and summary statistics
3. GWAS Catalog content
As of October 2019
• 4,220 publications
• 7,661 studies
• 157,336 variant-trait associations
• 276 publications with summary statistics, >8,000 datasets
www.ebi.ac.uk/gwas
4. What is Europe PMC?
Europe PMC: a free digital archive of biomedical and life sciences research publications
5. Content in Europe PMC
Europe PMC is a partner in PubMed Central International
11. <title> and XML path
Top 10 combinations of <title> content containing “data” and XML path
Title | XML path | Frequency
Data Availability | article:front:notes | 90,928
Data accessibility | article:back:sec | 2,694
Data Availability | article:back:sec:fn-group | 2,580
Data | article:body:sec | 2,265
Availability of supporting data | article:body:sec | 1,593
Major datasets | article:back:sec:sec | 1,074
Database survey | article:body:sec | 986
Extended Data | article:body:sec | 851
Data availability | article:body:sec | 795
Extended Data Figure 1 | article:body:sec:SecTag:fig | 689
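A tally like the one in this table could be produced, in outline, by walking each article's XML and recording every <title> whose text contains "data" together with the path of its ancestor elements. The sketch below is illustrative only: the function name `title_path_counts`, the sample article, and the colon-joined path format are assumptions, and real full-text XML may carry namespaces or extra wrapper elements that would need handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def title_path_counts(xml_text, keyword="data"):
    """Count (title text, ancestor path) pairs for <title> elements
    whose text contains `keyword` (case-insensitive)."""
    counts = Counter()
    root = ET.fromstring(xml_text)

    def walk(elem, ancestors):
        path = ancestors + [elem.tag]
        for child in elem:
            if child.tag == "title":
                # Join nested text so inline markup inside the title is kept.
                text = "".join(child.itertext()).strip()
                if keyword.lower() in text.lower():
                    counts[(text, ":".join(path))] += 1
            else:
                walk(child, path)

    walk(root, [])
    return counts


# Hypothetical miniature article, not taken from Europe PMC.
SAMPLE = """<article>
  <front><notes><title>Data Availability</title><p>Data are in dbGaP.</p></notes></front>
  <body>
    <sec><title>Methods</title><p>...</p></sec>
    <sec><title>Data availability</title><p>...</p></sec>
  </body>
  <back><sec><title>Data accessibility</title></sec></back>
</article>"""

counts = title_path_counts(SAMPLE)
for (title, path), n in counts.most_common():
    print(f"{title}\t{path}\t{n}")
```

Summing the counters over many articles and taking `most_common(10)` would give a top-10 list of the same shape as the table.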
15. GWAS Catalog literature identification: query-based vs machine learning
Metric | Query-based | Machine learning
Precision | 6% | 27%
Recall | 100% | 96%
Improved efficiency
• 80% reduction in publications to review (average 144 to 30/week)
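The weekly workload figures can be cross-checked against the precision and recall values with a short calculation. This is only a consistency check, not the method used in the cited paper; it assumes both searches run over the same stream of publications and that the query-based search has 100% recall, and the function name `weekly_review_load` is made up for illustration.

```python
def weekly_review_load(baseline_reviewed, baseline_precision,
                       new_precision, new_recall):
    """Estimate how many publications a curator reviews per week
    after switching to a search with the given precision and recall."""
    # Eligible publications per week (baseline recall assumed to be 100%).
    eligible = baseline_reviewed * baseline_precision
    # Eligible publications that the new search still captures.
    captured = eligible * new_recall
    # Total returned = true positives / precision.
    return captured / new_precision


ml_load = weekly_review_load(144, 0.06, 0.27, 0.96)
reported_reduction = 1 - 30 / 144

print(f"Estimated ML review load: {ml_load:.1f} publications/week")
print(f"Reduction from 144 to 30/week: {reported_reduction:.0%}")
```

Under these assumptions the estimate comes out at roughly 31 publications per week, in line with the reported 30, and the drop from 144 to 30 is about 79%, which matches the quoted 80% reduction.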
16. Summary statistics in the GWAS Catalog by publication year
% of publications with summary statistics over time & in the whole Catalog
20. GWAS Catalog literature identification
• Previously used a manual query-based search
- Query: genomewide OR genome wide OR genome-wide OR GWAS
• Now replaced with a machine learning based search
- Convolutional neural net trained on a corpus of GWAS Catalog publications
- Collaboration with Zhiyong Lu’s group (Lee et al., PMID 30102703, PLoS Comput Biol)
• ML results triaged by a curator in a custom PubTator interface
21. Old literature search and triage process
• Manual search in PubMed
- Query: genomewide OR genome wide OR genome-wide OR GWAS
• Curator assesses each publication for eligibility for inclusion in GWAS Catalog
- Specific eligibility criteria: https://www.ebi.ac.uk/gwas/docs/methods/criteria
- Genome wide association study of >100,000 variants distributed genome
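The keyword query on this slide can be approximated locally with a regular expression, which also helps show why it retrieves many ineligible papers. This is a rough sketch: PubMed's own query handling (field mapping, phrase matching) differs from a plain regex, and the example titles and the name `matches_gwas_query` are invented for illustration.

```python
import re

# Approximates: genomewide OR genome wide OR genome-wide OR GWAS
GWAS_QUERY = re.compile(r"\b(genomewide|genome[ -]wide|gwas)\b", re.IGNORECASE)


def matches_gwas_query(text):
    """Return True if the text contains any of the query terms."""
    return GWAS_QUERY.search(text) is not None


# Hypothetical titles; the second is a match but not an association study.
titles = [
    "A genome-wide association study of asthma",
    "Genome-wide expression profiling in yeast",
    "Candidate gene study of BRCA1 variants",
    "GWAS of adult height",
]

hits = [t for t in titles if matches_gwas_query(t)]
for t in hits:
    print(t)
```

Every title mentioning "genome-wide" is returned, including the expression-profiling one, so each hit still needs a curator to check it against the eligibility criteria.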
22. Machine learning search
Deep learning algorithm (convolutional neural net) trained on a corpus of GWAS Catalog publications
Figure 1. Lee et al., PMID 30102703, PLoS Comput Biol
23. GWAS Catalog machine learning literature search method
• Precision 27%
• Recall 96%
Table 3. Lee et al., PMID 30102703, PLoS Comput Biol
24. GWAS Catalog machine learning literature search method vs query-based search
Machine learning:
• Improved efficiency (80% reduction in publications to review, 144 to 30/week)
• Similar capture of eligible studies
Table 3. Lee et al., PMID 30102703, PLoS Comput Biol