Data availability and
feasibility of validation –
A genomics case study
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
The ideal case study topic? GWAS
• Genome Wide Association Study (GWAS) summary
statistics
• Variation likelihood at large sets of locations of the human
genome for measurable traits (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
https://www.ebi.ac.uk/gwas/diagram
Each dot represents a point on the human genome that at least one
research study has found to associate with a measurable trait
Methods
• Medline search for articles that could be primary
human GWAS
"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles for
being (a) primary human GWAS and (b) publicly
sharing complete primary human GWAS summary
statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
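The Medline search above could be scripted along the following lines. This is a minimal sketch only: the slides do not say how the search was run, so the use of the NCBI E-utilities ESearch endpoint, the helper name esearch_url and the retmax value are all assumptions. The code builds the request URLs for the two years without sending them.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumed route; the slides only give the query).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string exactly as shown on the Methods slide.
QUERY = '"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]'


def esearch_url(year, retmax=5000):
    """Build (but do not send) an ESearch request for one publication year.

    esearch_url and retmax=5000 are hypothetical choices for illustration.
    """
    params = {
        "db": "pubmed",
        "term": QUERY,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmax": retmax,
    }
    return f"{ESEARCH}?{urlencode(params)}"


# One request per sampled year, matching the 2010 and 2017 restriction.
urls = {year: esearch_url(year) for year in (2010, 2017)}
for year, url in urls.items():
    print(year, url)
```

The returned PMIDs would still need the manual coding step described above, since the query only finds articles that could be primary human GWAS.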
Results
Data availability information                     2010   2017   Total   Percent
GWAS location not stated in article                156    139     295     89.4%
Broken link or not findable at stated location       3      1       4      1.2%
On request to the authors                            0      8       8      2.4%
On request via dbGaP                                 2      5       7      2.1%
On request via EGA                                   1      3       4      1.2%
On request via another portal                        0      3       3      0.9%
Free online without login, proprietary format        1      0       1      0.3%
Free online without login, plain text                0      8       8      2.4%
10.6% reported sharing GWAS summary statistics in some form
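The percentages in the table, and the 10.6% headline figure, follow directly from the counts. The short check below recomputes them; the counts are copied from the table, and the variable names are illustrative only.

```python
# (category, 2010 count, 2017 count), copied from the results table.
rows = [
    ("GWAS location not stated in article", 156, 139),
    ("Broken link or not findable at stated location", 3, 1),
    ("On request to the authors", 0, 8),
    ("On request via dbGaP", 2, 5),
    ("On request via EGA", 1, 3),
    ("On request via another portal", 0, 3),
    ("Free online without login, proprietary format", 1, 0),
    ("Free online without login, plain text", 0, 8),
]

total = sum(y2010 + y2017 for _, y2010, y2017 in rows)

# Per-category share of all classified GWAS articles.
for label, y2010, y2017 in rows:
    n = y2010 + y2017
    print(f"{label:<48} {n:>4} {100 * n / total:5.1f}%")

# Every category except "not stated" counts as reporting some form of sharing.
shared = total - (rows[0][1] + rows[0][2])
shared_pct = round(100 * shared / total, 1)
print(f"Reported sharing: {shared} of {total} = {shared_pct}%")
```

Note that the 35 articles counted as sharing include those with broken links and those sharing only on request.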
Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
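The difficulty can be illustrated with the five phrases quoted above. In the sketch below, the two matching rules and the example false-positive sentence are made up for illustration and are not the rules used in the study: a strict phrase match misses most of the descriptions, while a loose keyword rule catches all five but also fires on text that has nothing to do with sharing.

```python
import re

# The five descriptions quoted on the slide.
phrases = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Strict rule: the canonical phrase must appear verbatim.
exact = re.compile(r"GWAS summary statistics", re.IGNORECASE)

# Loose rule (hypothetical): a topic word and a data word anywhere in the text.
topic = re.compile(r"gwas|genome-wide", re.IGNORECASE)
kind = re.compile(r"summary|data", re.IGNORECASE)


def loose_match(text):
    return bool(topic.search(text) and kind.search(text))


exact_hits = [p for p in phrases if exact.search(p)]
loose_hits = [p for p in phrases if loose_match(p)]
print(f"strict rule: {len(exact_hits)} of {len(phrases)}")
print(f"loose rule:  {len(loose_hits)} of {len(phrases)}")

# Invented methods-style sentence that is not a sharing statement.
not_sharing = "GWAS data were analysed with standard quality control filters."
print("loose rule false positive:", loose_match(not_sharing))
```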
Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
• the complexity of articles (multiple data sources and
article structures)
• a lack of standardisation of terminology
• However, data availability statements help
Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
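The extraction step can be sketched as follows. This is not the Webometric Analyst implementation, which the slides do not describe. It assumes JATS-style XML in which the statement sits in a sec element, either tagged with a sec-type containing "data-availability" or carrying a matching title. Publishers vary in how they mark up these statements, so a real extractor would need more cases. The sample document is invented.

```python
import re
import xml.etree.ElementTree as ET

# Invented miniature article in JATS-like markup, for illustration only.
SAMPLE = """<article>
  <body>
    <sec><title>Methods</title><p>We genotyped the study participants.</p></sec>
    <sec sec-type="data-availability">
      <title>Data Availability</title>
      <p>All relevant data are within the paper.</p>
    </sec>
  </body>
</article>"""

# Section titles that suggest a data sharing statement (assumed list).
TITLE_RE = re.compile(r"data\s+(availability|sharing|accessibility)", re.IGNORECASE)


def extract_statements(xml_text):
    """Return the text of every section that looks like a data availability statement."""
    root = ET.fromstring(xml_text)
    found = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        tagged = "data-availability" in sec.get("sec-type", "")
        if tagged or TITLE_RE.search(title):
            # Join paragraph text and collapse whitespace left by the markup.
            paras = [" ".join("".join(p.itertext()).split()) for p in sec.iter("p")]
            found.append(" ".join(paras))
    return found


print(extract_statements(SAMPLE))
```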
Results – how is data shared?
Almost all papers with a data sharing statement (D.S.S.) claim
to share data.
Standardised wordings are common,
e.g. “All relevant data are within
the paper.”
Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: how is the data shared?
• 89% accurate for: is all the data shared? (binary)
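To make the binary task concrete, here is a toy linear SVM for "is all the data shared?". It is a sketch under several assumptions. The slides do not give the features, software or training data, so the bag-of-words features, the Pegasos-style subgradient training, the parameter values and all example statements below are invented for illustration. It is not meant to reproduce the reported accuracies.

```python
def tokens(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return {w for w in words if w}


def train_pegasos(examples, lam=0.001, epochs=20):
    """Linear SVM (hinge loss + L2 penalty) via Pegasos-style subgradient steps.

    Weights live in a dict keyed by word; there is no bias term.
    lam and epochs are arbitrary illustrative values.
    """
    w = {}
    t = 0
    for _ in range(epochs):
        for text, y in examples:
            t += 1
            feats = tokens(text)
            margin = y * sum(w.get(f, 0.0) for f in feats)
            # Regularisation step: shrink every weight by (1 - 1/t).
            for f in w:
                w[f] *= 1.0 - 1.0 / t
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1:
                step = y / (lam * t)
                for f in feats:
                    w[f] = w.get(f, 0.0) + step
    return w


def predict(w, text):
    """+1 = statement says all data is shared, -1 = it does not."""
    score = sum(w.get(f, 0.0) for f in tokens(text))
    return 1 if score > 0 else -1


# Invented training statements; the labels are the editor's, not the study's.
TRAIN = [
    ("All relevant data are within the paper.", 1),
    ("Data are available on request from the authors.", -1),
    ("All data are in the supporting information files.", 1),
    ("Data cannot be shared publicly and are available on request.", -1),
]

w = train_pegasos(TRAIN)
print(predict(w, "All data are within the manuscript."))
print(predict(w, "Data are available from the authors on reasonable request."))
```

With standardised wordings as common as the slides report, even word-presence features separate these toy examples, which is consistent with a simple SVM performing well on real statements.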
Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.

TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 

Data availability Study

  • 1. Data availability and feasibility of validation – A genomics case study
      Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma Stuart, Meiko Makita, Verena Weigert, Chris Keene, Nushrat Khan, Katie Drax, Kayvan Kousha
      University of Wolverhampton, University of Bristol & UK Reproducibility Network & JISC
  • 2. Data sharing experiment goals
      • Find out how often data is shared in a field with apparently ideal conditions
      • Write a program to automatically identify shared data of a specified type
      • Write a program to validate the quality of shared data of a specified type
      • As a step towards more general automatic shared data discovery and quality control
  • 3. The ideal case study topic? GWAS
      • Genome Wide Association Study (GWAS) summary statistics
      • Variation likelihood at large sets of locations of the human genome for measurable traits (e.g. disease susceptibility)
      • Data is high value and expensive to collect
      • Often stored in a standard format for internal sharing by consortia
      • An international repository exists for hosting it, emphasising its importance
      • NHGRI-EBI Catalog of published genome-wide association studies
      • Meta-analyses benefit from shared files – increased power and population triangulation
      • Genomics has a reputation for data sharing
  • 4. https://www.ebi.ac.uk/gwas/diagram
      Each dot represents a point on the human genome that at least one research study has found to associate with a measurable trait
  • 5. Methods
      • Medline search for articles that could be primary human GWAS:
        "Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
      • Restriction to 2010 and 2017 to identify trends
      • Three human coders classified 1799 articles for being (a) primary human GWAS and (b) publicly sharing complete primary human GWAS summary statistics
      • MT and MM follow-up checks of results
      https://www.biorxiv.org/content/10.1101/622795v1
  • 6. Results
      Data availability information                     2010  2017  Total  Percent
      GWAS location not stated in article                156   139    295    89.4%
      Broken link or not findable at stated location       3     1      4     1.2%
      On request to the authors                            0     8      8     2.4%
      On request via dbGaP                                 2     5      7     2.1%
      On request via EGA                                   1     3      4     1.2%
      On request via another portal                        0     3      3     0.9%
      Free online without login, proprietary format        1     0      1     0.3%
      Free online without login, plain text                0     8      8     2.4%
      10.6% reported sharing GWAS summary statistics in some form
  • 7. Article descriptions of the availability of GWAS summary statistics
      • Usually in a Data Availability article section (26 out of 35).
      • Data availability more difficult to identify from the methods (4 articles) and results (3 articles).
      • Only five data sharing statements described the shared data as GWAS summary statistics, and all five used different phrases:
        “full GWAS summary statistics”, “Case Oncoarray GWAS data”, “Summary GWAS estimates”, “Summary statistics for the genome-wide association study”, “genome-wide set of summary association statistics”
      • Descriptions are therefore hard to automatically identify from articles.
  • 8. Conclusions
      • Data sharing is unlikely to become near-universal when it is optional.
      • Policy initiatives or mandates are needed to promote data sharing.
      • Automatically identifying shared data is difficult or impossible in practice because of:
        • the complexity of articles (multiple data sources and article structures)
        • a lack of standardisation of terminology
      • However, data availability statements help.
  • 9. Follow-up study: Investigating data availability statements
      • A program was written to extract data sharing statements from full text articles in XML
      • Free software Webometric Analyst (http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full text > Data availability statements extract
      • Manual content analysis for types of information in extracted PMC Open Access Subset data availability statements (n=500)
      • Test machine learning for classifying data sharing methods from data availability statements
  • 10. Result - how is data shared? Almost all papers with data sharing statements (D.S.S.) claim to share data. Standardised wordings are common, e.g., “All relevant data are within the paper.”
  • 11. Results – what data is shared? 38% of data sharing statements specify that all data is shared
  • 12. Results – why is data [not] shared? 91% of data sharing statements give no explanation for their data sharing policy
  • 13. Machine learning
      • Simple support vector machines (SVM) test for detecting sharing methods from data sharing statements
      • 87% accurate for: How is the data shared
      • 89% accurate for: Is all the data shared (binary)
  • 14. Summary
      • Data sharing seems to need mandates to become widespread, even in otherwise best case fields
      • Shared data is hard to detect precisely because of article complexity and language variation.
      • Basic information about whether data is shared and where can be extracted automatically from data availability statements.
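
The extraction step described on slide 9 (pulling data sharing statements out of full-text XML) might look roughly like the sketch below. It is not the Webometric Analyst implementation. It assumes JATS-style markup and only two tagging patterns (a `custom-meta` entry named "Data Availability" and a `sec` element with `sec-type="data-availability"`); real articles may tag the statement differently. The sample XML and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-style fragment. Real articles vary in how they tag the
# statement, so the two patterns handled below are assumptions, not a full list.
SAMPLE_XML = """
<article>
  <front>
    <article-meta>
      <custom-meta-group>
        <custom-meta>
          <meta-name>Data Availability</meta-name>
          <meta-value>All relevant data are within the paper.</meta-value>
        </custom-meta>
      </custom-meta-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="data-availability">
      <title>Data availability</title>
      <p>Summary statistics are available on request.</p>
    </sec>
  </body>
</article>
"""


def text_of(element):
    """Return all text inside an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract_statements(xml_string):
    """Collect candidate data availability statements from one article."""
    root = ET.fromstring(xml_string)
    statements = []
    # Pattern 1 (assumed): <custom-meta> entries named "Data Availability".
    for meta in root.iter("custom-meta"):
        name = meta.findtext("meta-name", default="")
        value = meta.find("meta-value")
        if "data availability" in name.lower() and value is not None:
            statements.append(text_of(value))
    # Pattern 2 (assumed): a section flagged sec-type="data-availability".
    for sec in root.iter("sec"):
        if sec.get("sec-type") == "data-availability":
            paragraphs = [text_of(p) for p in sec.iter("p")]
            statements.append(" ".join(paragraphs))
    return statements


statements = extract_statements(SAMPLE_XML)
for statement in statements:
    print(statement)
```

Statements found in the methods or results sections, which slide 7 reports as harder to identify, would not be caught by either pattern.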

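The machine learning test on slide 13 is described only as a "simple support vector machines (SVM) test", so the following is a minimal sketch of that kind of classifier rather than the study's setup. The use of scikit-learn, the TF-IDF features, the three category names and the six training statements are all assumptions made for illustration; the study's features, categories and training data are not given in the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy statements and hypothetical category labels, for illustration only.
TRAIN = [
    ("All relevant data are within the paper.", "in_paper"),
    ("All data are included in the paper and its supporting files.", "in_paper"),
    ("Data are available from the authors on request.", "on_request"),
    ("Data can be obtained on reasonable request to the corresponding author.", "on_request"),
    ("Data have been deposited in a public repository.", "repository"),
    ("Sequence files were deposited in an online repository.", "repository"),
]
texts = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]

# Word and bigram TF-IDF features feeding a linear SVM (an assumed, common baseline).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

# Classify a new, unseen statement.
predictions = model.predict(["The data are available on request from the authors."])
print(predictions[0])
```

With only six training examples this shows the mechanics, not the accuracy. The 87% and 89% figures on slide 13 come from the authors' own test on their manually coded statements.
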
Editor's notes

  1. “A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism