SlideShare une entreprise Scribd logo
1  sur  1
Télécharger pour lire hors ligne
Introduction
Next-generation sequencing technologies have led to a rapid
production of high-throughput sequence data characterizing
adaptive immune-receptor repertoires (AIRRs). As part of the
AIRR community (http://airr-community.org) data standards
working group, we have developed an initial set of metadata
recommendations for publishing AIRR sequencing studies.
These recommendations will be implemented in several
public repositories, including the NCBI sequence read archive
(SRA). Submissions to SRA typically use a flat-file template
and include only a minimal amount of term validation. In
order to ease the metadata authoring and to implement the
ontological terms validation of repertoire sequence data, we
are developing an interactive template through CEDAR
workbench that will allow for ontological validation, and
subsequent deposition in SRA. CEDAR workbench also allows
the user to populate the template with metadata for data
submission to various data repositories. The incorporation of
template-element level ontology mapping not only facilitates
validation of data submission, but also enables intelligent
queries within and across repositories.
High-quality Metadata and Challenges
High-quality metadata are seen as crucial to facilitate
knowledge discovery. The biomedical community has a
strong history of tackling metadata challenge by driving the
development of metadata templates. These templates focus
on addressing the reproducibility challenge by providing
detailed checklists of the metadata needed to describe
particular types of experimental data sources. The key goal is
to provide sufficient metadata to enable the source studies
to be reproduced. While individual metadata templates can
provide a standard format for a particular data source, they
rarely share common structure or semantics. There is also a
disconnect between the high-level checklist-based template
definitions developed by scientific communities and the
submission formats required by metadata repositories.
Moreover, different repositories provide their locally defined
templates for describing metadata. These templates lack the
use of common data elements and standard vocabularies.
This creates a barrier for sharing and using metadata to
enable knowledge discovery. We use CEDAR workbench to
create common templates for entering metadata. To
enhance machine readability, we use CEDAR’s capability to
link individual data elements and their values to ontology
concepts
AIRR Data Submission to SRA Leveraging
CEDAR Workbench
CEDAR is supported by grant U54 AI117925 awarded by the National Institute of Allergy and Infectious Diseases through funds provided by the trans-NIH Big Data toKnowledge (BD2K) initiative (www.bd2k.nih.gov).
Syed Ahmad Chan Bukhari1, Martin J. O'Connor2, John Graybeal2, Mark A. Musen2, Kei-Hoi Cheung3, Steven H. Kleinstein1
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune
Receptor Repertoire Data to the Sequence Read Archive (SRA)
1Department of Pathology, Yale School of Medicine, New Haven, CT , 2Center for Expanded Data Annotation and Retrieval, Stanford Center for Biomedical
Informatics Research, Stanford University and 3Department of Emergency Medicine, Yale School of Medicine, New Haven, CT
Figure 1. Metadata Life Cycle in CEDAR Workbench
The CEDAR Workbench
The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical
datasets to facilitate data discovery, data interpretation, and data reuse. CEDAR takes advantage of emerging community-based standard
templates for describing different kinds of biomedical datasets. CEDAR workbench investigates the use of computational techniques to help
investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify
metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture
annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including
secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology
Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the
problems of metadata authoring and management that will generalize to other data-management environments.
CEDAR
CENTER FOR EXPANDED DATA
ANNOTATION AND RETRIEVAL
CEDAR
R EXPANDED DATA
ON AND RETRIEVAL
Minimal Standards WG Recommendations for The AIRR Sequencing Data
As high throughput experiments become more prevalent
in the field of Immunology and elsewhere, there is an
increased need for collective organization of data and
standardized methods of data reporting. No current
standards exist for adaptive immune receptor repertoire
sequencing data. Data and metadata formats need to be
harmonized so that data from different experiments can
be mined. Once recovered, the mined data need to have
sufficient descriptive metadata in order to be useful. To
fulfill these unmet needs, we propose a set of minimal
standards that we recommend journals adopt and that
could form the requirements for submission to a public
data repository:
1. The experimental study design including sample data
relationships (e.g., which raw data file(s) relate to
which sample, which samples are technical, which are
biological replicates).
2. The essential sample annotation including
experimental factors and their values (e.g., the set of
markers used to sort the cell population being
studied).
3. Sufficient	annotation	of	the	amplicon being	sequenced that	would	allow	the	
raw	data	to	be	transformed	into	the	processed	sequences	(e.g.,	barcodes,	
primers,	unique	molecular	identifiers).
4. The	raw	data for	each	sequencing	run	(e.g.,	FASTQ	files)
5. The	essential	laboratory	and	data	processing	protocols (e.g.,	software	tools	
with	version	numbers,	quality	thresholds,	 primer	match	cutoffs,	etc.)	that	
have	been	used	to	obtain	the	final	processed	data.	
6. The	final	processed	antigen	receptor	sequences for	the	set	of	samples	in	the	
experiment	(e.g.,	the	set	of	sequences	used	for	V(D)J	assignment),	 along	
with	the	V(D)J	assignments	for	each	sequence.
Figure 2. Overview of the six high-level principles and associated data elements that
comprise the AIRR standard draft agreed to at the second annualAIRR Community
meeting in 2016.
Figure 4. CEDAR Workbench to SRA Conversion Workflow
c
a b
Figure 3(a). ARR Minimal Standard Data Elements, 3(b) Ontology Controlled
Template Authoring Through CEDAR Workbench and 3(c) AIRR Data
Submission Template
CEDAR JSON-LD to SRA XML Converter Demo
References
1- Musen, Mark A., et al. "The center for expanded data annotation and retrieval." Journal of the American Medical
Informatics Association 22.6 (2015): 1148-1152.
2- Leinonen, Rasko, Hideaki Sugawara, and Martin Shumway. "The sequence read archive." Nucleic acids research (2010): gkq1019.
Acknowledgement: We acknowledge Dr. Ben Busby from NCBI for his valuable suggestions during this research
work.

Contenu connexe

Tendances

Semantic enrichment and similarity approximation for biomedical sequence images
Semantic enrichment and similarity approximation for biomedical sequence imagesSemantic enrichment and similarity approximation for biomedical sequence images
Semantic enrichment and similarity approximation for biomedical sequence images
Syed Ahmad Chan Bukhari, PhD
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244
Yasel Cruz
 

Tendances (18)

Semantic enrichment and similarity approximation for biomedical sequence images
Semantic enrichment and similarity approximation for biomedical sequence imagesSemantic enrichment and similarity approximation for biomedical sequence images
Semantic enrichment and similarity approximation for biomedical sequence images
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information Retrieval
 
The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook Ontology
 
B.3.5
B.3.5B.3.5
B.3.5
 
Rescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information SystemsRescuing Data from Decaying and Moribund Clinical Information Systems
Rescuing Data from Decaying and Moribund Clinical Information Systems
 
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental MetadataMaking it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
 
Data mining weka
Data mining wekaData mining weka
Data mining weka
 
PNNL April 2011 ogce
PNNL April 2011 ogcePNNL April 2011 ogce
PNNL April 2011 ogce
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...
 
Knowledge Discovery in an Agents Environment
Knowledge Discovery in an Agents EnvironmentKnowledge Discovery in an Agents Environment
Knowledge Discovery in an Agents Environment
 
Content Modelling for VIEW Datasets Using Archetypes
Content Modelling for VIEW Datasets Using ArchetypesContent Modelling for VIEW Datasets Using Archetypes
Content Modelling for VIEW Datasets Using Archetypes
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge Graphs
 
NETTAB 2012
NETTAB 2012NETTAB 2012
NETTAB 2012
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
 
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
PERFORMANCE EVALUATION OF STRUCTURED AND SEMI-STRUCTURED BIOINFORMATICS TOOLS...
 
FAIRness Assessment of the Library of Integrated Network-based Cellular Signa...
FAIRness Assessment of the Library of Integrated Network-based Cellular Signa...FAIRness Assessment of the Library of Integrated Network-based Cellular Signa...
FAIRness Assessment of the Library of Integrated Network-based Cellular Signa...
 
Claudia medina: Linking Health Records for Population Health Research in Brazil.
Claudia medina: Linking Health Records for Population Health Research in Brazil.Claudia medina: Linking Health Records for Population Health Research in Brazil.
Claudia medina: Linking Health Records for Population Health Research in Brazil.
 

Similaire à Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor Repertoire Data to the Sequence Read Archive (SRA)

eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platform
ibemam
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
Paolo Missier
 
FAIR sequencing data repository based on iRODS
FAIR sequencing data repository based on iRODSFAIR sequencing data repository based on iRODS
FAIR sequencing data repository based on iRODS
Felipe Gutierrez
 

Similaire à Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor Repertoire Data to the Sequence Read Archive (SRA) (20)

eTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service PlatformeTRIKS Data Harmonization Service Platform
eTRIKS Data Harmonization Service Platform
 
The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects
 
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
The CEDAR Workbench: An Ontology-Assisted Environment for Authoring Metadata ...
 
Integrated research data management in the Structural Sciences
Integrated research data management in the Structural SciencesIntegrated research data management in the Structural Sciences
Integrated research data management in the Structural Sciences
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data Science
 
2014 genome informatics Linked Data
2014 genome informatics Linked Data2014 genome informatics Linked Data
2014 genome informatics Linked Data
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
 
Microarrays Databases.pptx
Microarrays Databases.pptxMicroarrays Databases.pptx
Microarrays Databases.pptx
 
FAIR Data Knowledge Graphs
FAIR Data Knowledge GraphsFAIR Data Knowledge Graphs
FAIR Data Knowledge Graphs
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
 
FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to Practice
 
CEDAR Technologies for AIRR Submissions
CEDAR Technologies for AIRR SubmissionsCEDAR Technologies for AIRR Submissions
CEDAR Technologies for AIRR Submissions
 
FAIR sequencing data repository based on iRODS
FAIR sequencing data repository based on iRODSFAIR sequencing data repository based on iRODS
FAIR sequencing data repository based on iRODS
 
Standardization of the HIPC Data Templates
Standardization of the HIPC Data TemplatesStandardization of the HIPC Data Templates
Standardization of the HIPC Data Templates
 
Standardization of the HIPC Data Templates: The Story So Far
Standardization of the HIPC Data Templates: The Story So FarStandardization of the HIPC Data Templates: The Story So Far
Standardization of the HIPC Data Templates: The Story So Far
 
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
Stephen Friend CRUK-MD Anderson Cancer Workshop 2012-02-28
 
SooryaKiran Bioinformatics
SooryaKiran BioinformaticsSooryaKiran Bioinformatics
SooryaKiran Bioinformatics
 
Finding and Reusing Biomedical Datasets using CEDAR Metadata Repository and T...
Finding and Reusing Biomedical Datasets using CEDAR Metadata Repository and T...Finding and Reusing Biomedical Datasets using CEDAR Metadata Repository and T...
Finding and Reusing Biomedical Datasets using CEDAR Metadata Repository and T...
 
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 

Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor Repertoire Data to the Sequence Read Archive (SRA)

  • 1. Introduction Next-generation sequencing technologies have led to a rapid production of high-throughput sequence data characterizing adaptive immune-receptor repertoires (AIRRs). As part of the AIRR community (http://airr-community.org) data standards working group, we have developed an initial set of metadata recommendations for publishing AIRR sequencing studies. These recommendations will be implemented in several public repositories, including the NCBI sequence read archive (SRA). Submissions to SRA typically use a flat-file template and include only a minimal amount of term validation. In order to ease the metadata authoring and to implement the ontological terms validation of repertoire sequence data, we are developing an interactive template through CEDAR workbench that will allow for ontological validation, and subsequent deposition in SRA. CEDAR workbench also allows the user to populate the template with metadata for data submission to various data repositories. The incorporation of template-element level ontology mapping not only facilitates validation of data submission, but also enables intelligent queries within and across repositories. High-quality Metadata and Challenges High-quality metadata are seen as crucial to facilitate knowledge discovery. The biomedical community has a strong history of tackling metadata challenge by driving the development of metadata templates. These templates focus on addressing the reproducibility challenge by providing detailed checklists of the metadata needed to describe particular types of experimental data sources. The key goal is to provide sufficient metadata to enable the source studies to be reproduced. While individual metadata templates can provide a standard format for a particular data source, they rarely share common structure or semantics. There is also a disconnect between the high-level checklist-based template definitions developed by scientific communities and the submission formats required by metadata repositories. Moreover, different repositories provide their locally defined templates for describing metadata. These templates lack the use of common data elements and standard vocabularies. This creates a barrier for sharing and using metadata to enable knowledge discovery. We use CEDAR workbench to create common templates for entering metadata. To enhance machine readability, we use CEDAR’s capability to link individual data elements and their values to ontology concepts AIRR Data Submission to SRA Leveraging CEDAR Workbench CEDAR is supported by grant U54 AI117925 awarded by the National Institute of Allergy and Infectious Diseases through funds provided by the trans-NIH Big Data toKnowledge (BD2K) initiative (www.bd2k.nih.gov). Syed Ahmad Chan Bukhari1, Martin J. O'Connor2, John Graybeal2, Mark A. Musen2, Kei-Hoi Cheung3, Steven H. Kleinstein1 Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Immune Receptor Repertoire Data to the Sequence Read Archive (SRA) 1Department of Pathology, Yale School of Medicine, New Haven, CT , 2Center for Expanded Data Annotation and Retrieval, Stanford Center for Biomedical Informatics Research, Stanford University and 3Department of Emergency Medicine, Yale School of Medicine, New Haven, CT Figure 1. Metadata Life Cycle in CEDAR Workbench The CEDAR Workbench The Center for Expanded Data Annotation and Retrieval is studying the creation of comprehensive and expressive metadata for biomedical datasets to facilitate data discovery, data interpretation, and data reuse. CEDAR takes advantage of emerging community-based standard templates for describing different kinds of biomedical datasets. CEDAR workbench investigates the use of computational techniques to help investigators to assemble templates and to fill in their values. We are creating a repository of metadata from which we plan to identify metadata patterns that will drive predictive data entry when filling in metadata templates. The metadata repository not only will capture annotations specified when experimental datasets are initially created, but also will incorporate links to the published literature, including secondary analyses and possible refinements or retractions of experimental interpretations. By working initially with the Human Immunology Project Consortium and the developers of the ImmPort data repository, we are developing and evaluating an end-to-end solution to the problems of metadata authoring and management that will generalize to other data-management environments. CEDAR CENTER FOR EXPANDED DATA ANNOTATION AND RETRIEVAL CEDAR R EXPANDED DATA ON AND RETRIEVAL Minimal Standards WG Recommendations for The AIRR Sequencing Data As high throughput experiments become more prevalent in the field of Immunology and elsewhere, there is an increased need for collective organization of data and standardized methods of data reporting. No current standards exist for adaptive immune receptor repertoire sequencing data. Data and metadata formats need to be harmonized so that data from different experiments can be mined. Once recovered, the mined data need to have sufficient descriptive metadata in order to be useful. To fulfill these unmet needs, we propose a set of minimal standards that we recommend journals adopt and that could form the requirements for submission to a public data repository: 1. The experimental study design including sample data relationships (e.g., which raw data file(s) relate to which sample, which samples are technical, which are biological replicates). 2. The essential sample annotation including experimental factors and their values (e.g., the set of markers used to sort the cell population being studied). 3. Sufficient annotation of the amplicon being sequenced that would allow the raw data to be transformed into the processed sequences (e.g., barcodes, primers, unique molecular identifiers). 4. The raw data for each sequencing run (e.g., FASTQ files) 5. The essential laboratory and data processing protocols (e.g., software tools with version numbers, quality thresholds, primer match cutoffs, etc.) that have been used to obtain the final processed data. 6. The final processed antigen receptor sequences for the set of samples in the experiment (e.g., the set of sequences used for V(D)J assignment), along with the V(D)J assignments for each sequence. Figure 2. Overview of the six high-level principles and associated data elements that comprise the AIRR standard draft agreed to at the second annualAIRR Community meeting in 2016. Figure 4. CEDAR Workbench to SRA Conversion Workflow c a b Figure 3(a). ARR Minimal Standard Data Elements, 3(b) Ontology Controlled Template Authoring Through CEDAR Workbench and 3(c) AIRR Data Submission Template CEDAR JSON-LD to SRA XML Converter Demo References 1- Musen, Mark A., et al. "The center for expanded data annotation and retrieval." Journal of the American Medical Informatics Association 22.6 (2015): 1148-1152. 2- Leinonen, Rasko, Hideaki Sugawara, and Martin Shumway. "The sequence read archive." Nucleic acids research (2010): gkq1019. Acknowledgement: We acknowledge Dr. Ben Busby from NCBI for his valuable suggestions during this research work.