WHAT'S IN A NAME?
Better vocabulary = better bioinformatics???

From flickr user giantginkgo
# Author: Keith Bradnam, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike
3.0 Unported License.
http://biomickwatson.wordpress.com

Most of the interesting 'stuff' that I discover about bioinformatics and genomics comes from
a) Twitter, b) blogs, and c) papers (in that order). Mick Watson has a fun and engaging blog
about bioinformatics and today he raised an important point: the lack of standardization in
scientific databases leads to frustration (and frustration leads to...suffering).
http://biomickwatson.wordpress.com

These are some terms that appear in the same database. You can code solutions for some of
this variation (e.g. British/American English differences or presence/absence of underscore vs
space character), but who wants to waste time doing that? Shouldn't these databases be
using controlled vocabularies?
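
To give a feel for the kind of clean-up code this variation forces on you, here is a minimal sketch. The spelling map and the example terms are made up for illustration (the real terms are on the slide image), and the function name `normalize_term` is hypothetical.

```python
import re

# Hypothetical British -> American spelling map; extend it for your own data.
SPELLING = {"tumour": "tumor", "oesophagus": "esophagus", "colour": "color"}


def normalize_term(term: str) -> str:
    """Collapse case, underscore/space differences and a few spelling variants."""
    # Treat runs of underscores and whitespace as a single separator.
    words = re.split(r"[\s_]+", term.strip().lower())
    return " ".join(SPELLING.get(word, word) for word in words if word)
```

With this map, `Breast_Tumour` and `breast tumor` both normalize to `breast tumor`. It works, but every new variant needs another rule, which is exactly the argument for a controlled vocabulary.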
This infamous paper from 2004 reveals how easy it is to introduce errors into biological
databases.
First highlighted column = actual gene name.
Second highlighted column = what Excel will automatically assume you mean.
RIKEN ID: 2310009E13

Happens for other identifiers as well. This RIKEN ID will change if it ever ends up in Excel...
RIKEN ID: 2.31E+13

...now it appears as a number in scientific notation.
The paper shows that these 'dates-as-gene-names' ended up propagating to other
databases.
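
A quick sanity check can catch many of these before they spread. The sketch below is only a heuristic under my own assumptions: it flags values that look like Excel's day-month dates or like numbers in scientific notation. The patterns and the function name `looks_excel_mangled` are illustrative, not taken from the paper.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Values such as "2-Sep" or "1-Dec": what Excel shows after date conversion.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)

# Values such as "2.31E+13": what Excel shows after numeric conversion.
SCI_NOTATION = re.compile(r"^\d(\.\d+)?E\+\d+$", re.IGNORECASE)


def looks_excel_mangled(value: str) -> bool:
    """Return True if a value resembles an Excel-converted identifier."""
    value = value.strip()
    return bool(DATE_LIKE.match(value) or SCI_NOTATION.match(value))
```

This flags `2-Sep` and `2.31E+13` but leaves `DEC1` and `2310009E13` alone. A flagged value still needs a human to decide whether it is a genuine name or a conversion error.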
I searched today for '2-Sep' at GenBank and this was the only hit. It's possible that this is an
intended gene-name variant, but Septin 2 is usually referred to as sep2/sept2/sep-2 etc. So
this is possibly another Excel-based error.
Sometimes people make assumptions that gene names are unique to a specific function.
DEC1 (one of the Excel-ified gene names mentioned in the earlier paper) can mean one thing
to people working on many vertebrate species...
...but something else if you work on fruit flies. Dangerous to make any assumptions when it
comes to gene names.
Consider one worm gene...

Here is one Caenorhabditis elegans gene (abu-11) in WormBase. There is the official gene
name, a sequence name, 'other' names, the WormBase gene ID, plus other identifiers for
external databases which also describe the gene (there's also a protein ID, not shown here).
In C. elegans, gene names have a central naming authority (the CGC) but genes often get
renamed. Just look at these pqn genes which have been renamed or merged with other
genes.
This is the current view of the twk-43 gene in C. elegans (aka F32H5.7[abc]).
WormBase allows you to see the history behind genes. This gene started out as just F32H5.2,
a gene with no splice isoforms.
Then at some point it was split into 3 genes...
...before being converted into the current one gene (with four splice isoforms). Genes are
split and merged and renamed all the time. Relying on the common gene name (e.g. twk-43)
or the sequence identifier (F32H5.7) can get you into trouble.
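
One way to protect your own scripts is to key everything on a stable database ID and treat names as aliases that may be ambiguous. The following is a minimal sketch of that idea. All IDs and names in it are hypothetical placeholders (not real WormBase records), and `GeneRecord`, `build_alias_index` and `resolve` are names I made up.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    stable_id: str                      # placeholder ID, not a real database ID
    current_name: str
    former_names: set[str] = field(default_factory=set)


def build_alias_index(records: list[GeneRecord]) -> dict[str, set[str]]:
    """Map every current or former name to the stable IDs that have used it."""
    index: dict[str, set[str]] = {}
    for record in records:
        for name in {record.current_name, *record.former_names}:
            index.setdefault(name, set()).add(record.stable_id)
    return index


def resolve(name: str, index: dict[str, set[str]]) -> str:
    """Return the single stable ID for a name, or fail loudly if ambiguous."""
    hits = index.get(name, set())
    if len(hits) != 1:
        raise LookupError(f"{name!r} maps to {len(hits)} stable IDs: {sorted(hits)}")
    return next(iter(hits))
```

If two records share a former name (say, after a split), `resolve` raises an error instead of silently picking one, which is the behavior you want when names are reused.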
SOLUTIONS

What can be done to help with these sorts of problems?
Use ontologies and understand what those ontologies do.
Three main parts to a Gene Ontology term (GO term):
1) The name
2) The accession
3) The definition (which can change)
A fourth major part of a GO term is that it has ancestors and children. A single term can be
'part of' other terms and can also be an example of ('is a') other terms. E.g. a nuclear outer
membrane *is* a nuclear membrane and is *part of* the cell.
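
The parts of a term described above can be sketched as a small data structure. This is only an illustration under my own assumptions: the accessions are placeholders (not real GO accessions), the definitions are left for you to fill in, and `Term` and `ancestors` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    accession: str                      # stable identifier (placeholder here)
    name: str                           # human-readable name
    definition: str                     # free text, which can change over time
    is_a: list[str] = field(default_factory=list)      # parent accessions
    part_of: list[str] = field(default_factory=list)   # parent accessions


def ancestors(accession: str, terms: dict[str, Term]) -> set[str]:
    """Collect every ancestor reachable through 'is a' or 'part of' links."""
    seen: set[str] = set()
    stack = [accession]
    while stack:
        current = terms[stack.pop()]
        for parent in current.is_a + current.part_of:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Placeholder accessions mirroring the nuclear outer membrane example.
TERMS = {
    "EX:1": Term("EX:1", "cell", "..."),
    "EX:2": Term("EX:2", "nuclear membrane", "...", part_of=["EX:1"]),
    "EX:3": Term("EX:3", "nuclear outer membrane", "...", is_a=["EX:2"]),
}
```

Calling `ancestors("EX:3", TERMS)` walks up both kinds of link and returns the nuclear membrane and the cell, which is why a gene annotated with a specific term is implicitly annotated with all of its ancestors too.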
Most model organism databases are loaded up with GO terms. E.g. you can search GO terms
from the 'front door' of FlyBase.
In WormBase, the same GO term search takes you directly to a gene page.
Scroll down on that gene page and we see the specified GO term...but what is an 'evidence
code', and what does 'IDA' mean?
Sadly, the majority of people who use GO terms (as part of 'DAVID' analyses etc.) have no
knowledge of evidence codes.
All GO terms should be connected to genes (or other database entries) with evidence codes.
The evidence code gives you an idea of how robust the assignment is. Databases like WormBase
have curators who scan papers (by eye, but also with software) to find suitable GO terms that
can be added to genes on the basis of experiments described in the paper.
Most of the GO annotations you will ever see have this evidence code: IEA (Inferred from
Electronic Annotation). It is among the weakest of all evidence (also avoid any evidence which
is a 'non-traceable author statement'). It could simply mean that a human protein (with some
known information) was BLASTed against a yeast genome and the resulting yeast match acquired
the human meta-information as GO terms. IEA codes should be treated with some suspicion.
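
If you have a GO annotation file to hand, it is easy to check how much of it is electronic. The sketch below assumes a tab-delimited GAF file in which comment lines start with `!` and the evidence code is the seventh column. The function names are hypothetical and this is not an official parser.

```python
from collections import Counter

EVIDENCE_COLUMN = 6  # 0-based index of the evidence code column in GAF


def evidence_counts(lines) -> Counter:
    """Count annotations per evidence code, skipping comments and blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN:
            counts[fields[EVIDENCE_COLUMN]] += 1
    return counts


def fraction_electronic(counts: Counter) -> float:
    """Fraction of annotations carrying the IEA evidence code."""
    total = sum(counts.values())
    return counts["IEA"] / total if total else 0.0
```

Pass an open file handle (or any iterable of lines) to `evidence_counts`. Dropping or down-weighting the IEA rows before an enrichment analysis is then a one-line filter.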
48.2% of GO annotations
— in one of the best annotated eukaryotic animal genomes —
are generated automatically
The Gene Ontology website shows how many GO terms are attached to genes in different
organisms. Even in C. elegans (with >15 years of gene annotation), about half of the GO
annotations are in the IEA category.
Gene Ontology is not the only game in town. Sequence Ontology (SO) is widely used and a
subset of SO terms are used in GFF files to describe features (or at least they should be!).
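
A simple way to see whether a GFF file plays by the rules is to compare its third column (the feature type) against the SO terms you expect. This is a minimal sketch. The allowed set is a small hand-picked sample of SO term names rather than the full ontology, and it ignores the fact that GFF3 also accepts SO accessions in that column.

```python
# A small, hand-picked sample of SO term names (not the full ontology).
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def unknown_feature_types(gff_lines, allowed=ALLOWED_TYPES) -> set[str]:
    """Return feature types (column 3) that are not in the allowed set."""
    unknown: set[str] = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] not in allowed:
            unknown.add(fields[2])
    return unknown
```

Anything this returns is either a legitimate SO term missing from your sample set or something a tool author made up, and both are worth knowing about.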
GO and SO are part of OBO (Open Biological Ontologies: http://www.obofoundry.org). There
may be a community developing an ontology for your field of interest. This site lists them all.
Some get very specific.
SUMMARY
Use ontologies whenever possible
Don't assume that identifiers in existing databases are
the correct (or only) identifiers
Be careful when inflicting new database identifiers on
to the world!

On the last point, check that your identifiers (even if they end up buried in supplementary
material somewhere) don't conflict with other databases out there. Long and boring
identifiers are usually the most stable and the most easily parsed by scripts (although they
are the least human-friendly). But no spaces or asterisks in identifiers please!
This talk is KORF_labtalk_00000315
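
For what it's worth, identifiers in that long-and-boring style are trivial to generate and validate. The sketch below assumes a convention of alphanumeric words joined by underscores, followed by a zero-padded eight-digit number. That convention is my reading of the example above, and the function names are hypothetical.

```python
import re

# Alphanumeric words joined by underscores, e.g. "KORF_labtalk".
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9]+(_[A-Za-z0-9]+)*")

# A prefix followed by an underscore and exactly eight digits.
ID_PATTERN = re.compile(r"[A-Za-z0-9]+(_[A-Za-z0-9]+)*_\d{8}")


def make_identifier(prefix: str, number: int, width: int = 8) -> str:
    """Build an identifier such as KORF_labtalk_00000315."""
    if not PREFIX_PATTERN.fullmatch(prefix):
        raise ValueError(f"prefix {prefix!r} contains disallowed characters")
    if not 0 <= number < 10 ** width:
        raise ValueError(f"number {number} does not fit in {width} digits")
    return f"{prefix}_{number:0{width}d}"


def is_valid_identifier(identifier: str) -> bool:
    """Check an identifier against the eight-digit convention."""
    return bool(ID_PATTERN.fullmatch(identifier))
```

Because spaces and asterisks can never match, a bad identifier is rejected when it is created rather than discovered later by someone else's parser.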

Better vocabulary = better bioinformatics with standardized databases and ontologies

  • 1. WHAT'S IN A NAME? Better vocabulary = better bioinformatics??? From flickr user giantginkgo # Author: Keith Bradnam, Genome Center, UC Davis # This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
  • 2. http://biomickwatson.wordpress.com Most of the interesting 'stuff' that I discover about bioinformatics and genomics comes from a) twitter, b) blogs, and c) papers (in that order). Mick Watson has fun and engaging blog about bioinformatics and today he raised an important point: the lack of standardization in scientific databases leads to frustration (and frustration leads to...suffering).
  • 3. http://biomickwatson.wordpress.com These are some terms that appear in the same database. You can code solutions for some of this variation (e.g. British/American English differences or presence/absence of underscore vs space character), but who wants to waste time doing that? Shouldn't these databases be using controlled vocabularies?
  • 4. This infamous paper from 2004 reveals how easy it is to introduce errors into biological databases.
  • 5. First highlighted column = actual gene name. Second highlighted column = what Excel will automatically assume you mean.
  • 6. RIKEN ID: 2310009E13 Happens for other identifiers as well. This RIKEN ID will change if it ever ends up in Excel...
  • 7. RIKEN ID: 2.31E+13 ...now it appears as a number in scientific notation.
  • 8. The paper shows that these 'dates-as-gene-names' ended up propagating to other databases.
  • 9. I searched today for '2-Sep' at GenBank and this was the only hit. It's possible that this is an intended gene-name variant, but Septin 2 is usually referred to as sep2/sept2/sep-2 etc. So this is possibly another Excel-based error.
  • 10. Sometimes people make assumptions that gene names are unique to a specific function. DEC1 (one of the Excel-ified gene names mentioned in the earlier paper) can mean one thing to people working on many vertebrate species...
  • 11. ...but something else if you work on fruit flies. Dangerous to make any assumptions when it comes to gene names.
  • 12. Consider one worm gene... Here is one Caenorhabditis elegans gene (abu-11) in WormBase. There is the official gene name, a sequence name, 'other' names, the WormBase gene ID, plus other identifiers for external databases which also describe the gene (there's also a protein ID, not shown here).
  • 13. In C. elegans, gene names have a central naming authority (the CGC) but genes often get renamed. Just look at these pqn genes which have been renamed or merged with other genes.
  • 14. This is the current view of the twk-43 gene in C. elegans (aka F32H5.7[abc]).
  • 15. WormBase allows you to see the history behind genes. This gene started out as just F32H5.2, a gene with no splice isoforms.
  • 16. Then at some point it was split into 3 genes...
  • 17. ...before being converted into the current one gene (with four splice isoforms). Genes are split and merged and renamed all the time. Relying on the common gene name (e.g. twk-43) or the sequence identifier (F32H5.7) can get you into trouble.
  • 18. SOLUTIONS What can be done to help with these sorts of problems?
  • 19. Use ontologies and understand what those ontologies do.
  • 20. Three main parts to a Gene Ontology term (GO term): 1) The name 2) The accession 3) The definition (which can change)
  • 21. A fourth major part of a GO term is that it has ancestors and children. A single term can be 'part of' other terms and can also be an example of ('is a') other terms. E.g. a nuclear outer membrane *is* a nuclear membrane and is *part of* the cell.
  • 22. Most model organism databases are loaded up with GO terms. E.g. you can search GO terms from the 'front door' of FlyBase.
  • 23. In WormBase, the same GO term search takes you directly to a gene page.
  • 24. Scroll down on that gene page and we see the specified GO term...but what is an 'evidence code', and what does 'IDA' mean? Sadly the majority of people who use GO terms (as part of 'DAVID' analyses etc.) have no knowledge of evidence codes.
  • 25. All GO terms should be connected to genes (or other database entries) with evidence codes. Gives you an idea of how robust the assignment is. Databases like WormBase have curators that scan papers (by eye, but also with software) to find suitable GO terms that can be added to genes on the basis of experiments described in the paper.
  • 26. Most of the GO terms you will ever see have this evidence code (IEA, 'Inferred from Electronic Annotation'). It is among the weakest of all evidence (also avoid any evidence which is 'non-traceable author statement'). It could simply mean that a human protein (with some known information) was BLASTed against a yeast genome and the resulting yeast match acquired the human meta-information as GO terms. IEA codes should be treated with some suspicion.
  • 27. 48.2% of GO annotations (in one of the best annotated eukaryotic animal genomes) are generated automatically. The Gene Ontology website shows how many GO terms are attached to genes in different organisms. Even in C. elegans (with >15 years of gene annotation), about half of the GO annotations are in the IEA category.
  • 28. Gene Ontology is not the only game in town. Sequence Ontology (SO) is widely used and a subset of SO terms are used in GFF files to describe features (or at least they should be!).
  • 29. GO and SO are part of OBO (Open Biological Ontologies: http://www.obofoundry.org). There may be a community developing an ontology for your field of interest. This site lists them all.
  • 30. Some get very specific.
  • 32. Use ontologies whenever possible. Don't assume that identifiers in existing databases are the correct (or only) identifiers. Be careful when inflicting new database identifiers on to the world! On the last point, check that your identifiers (even if they end up buried in supplementary material somewhere) don't conflict with other databases out there. Long and boring identifiers are usually the most stable and the most easily parsed by scripts (although they are the least human-friendly). But no spaces or asterisks in identifiers please! This talk is KORF_labtalk_00000315
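
The Excel problem from slides 5 to 7 leaves recognisable traces, so a column of identifiers can be screened before it goes into a database. The sketch below is one possible check, not a method from the talk: the function name and both regular expressions are assumptions, and they only cover the two patterns shown on the slides (day-month strings such as '2-Sep' and scientific notation such as '2.31E+13').

```python
import re

# Day-month strings such as "2-Sep": what a spreadsheet shows after
# reading a gene name as a date.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)

# Scientific notation such as "2.31E+13": what a spreadsheet shows after
# reading an identifier like the RIKEN ID 2310009E13 as a number.
# The decimal point and sign are required so the intact ID is not flagged.
SCI_NOTATION = re.compile(r"^\d\.\d+E[+-]\d+$", re.IGNORECASE)


def looks_excel_mangled(identifier: str) -> bool:
    """Return True if the string matches either suspicious pattern."""
    text = identifier.strip()
    return bool(DATE_LIKE.match(text) or SCI_NOTATION.match(text))


if __name__ == "__main__":
    for value in ["2-Sep", "2.31E+13", "2310009E13", "twk-43"]:
        print(value, looks_excel_mangled(value))
```

A hit is only a warning: as slide 9 notes, a string like '2-Sep' could occasionally be an intended name, so flagged entries still need a human look.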
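
Slides 24 to 27 suggest looking at evidence codes before trusting a GO annotation. As a rough sketch, assuming annotations have already been read into (gene, GO accession, evidence code) tuples, the code below separates automatically generated or untraceable annotations from the rest and reports what fraction each code contributes. The function names, the choice of codes treated as weak, and the sample records are all illustrative assumptions, not data from WormBase or the Gene Ontology site.

```python
from collections import Counter

# Codes the slides single out as weak: IEA (electronic annotation) and
# NAS (non-traceable author statement). This set is an illustrative choice.
WEAK_CODES = {"IEA", "NAS"}


def split_by_evidence(annotations):
    """Split (gene, go_accession, evidence_code) tuples into (strong, weak)."""
    strong, weak = [], []
    for record in annotations:
        if record[2] in WEAK_CODES:
            weak.append(record)
        else:
            strong.append(record)
    return strong, weak


def evidence_fractions(annotations):
    """Return the fraction of annotations carrying each evidence code."""
    counts = Counter(code for _, _, code in annotations)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


if __name__ == "__main__":
    # Invented records, for illustration only.
    sample = [
        ("geneA", "GO:0000001", "IDA"),
        ("geneA", "GO:0000002", "IEA"),
        ("geneB", "GO:0000002", "IEA"),
        ("geneC", "GO:0000003", "NAS"),
    ]
    strong, weak = split_by_evidence(sample)
    print(len(strong), "experimentally supported;", len(weak), "weak")
    print(evidence_fractions(sample))
```

Running an enrichment analysis on both the full set and the strong subset is one way to see how much a result depends on IEA annotations.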
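
The closing advice on identifiers (long and boring, no spaces or asterisks, no clashes with existing databases) can be turned into a few small checks. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the eight-digit zero padding simply mirrors KORF_labtalk_00000315, and the set of existing identifiers would in practice have to be gathered from whichever databases matter to you.

```python
import re

# Characters the talk asks to keep out of identifiers: whitespace and asterisks.
FORBIDDEN = re.compile(r"[\s*]")


def make_identifier(prefix: str, number: int, width: int = 8) -> str:
    """Build a fixed-width, zero-padded identifier such as PREFIX_00000315."""
    return f"{prefix}_{number:0{width}d}"


def is_script_friendly(identifier: str) -> bool:
    """True if the identifier is non-empty and has no spaces or asterisks."""
    return bool(identifier) and FORBIDDEN.search(identifier) is None


def find_conflicts(new_ids, existing_ids):
    """Return new identifiers that already occur among the existing ones."""
    return sorted(set(new_ids) & set(existing_ids))


if __name__ == "__main__":
    print(make_identifier("KORF_labtalk", 315))
    print(is_script_friendly("my gene*"))
    print(find_conflicts(["F32H5.7", "LAB_00000001"], ["F32H5.7", "abu-11"]))
```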