Global Digital Infrastructure for Biological Nomenclature and Taxonomy
Ellinor Michel, Dep’t of Life Sciences, The Natural History Museum, London, UK (e.michel@nhm.ac.uk)
Richard L. Pyle, Natural Sciences Dep’t, Bishop Museum, Honolulu, HI, USA
Robert P. Guralnick, Dep’t of Ecology & Evolutionary Biology, Univ Colorado, Boulder, CO, USA
Jon Todd, Dep’t of Earth Sciences, The Natural History Museum, London, UK
The future for interoperable scientific information is digital, yet scientific names, the handles for all biodiversity information, still lack an integrated system tied to published descriptions and museum type specimens. Descriptions and type specimens provide standards for the otherwise fluid concepts of biological taxa. We are working to unify the infrastructures for biological nomenclature across nomenclatural codes, including the zoological (ICZN, http://iczn.org/), botanical (ICNafp, http://www.iapt-taxon.org/nomen/main.php) and bacterial (ICNB) codes, through the Global Names Architecture (GNA). Our initial focus is on animal names, as these comprise the largest component of metazoan biodiversity and ZooBank (zoobank.org) is the first code-related online nomenclatural registration system. Users include applied scientists in agriculture, medicine, veterinary science and climate change research; biodiversity researchers such as ecologists and physiologists; archives such as museums; and the scientific publishing community. In short, the users are everyone who relies on scientific names of organisms based on the work of taxonomists.
1. Global Digital Infrastructure for biological nomenclature and taxonomy
Ellinor Michel (1, 2)
Richard Pyle (2, 3)
Robert Guralnick (4)
Jon Todd (1, 2)
(1) The Natural History Museum, London, UK
(2) Int’l Committee on Bionomenclature
(3) Bishop Museum, HI, USA
(4) Univ of Colorado, Boulder, USA
2. THE LINNAEAN ENTERPRISE (E.O. Wilson)
the task of identifying all of Earth’s biodiversity
3. Names and the information revolution
“All accumulated information of a species is tied to a scientific name, a name that serves as a link between what has been learned in the past and what we today add to the body of knowledge.”
- Grimaldi & Engel, 2005
Note: they don’t say THE scientific name (i.e., singular)
9. Type Principle
Stabilizing biological names with a physical standard
From living animals to … the type specimen
Pomatomus saltator (Linnaeus 1766)
From Linnaeus’ Fish Collection held at the Linnean Society of London
11. Type Specimen = Name
Stabilizing biological names with a physical standard
Gasterosteus saltatrix Linnaeus, 1766
Pomatomus saltator (Linnaeus 1766)
12. Types for the Future
✔Whole Organism
✔Organism Part
Tissue Sample
? DNA Extraction
?PCR Product
✗DNA Sequence
13. Types for the Future
✔ Whole Organism
✔ Organism Part
Tissue Sample
? DNA Extraction
? PCR Product - reproduced
✗ DNA Sequence - derived & interpreted
14. Sequence data is a (usually high fidelity, fine granularity) representation of molecules
DNA, RNA or proteins are single kinds of organismal data
15. Types for the Future
✔ Sequencing Type Specimens: data enrichment of existing types
? Sequencing ‘Epitypes’ (‘botanical’ term, new specimens from type locality, etc.): risk of losing stability
✗ ‘Type Sequence’
16. A name = ‘computer’ readable code that links information
17. A name = ‘computer’ readable code
FFF7160A-372D-40E9-9611-23AF5D9EAC4C
Hard for a human;
Easy for a computer
“Pomatomus saltator Linn.”
Easy for a human;
Hard for a computer
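The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the identifier is copied from the slide, but the `name_index` mapping and the `same_name` helper are hypothetical, and treating both spellings as variants of one name is an assumption made for the example.

```python
import uuid

# Identifier copied from the slide. Using it as the key for these
# particular name strings is an assumption made for illustration.
NAME_ID = uuid.UUID("FFF7160A-372D-40E9-9611-23AF5D9EAC4C")

# Hypothetical index: several human-friendly spellings resolve to one
# machine-friendly identifier.
name_index = {
    "Pomatomus saltator Linn.": NAME_ID,
    "Pomatomus saltator (Linnaeus 1766)": NAME_ID,
}


def same_name(a: str, b: str) -> bool:
    """Return True when both text strings resolve to the same identifier."""
    id_a = name_index.get(a)
    id_b = name_index.get(b)
    return id_a is not None and id_a == id_b


if __name__ == "__main__":
    # The UUID parses cleanly and compares exactly, with no fuzzy matching.
    print(NAME_ID, NAME_ID.version)
    print(same_name("Pomatomus saltator Linn.",
                    "Pomatomus saltator (Linnaeus 1766)"))
```

Comparing identifiers is an exact equality test, whereas comparing the two text strings directly would fail even though a human reads them as the same name.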
18. Where Are Biodiversity Data?
[Figure: ToL, GenBank, BDWB, Hymenoptera Name Server, CalPhotos]
Names connecting information
20. A Global Names Architecture
Rationale
• Taxon Names are the fundamental link among virtually
all biodiversity information
• Biodiversity Information relates to species concepts, but
data resources are usually tied to text-string names
• Text-string names are difficult to cross-link due to
spelling variations, different genus-species
combinations, homonyms, synonyms, etc.
• Linking text-string names to concepts requires source-based
(literature-based) approach
• The key challenge is to cross-link thousands of
biodiversity datasets through taxon concepts, using only
text-string names
21. A Global Names Architecture
Funding & Support (2007 to date)
National Science Foundation
BiSciCol (DBI-0956415)
GNA (DBI-1062441)
Encyclopedia of Life
GBIF
PBIN / NBII
Various others (e.g., NOAA, other NSF projects)
More NSF & EU Proposals in Process
Partners & Governance
CoL (Species2000 / ITIS), EOL, BHL, GBIF, IPNI, Index
Fungorum, ICZN / ZooBank, Landcare Research,
MOBOT / Tropicos, Bishop Museum, WHOI, IRMNG,
PESI, ALA (and numerous others)
Global Names Architecture Advisory Panel (GNAAP)
22. A Global Names Architecture
What GNA is NOT
Yet another database
What GNA IS (…intended to be…)
“Pomatomus saltator Linn.”
Easy for a human;
Hard for a computer
Name or Text-string name
23. A Global Names Architecture
What GNA is NOT
Yet another database
What GNA IS (…intended to be…)
UUID or GUID (Universally or Globally Unique Identifier):
FFF7160A-372D-40E9-9611-23AF5D9EAC4C
Hard for a human;
Easy for a computer
Name or Text-string name:
“Pomatomus saltator Linn.”
Easy for a human;
Hard for a computer
24. A Global Names Architecture
Data Components
Global Names Index
(GNI)
• Database and services
optimized for taxon names
represented as raw text
strings. (“Dirty Bucket”)
• ~17+ million text strings
• Parsing Services
• Lexical Grouping
• Links back to sources
• Developed at Woods Hole
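As a rough illustration of what parsing and lexical grouping of raw name strings involve, here is a toy sketch. It is not the GNI parser: the regular expression, the `parse_name`, `canonical` and `lexical_groups` functions, and the grouping rule (same genus plus epithet) are all assumptions chosen to keep the example short, and real name strings are far messier.

```python
import re
from collections import defaultdict

# Toy pattern: capitalised genus, lower-case epithet, then whatever is
# left over (authorship, with or without parentheses).
NAME_RE = re.compile(r"^\s*([A-Z][a-z]+)\s+([a-z]+)\s*(.*)$")
YEAR_RE = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def parse_name(text):
    """Split a raw binomial name string into its parts, or return None."""
    match = NAME_RE.match(text)
    if match is None:
        return None
    genus, epithet, rest = match.groups()
    year = YEAR_RE.search(rest)
    return {
        "genus": genus,
        "epithet": epithet,
        "authorship": rest.strip() or None,
        "year": int(year.group(1)) if year else None,
    }


def canonical(text):
    """Reduce a name string to 'Genus epithet', dropping authorship."""
    parsed = parse_name(text)
    return f"{parsed['genus']} {parsed['epithet']}" if parsed else None


def lexical_groups(strings):
    """Group raw strings that share the same canonical form."""
    groups = defaultdict(list)
    for s in strings:
        key = canonical(s)
        if key is not None:
            groups[key].append(s)
    return dict(groups)


if __name__ == "__main__":
    raw = [
        "Pomatomus saltator Linn.",
        "Pomatomus saltator (Linnaeus 1766)",
        "Gasterosteus saltatrix Linnaeus, 1766",
    ]
    for key, members in lexical_groups(raw).items():
        print(key, "->", members)
```

Note that purely lexical grouping puts the two Pomatomus strings together but cannot tell that Gasterosteus saltatrix is the original combination of the same name; that link needs curated data rather than string processing.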
25. A Global Names Architecture
Data Components
Global Names Usage Bank
(GNUB)
• Database and services
optimized for taxon names
represented as “curated”
Taxon Name Usages.
(“Clean Bucket”)
• >70K Agents
• >73K References
• >523K Taxon Name Usages
(>186K Protonyms)
• Developed at Bishop Museum
26. Reference
Any static document source (Publication;
Specimen Determination Label; Field
Notes, Correspondence, etc.).
Taxon Name Usage (TNU)
A usage of a taxon name within the
context of a Reference.
Protonym (≈Basionym)
A usage of a taxon name representing the
Code-Compliant “creation” of a new name.
tnuID  Reference                NameString   Rank     ProtonymID  ValidUsageID  ParentUsageID
123    Fowler & Bean, 1930:181  Belonoperca  Genus    123         123
234    Fowler & Bean, 1930:182  chabanaudi   Species  234         234           123
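The two example records on this slide can be modelled as a small self-referencing structure. The sketch below is one possible reading, not the GNUB schema: the class and field names are hypothetical renderings of the column headings, and the rule that a usage is a Protonym when its ProtonymID equals its own tnuID is an assumption inferred from the two rows shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TaxonNameUsage:
    tnu_id: int
    reference: str
    name_string: str
    rank: str
    protonym_id: int
    valid_usage_id: int
    parent_usage_id: Optional[int] = None

    @property
    def is_protonym(self) -> bool:
        # Assumption: a Protonym is a usage that points at itself.
        return self.tnu_id == self.protonym_id


# The two rows from the slide, keyed by tnuID.
USAGES: Dict[int, TaxonNameUsage] = {
    u.tnu_id: u
    for u in (
        TaxonNameUsage(123, "Fowler & Bean, 1930:181", "Belonoperca",
                       "Genus", 123, 123),
        TaxonNameUsage(234, "Fowler & Bean, 1930:182", "chabanaudi",
                       "Species", 234, 234, parent_usage_id=123),
    )
}


def full_name(tnu_id: int, usages: Dict[int, TaxonNameUsage]) -> str:
    """Walk ParentUsageID links upward and join the name strings."""
    parts = []
    current = usages.get(tnu_id)
    while current is not None:
        parts.append(current.name_string)
        parent = current.parent_usage_id
        current = usages.get(parent) if parent is not None else None
    return " ".join(reversed(parts))


if __name__ == "__main__":
    print(full_name(234, USAGES))
    print(USAGES[234].is_protonym)
```

Storing the species epithet once and deriving the combination from the parent link means a later usage under a different genus can reuse the same ProtonymID instead of duplicating the name.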
37. [Figure: data sources shown as logos: California Academy of Sciences, AnimalBase, Encyclopedia of Life, Marine Species Identification Portal, Catalog of Life, FishBase, WoRMS, ITIS, IRMNG, GBIF]
38. [Figure: data sources shown as logos: California Academy of Sciences, AnimalBase, BHL, Amphibian Species of the World, Hymenoptera Online, IPNI, FishBase]
39. A Global Names Architecture
[Figure: GenBank, BDWB, Hymenoptera Name Server, CalPhotos and ToL, shown with the identifier FFF7160A-372D-40E9-9611-23AF5D9EAC4C]
40. A Global Names Architecture
[Figure: ToL, GenBank, BDWB, Hymenoptera Name Server and CalPhotos, shown with the identifier FFF7160A-372D-40E9-9611-23AF5D9EAC4C]
41. Current Numbers
- 1,236 Contributors
- 70,404 Agents (Authors)
- 73,779 References
    - 8,547 Journals
    - 4,302 Books
    - 2,205 Book Sections
    - 51,205 Articles
    - 7,520 Other
- 523,700 Taxon Name Usages
    - 185,772 Protonyms
Scaling Content
~2M Species
~5-10M Protonyms
~50M Name-Strings?
~100Ms of TNUs???
Publication Workflow (Pensoft, Zootaxa, Others)
Bulk Import
- Sherborn’s Index Animalium (7,700+ References, 430K TNUs)
- Hymenoptera Name Server
- Systema Dipterorum (35K References, 130K TNUs)
- Dozen+ other nomenclator databases
- BHL (3,400 Journals, 55K Books, 100Ks of Articles)
42. Global Names Usage Bank
• BHL scanned text is processed to discover taxon names in Global Names Index.
• Taxon names in GNI are anchored to Protonyms in GNUB.
• BHL References cross-linked to Protonyms to generate Taxon Name Usages.
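The workflow on this slide (discover names in scanned text, anchor them to Protonyms, emit Taxon Name Usages) might be sketched as follows. Everything here is a simplified stand-in: the `GNI_TO_PROTONYM` lookup, the `Usage` record, the `discover_usages` function, the reference label and the sample sentences are all hypothetical, and exact substring matching replaces whatever name-finding the real services perform. The protonym numbers reuse the two example IDs from slide 26.

```python
from typing import Dict, List, NamedTuple


class Usage(NamedTuple):
    reference: str     # the scanned document the name was found in
    name_string: str   # the name string as it appears in the text
    protonym_id: int   # the Protonym the name string is anchored to


# Hypothetical lookup from known name strings to Protonym IDs.
GNI_TO_PROTONYM: Dict[str, int] = {
    "Belonoperca chabanaudi": 234,
    "Belonoperca": 123,
}


def discover_usages(reference: str, text: str,
                    index: Dict[str, int]) -> List[Usage]:
    """Find known name strings in text and turn each into a usage record."""
    found = []
    remaining = text
    # Longest names first, so a binomial is matched before its genus alone.
    for name in sorted(index, key=len, reverse=True):
        if name in remaining:
            found.append(Usage(reference, name, index[name]))
            # Blank out the match so shorter names do not match inside it.
            remaining = remaining.replace(name, " ")
    return found


if __name__ == "__main__":
    # Invented sample sentence standing in for scanned page text.
    page = "A specimen of Belonoperca chabanaudi was examined."
    for usage in discover_usages("Example scan, p. 1", page, GNI_TO_PROTONYM):
        print(usage)
```

Each discovered record pairs a reference with a name string and a Protonym, which is the shape of a Taxon Name Usage as defined on slide 26.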
44. But we all know that some
names aren’t simple
• Sometimes name strings have multiple
meanings
• In these cases a name string cannot act as a
taxon-identifier without knowing how it has
been interpreted
Taxonomy
46. In birds there are many allopatric subspecies, and different authorities interpret the inclusiveness of taxa in different ways (= different taxon concepts)
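The point about taxon concepts can be made concrete with a small sketch. The records below are entirely hypothetical: "Aus bus" is a placeholder name, "Authority A" and "Authority B" stand for two published treatments, and the population labels are invented. The idea shown is only that the same name string can cover different sets of organisms depending on whose interpretation is followed.

```python
from typing import Dict, FrozenSet, List, Tuple

# (name string, interpreting reference, populations included in the taxon)
# All values are placeholders invented for illustration.
Record = Tuple[str, str, FrozenSet[str]]

RECORDS: List[Record] = [
    ("Aus bus", "Authority A", frozenset({"pop1", "pop2", "pop3"})),
    ("Aus bus", "Authority B", frozenset({"pop1"})),
    ("Aus cus", "Authority A", frozenset({"pop4"})),
]


def concepts_for(name: str, records: List[Record]) -> Dict[str, FrozenSet[str]]:
    """Return each reference's circumscription of the given name string."""
    return {ref: members for n, ref, members in records if n == name}


def is_ambiguous(name: str, records: List[Record]) -> bool:
    """A name string is ambiguous if references disagree on what it includes."""
    return len(set(concepts_for(name, records).values())) > 1


if __name__ == "__main__":
    print(is_ambiguous("Aus bus", RECORDS))
    print(concepts_for("Aus bus", RECORDS))
```

Under this reading, a reliable taxon identifier is the pair of name string and reference rather than the name string alone, which is why the usage records above carry a reference.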
48. A Global Names Architecture
Conclusions
• Taxon Names are the fundamental link among virtually
all biodiversity information
• Biodiversity Information relates to species concepts, but
data resources are usually tied to text-string names
• Text-string names are difficult to cross-link due to
spelling variations, different genus-species
combinations, homonyms, synonyms, etc.
• Linking text-string names to concepts requires source-based
(literature-based) approach
• The key challenge is to cross-link thousands of
biodiversity datasets through taxon concepts, using only
text-string names