Dr David Schindel and Mike Trizna - BOL Data Portal
1. The Barcode of Life
Data Portal
(http://bol.uvm.edu)
Dr. David E Schindel, Executive Secretary
Michael Trizna, Database Specialist
Consortium for the Barcode of Life (CBOL)
Smithsonian Institution
Washington, DC
www.barcodeoflife.org
SchindelD@si.edu and TriznaM@si.edu
2. Contents of Presentation
Crowd-sourced open source software
How does Data Portal complement BOLD
and GenBank?
Data Portal capabilities
Case Study: Smithsonian frozen bird
tissue project
3. An Experiment in Museum Tissue
Mining and Fast Data Release
Tissue sampling winter/spring
Sequencing completed in September
Sequence quality control in October
Taxonomic checking in early November
– Obvious errors removed
– Minor discrepancies remain
Data released for Adelaide Conference
– Crowd-sourced annotation by community
– Will data be mis-used?
4. Unique Data Portal Capabilities
Creating customized datasets from public
and/or your private data
Online library of standard datasets
Support sharing within project teams using
Connect IDs, easy link to Working Groups
Running different identification analyses
based on different methodologies:
– Standard sequence input using FASTA format
– Use standard or customized datasets
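As a rough illustration of the FASTA input mentioned above, here is a minimal Python sketch that reads FASTA text into a dictionary. The function name `parse_fasta` and the sample records are hypothetical; this is not the Data Portal's own code, only a sketch of the file format it accepts.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence} (illustrative helper)."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier")
            current_id = tokens[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap; join them and normalize case.
            records[current_id] += line.upper()
    return records


# Hypothetical input, not real barcode data.
example = """>sample_1 hypothetical COI fragment
ACGTTGCA
ggctaa
>sample_2
TTGACC
"""
sequences = parse_fasta(example.splitlines())
```

Each record ID maps to one joined, upper-cased sequence, which is the form most identification routines expect as a query.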
8. Existing Data Analysis Packages
LIST of packages
– BLOG
– BRONX
– Kernel
– CAOS
– USEARCH
– BLAST
Output of identification routines as
probabilities of assignment
9. Data Analysis Methods Session
New packages presented Friday
afternoon:
– Damon Little: Automatic Plants Barcode
pipeline (from raw traces to trimmed/edited
sequences)
– Ka Hou Chu: Composite Vector Method
(profile trees for faster alignment and
tree-based analysis)
– Alain Franc: Matching Next Generation results
to Sanger-based reference records
14. The USNM Bird Project
USNM Division of Birds frozen tissue
collection:
– 21,104 specimens, 2,512 species
Which new ones to sample/barcode?
Public records for birds
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
15. Moving Data Among
BOLD, GenBank, Data Portal
– USNM Excel Spreadsheet (KE-Emu Source)
– BOLD: Split into projects that consist of 2-4 plates
– Local database that holds all fields from the original spreadsheet
– Data Portal Aggregator database
16. Creating a ‘Pick List’
Spreadsheet of tissue samples compared
with:
– ITIS taxonomy
– Clements species list in BOLD
– Counts of GenBank and/or public BOLD
records
– Geographic information
Screenshot of USNM list side-by-side with
BOLD records
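The comparison described above can be sketched in a few lines of Python: keep tissue samples whose names appear in a reference taxonomy, then rank species so that those with the fewest public records come first. The function `build_pick_list`, the placeholder species names and the counts are all hypothetical; the actual USNM spreadsheet workflow may differ.

```python
from typing import Dict, List, Set, Tuple


def build_pick_list(
    tissue_species: List[str],
    accepted_names: Set[str],
    public_record_counts: Dict[str, int],
) -> List[Tuple[str, int]]:
    """Rank sampled species by how few public records they already have."""
    candidates: Dict[str, int] = {}
    for name in tissue_species:
        if name not in accepted_names:
            continue  # name not found in the reference taxonomy; set aside
        # Species with no public records default to a count of zero.
        candidates[name] = public_record_counts.get(name, 0)
    # Least-covered species first; ties broken alphabetically.
    return sorted(candidates.items(), key=lambda item: (item[1], item[0]))


# Placeholder inputs, not real collection data.
tissues = ["Species A", "Species B", "Species C", "Species A", "Species X"]
accepted = {"Species A", "Species B", "Species C"}
counts = {"Species A": 12, "Species C": 3}

pick_list = build_pick_list(tissues, accepted, counts)
```

Sorting by existing coverage puts species with no public barcode at the top, which matches the goal of finding "first-BARCODE" candidates.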
19. USNM Bird Dataset
3,150 tissues sampled
168 failed sequences
94 problematic sequences
166 clustered badly
2,761 ‘BARCODE-ready’ samples
1,147 ‘first-BARCODE’ species
91% increase over 1,259 barcoded species
(3,892 listed in BOLD, including BINs and others)
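The 91% figure follows from the two species counts on this slide. A short Python check, assuming the increase is measured as new first-BARCODE species relative to the previously barcoded species (the helper name `percent_increase` is illustrative):

```python
def percent_increase(added: int, baseline: int) -> float:
    """Percentage by which `added` new items grow a `baseline` count."""
    return 100.0 * added / baseline


first_barcode_species = 1147      # from the slide
previously_barcoded = 1259        # from the slide

increase = percent_increase(first_barcode_species, previously_barcoded)
new_total = previously_barcoded + first_barcode_species

print(f"{increase:.0f}% increase, {new_total} barcoded species in total")
```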
20. Two problematic clades, USNM data
Flycatchers: Family Tyrannidae
– Sublegatus arenarum, S. modestus, S.
obscurior, S. sp.
– Conopias parvus, C. albovittatus
– Myiarchus ferox, M. swainsoni, M. sp.
Hummingbirds: Family Trochilidae
– Phaethornis longuemareus
Inconsistencies within USNM dataset
Incompatibilities with public, other data
23. What testing dataset to use?
ID trees and analytical routines could use:
– All public bird COI records: 10,967
– All BARCODE records in GenBank: 8,419
– BARCODE with taxonomic names: 7,965
– BARCODE, name and 2 traces: 2,388
Which ones have reliable taxonomic IDs?
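The four candidate datasets are nested: each one adds a stricter requirement to the one before it. The Python sketch below shows that nesting on a handful of made-up records. The `CoiRecord` fields and the sample values are assumptions for illustration, not the GenBank or BOLD schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CoiRecord:
    accession: str                 # hypothetical identifier
    has_barcode_keyword: bool      # record carries the BARCODE keyword
    species_name: Optional[str]    # None when no taxonomic name is given
    trace_count: int               # number of trace files attached


def nested_subsets(records: List[CoiRecord]) -> Dict[str, int]:
    """Count records at each successively stricter tier."""
    barcode = [r for r in records if r.has_barcode_keyword]
    named = [r for r in barcode if r.species_name]
    with_traces = [r for r in named if r.trace_count >= 2]
    return {
        "all_coi": len(records),
        "barcode": len(barcode),
        "barcode_named": len(named),
        "barcode_named_2_traces": len(with_traces),
    }


# Made-up records, one falling out at each tier.
records = [
    CoiRecord("R1", True, "Species A", 2),
    CoiRecord("R2", True, None, 2),
    CoiRecord("R3", True, "Species B", 1),
    CoiRecord("R4", False, "Species C", 0),
]
tier_counts = nested_subsets(records)
```

None of these filters checks whether the taxonomic name is correct, which is the open question the slide raises.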
24. Preparing a Data Release Paper
Summary statistics from Data Portal
Figures from BOLD