Building data infrastructures for science

Presented by V. Smith at the Informatics Horizons event, Natural History Museum, London, UK. 24 July 2013.

  1. 1. Building data infrastructures for science Vince Smith Informatics Horizons, London 24 July 2013
  2. 2. Overview
      1. (my) Background
         • Lice to data infrastructures!
         • Why data infrastructures at the NHM
      2. Building data infrastructures
         • Recent core investment in NHM infrastructures
         • Leveraging external investment in NHM infrastructures
         • Infrastructure design principles & coordination
      3. NHM 5-year data infrastructure horizons
         • Collections digitisation
         • Large-scale use of collections data
         • New approaches to biodiversity discovery
      4. Decadal community infrastructure challenges
         • The long view – science data strategies
         • Data modeling and real time monitoring as a unifying theme
  3. 3. 1. (my) Background
  4. 4. Lice to data infrastructures! Systematics (circa 1998) - No high level keys - Poor high level taxonomy - Just one phylogeny - Few living experts! Circa 5,000 spp. Mammals & birds 12,000 associations 15,000 potential hosts
  5. 5. My data infrastructure (circa 1998) - Taxonomic names - Authorities (name concepts) - Citations - Collection data - Morphological characters - Textual descriptions - Diagnostic keys - Illustrations - Photographs Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:760. 142 pieces of “raw” data in 4 of 54 pages, in 1 of 9,110 taxonomic papers on lice
  6. 6. “The bane of my existence is doing things that I know the computer could do for me” -- Dan Connolly, The XML Revolution (Nature, 1998)
  7. 7. My data infrastructures (circa 2004) Images Specimens (SID) LouseBASE Glasgow version at: http://darwin.zoology.gla.ac.uk/~rpage/LouseBase/2/ Lab Notebook Literature http://darwin.zoology.gla.ac.uk/~SID/ Host-Parasite Checklists PHPBib http://www2.flmnh.ufl.edu/pdb/ http://myphpbib.sourceforge.net/ http://www2.flmnh.ufl.edu/adb/
  8. 8. My publications in 2004 (enabled by these infrastructures) Making louse research more efficient, more collaborative and more productive Biol. Letters Zoo. Scripta Syst. Biol. Specimens Grzimek’s Ency. Mol. Phyl. Evol. Images Ent. Abh. Proc. R. Soc. B Lab Notebooks PLoS Biology Science Literature Checklists
  9. 9. Why data infrastructures at the NHM: lots of potential Card indices Library Archives Staff Frozen Tissue Labels Slides Spirit Dry
  10. 10. 2. Building data infrastructures
  11. 11. Recent NHM investment in science data infrastructures
      1. KE EMu (collections data)
         • Improved interface (speed, complexity, data quality, support)
         • Rapid Data Entry Web-Interface
         • Improved import & export functionality (CLD & data portal)
      2. DAMS (multimedia)?
         • Review (Digital Strategy Group)
      3. NHM Virtual Library (literature)
         • Integrated search & discovery of NHM resources
         • Better integration with external resources
      4. NHM Data Portal (access, citation & archival)
         • Discovery & visualisation of collections data on the Web
         • Web exposure & archival of NHM research datasets
         • Sub-portals for collaborative projects
         • As strategically important as the Web in 3 years' time!
      Enabling the NHM mission? Collections, Public Engagement, Research
  12. 12. What are Scratchpads? (http://scratchpads.eu) External investment in science data infrastructures
      1. ViBRANT (EU FP7 Infrastructures, 17 partners, €4.75M)
         • Virtual Biodiversity Research & Access Network for Taxonomy
         • Building & integrating tools supporting biodiversity research communities (publishing, literature & vocabulary management, ID keys, conservation assessments, mapping & visualisation tools, citizen science support)
      2. e-Monocot (NERC Consortium; Kew, Oxford & NHM, £2.38M)
         • Sustainable, integrated resource on Monocot plants
         • Content and supporting digital infrastructure (complete family-level keys & taxon pages; generic keys & pages for 8 families; select species-level resources from European Monocots, Red-list species and Slipper orchids)
      3. SYNTHESYS 1, 2 & 3 (EU FP5/6/7 Infrastructures, 18 partners, €10M)
         • Support for physical access to participating collections
         • JRA: Research into mass collections digitisation (image analysis, segmentation, transcription & crowdsourcing)
      4. Others
         • Open-UP
         • BHL-EUROPE
  13. 13. Scratchpad VRE: foundation for ViBRANT & eMonocot Taxa (Classifications, taxon profiles, specimens, literature, images, maps, phenotypic, genotypic & morphometric datasets, keys, phylogenies) Conservation Projects Regions Societies
  14. 14. Impact: Scratchpad usage (July 2013)
      • 525 Scratchpad communities with 6,550 active registered users, covering 73,444 taxa in 535,317 pages
      • More than 1,300,000 visitors in total
      • 81 paper citations in 2012
      • 119 NHM staff, 83 sites
      • 65,000 unique visitors/month
      [Chart: unique visitors per month to Scratchpad sites]
  15. 15. 3. Our near-term infrastructure horizons
  16. 16. Digital Ambition: NHM Science Strategy 2013-2017. A New Voyage of Discovery
      Three Focal Areas: 1. Scientific discovery; 2. Scientific infrastructure; 3. Scientific engagement
      Five Challenges: 1. The digital NHM; 2. Origins, evolution & futures; 3. Biodiversity discovery; 4. Natural resources & hazards; 5. Science, society & skills
      Resources & funding
      Measuring success
  17. 17. Digital Ambition: NHM Science Strategy 2013-2017. A New Voyage of Discovery
      Three Focal Areas: 1. Scientific discovery; 2. Scientific infrastructure; 3. Scientific engagement
      Five Challenges: 1. The digital NHM; 2. Origins, evolution & futures; 3. Biodiversity discovery; 4. Natural resources & hazards; 5. Science, society & skills
      Resources & funding
      Measuring success
      Collections digitisation; Large-scale use of collections data; New approaches to biodiversity discovery
  18. 18. Collections digitisation (data mobilisation)
      Target: 20M specimens available digitally in 5 years
      Challenges
         • Current fragmented efforts
         • Heterogeneity of process
         • Existing data (2.8M lots; 400k geo.; 120k images)
         • Scale of operation (iCollections, 130k in 1 year)
         • Transcription (Citizen Sci. / crowdsourcing)
         • Data quality, annotation & feedback
      Resources & funding
         • Expensive (£20-£60M @ £1-3 per specimen)
         • Linked to our public offer
      Next steps (Sept. 2013)
         • Coll. Descriptions & protocols
         • Greater coordination of effort
         • Programme group with project portfolio?
         • Planning of digital access via NHM Data Portal
  19. 19. Large scale use of collections data (or why digitise)
      Data applications help set digitisation priorities
         • Crop Wild Relatives
         • Invasive alien species
         • Impacts of climate change
         • Species conservation & protected areas
         • Impacts of human development
         • Biodiversity & human health
         • Food, farming & biofuels
         • Sustainable delivery of data
      [Bar chart by plant family (Poaceae, Rosaceae, Solanaceae, Rubiaceae, Vitaceae and others); axis range 0 to 1000]
      Potential applications for NHM data
      NHM Data Portal
         • Promote access & reuse of data
         • Sub-portals for specific themes
         • Delivering content to third parties (e.g. GBIF)
      Next steps (requirements)
         • Storage (access, backup & archival)
         • Citation, linking & measuring impact (identifiers)
         • Data layering & visualisation
         • H.P.C. (Ecol. niche modeling & analysis)
         • Data visualisation
  20. 20. New approaches to biodiversity discovery (new types of data)
      Take-home messages from NHM Tropical Biodiversity Symposium
      3-4 June 2013, NHM
      Molecular approaches
         • Molecular detection & monitoring of organisms is routine
         • Metagenomics (env. sequencing) commonplace
         • Whole genomes are normal
         • The primary route to understanding biodiversity for many
      Ecological observatories
         • Automated biodiversity detection
         • Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
         • Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
         • Monitoring human activity
         • Supplement field research, fills in gaps & scales
      Digital infrastructure requirements
         • Very large quantities of data (2.5-10TB per researcher per yr.)
         • Doesn’t map to existing NHM collections infrastructures
         • Challenge current networking & storage capacity
         • Digital and physical collections become equally important?
      22 July, 2013
  21. 21. 4. Community decadal challenges
  22. 22. The long view: community informatics challenges. GBIF GBIC Report (coming soon); EU Biodiversity Strategy (2011); Biodiv. Inf. Challenges (2013)
  23. 23. Modeling the biosphere: a (the) 30 year goal? A clear, singular long-term vision that NHM data can contribute to. Nature 2013, doi:10.1038/493295a
  24. 24. QUESTIONS
  25. 25. Infrastructure design principles*
      = experience from 7 years with the Scratchpads
      = lessons for building NHM data infrastructures?
      1. Start with needs - focus on real user needs (not just the ‘official process’)
      2. Do less - if someone else is doing it, link to it or use it
      3. Design with data - prototype and test with real users on the live website
      4. Do the hard work to make it simple - let the computer take the strain
      5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
      6. Build for inclusion - it’s easier in the long run
      7. Understand context - we are designing for people, not a screen or a brand
      8. Build digital services, not websites - there is life beyond the website
      9. Be consistent, not uniform - every circumstance is different
      10. Make things open: it makes things better - it’s more sustainable
      *https://www.gov.uk/designprinciples
  26. 26. Better NHM digital coordination from 2013 Digital Strategy Group Developing common vision High level strategy Director level engagement (Science, PEG & Corp. Services) Digital Design Group Digital Programme Group Delivering & leading digital activities Fund raising (internal & external) Prioritisation Administrative support Resource management Analysis of impact
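
The digitisation figures on slide 18 (a 20M specimen target, £1-3 per specimen, and iCollections handling 130k specimens in 1 year) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the "years at current rate" figure assumes a single workflow running at a constant iCollections-like rate, which the slides do not claim.

```python
# Figures taken from slide 18; everything else here is an illustrative assumption.
TARGET_SPECIMENS = 20_000_000        # "20M specimens available digitally"
COST_PER_SPECIMEN_GBP = (1, 3)       # "£1-3 per specimen"
ICOLLECTIONS_PER_YEAR = 130_000      # "iCollections, 130k in 1 year"


def cost_range(specimens, unit_costs):
    """Return (low, high) total cost for a (low, high) unit cost."""
    low, high = unit_costs
    return specimens * low, specimens * high


def years_at_rate(specimens, per_year):
    """Years needed for one workflow at a constant yearly throughput (assumption)."""
    return specimens / per_year


low, high = cost_range(TARGET_SPECIMENS, COST_PER_SPECIMEN_GBP)
print(f"Estimated cost: £{low / 1e6:.0f}M to £{high / 1e6:.0f}M")

years = years_at_rate(TARGET_SPECIMENS, ICOLLECTIONS_PER_YEAR)
print(f"Years at one iCollections-scale workflow: {years:.0f}")
```

The cost range reproduces the £20-£60M quoted on the slide. Under the constant-rate assumption a single iCollections-scale workflow would need roughly 154 years to reach 20M specimens, which is one way to read the "scale of operation" challenge against a 5-year target.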