A talk given at iEvobio11, a conference about Informatics for Phylogenetics, Biodiversity and Evolutionary Biology, held in Norman, Oklahoma June 21-22, 2011
The emerging biodiversity data ecosystem
1. The emerging biodiversity data ecosystem Cynthia Parr, Katja Schulz, Jennifer Hammock Smithsonian Institution Nathan Wilson, Patrick Leary Marine Biological Laboratory Richard Allen Environmental Protection Agency
2. Today’s story What is EOL Core questions Network analysis Hotlist development Page richness algorithm Conclusion: improving the health and richness of our knowledge network advances understanding
9. EOL is a content curation community Content providers Databases Journals LifeDesks Public contributions Curating Aggregation Commenting Tagging http://www.eol.org
10. Core questions Where is our knowledge about biodiversity? Where are the gaps? What are the most effective ways to fill gaps given our limited resources?
11. Network analysis with Anne Bowser, University of Maryland EOL GBIF NCBI EOL connects hubs
14. Implications and next steps Need more data Identify isolated projects & mechanisms for connecting them to the network Improve resilience & redundancy Distribute annotation & quality control Model data flow quantity and impact
18. Developing the EOL hot list Consultation with taxonomic experts Development of criteria Assembly of critical lists Establishing targets for rich taxon pages, lesser known pages
19. EOL’s hot lists Hot List (70,000 taxa): conservation concern, invasives, model organisms, ecologically important, pests, charismatics, data availability. Red Hot List (2,800 taxa): most searched, top 100 invasives, crops (food), zoos & aquaria, high traffic, higher taxa
20. Taxon page richness algorithm Richness = a (Breadth) + b (Depth) + c (Diversity), weighted 60%, 30% and 10%. Breadth: images, topics of text objects, references, maps, videos, sounds, conservation status. Depth: # words per text object, # words total. Diversity: sources (partners). Score ranges 0 – 1, threshold 0.4
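A minimal Python sketch of the weighted sum on this slide. It assumes the 60%, 30% and 10% weights apply to breadth, depth and diversity in that order, that each component has already been normalized to 0 – 1, and that a page counts as rich when its score reaches the 0.4 threshold. The function names are hypothetical, not EOL's code.

```python
# Assumed weights: a = 0.6 (breadth), b = 0.3 (depth), c = 0.1 (diversity).
WEIGHTS = {"breadth": 0.6, "depth": 0.3, "diversity": 0.1}
RICH_THRESHOLD = 0.4  # threshold from the slide; ">=" is an assumption


def richness(breadth: float, depth: float, diversity: float) -> float:
    """Combine three normalized component scores (each 0-1) into one score."""
    components = {"breadth": breadth, "depth": depth, "diversity": diversity}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


def is_rich(score: float) -> bool:
    """Return True when a richness score reaches the threshold."""
    return score >= RICH_THRESHOLD
```

Because the weights sum to 1, the combined score stays in the same 0 – 1 range as its inputs.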
21. Summary of EOL page richness Overall: 640,000 have content; 2% are rich; 25% have only links to literature. Hot List: 28% of 75K are rich; average richness = 0.30. Red Hot List: 56% of 3K are rich; average richness = 0.43
22. Strategies for improving richness Crowd-sourcing Leveraging Collections Communities Mobile apps Enabling platforms Enabling journals Data mining BHL etc. Version 2 Coming in Fall 2011!
23. The page richness index Helps fill gaps with existing knowledge Helps prioritize funding and training so that it has maximum impact on closing true gaps Will be available via API Computing and storing richness index on EOL is a step towards storing and serving computable data
24. Dynamic data summaries = new knowledge Summarize data within a partner, then across partners. For example: compute an average value for one taxon (x specimens), compare to range of values across all taxa (621,393 samples). Atlantic Cod (Gadus morhua). Jen Hammock (EOL), Edward van den Berge (OBIS)
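The example on this slide could be sketched roughly as follows in Python. The data layout (a mapping from taxon name to a list of measured values) and the function name are assumptions made for illustration; the slide does not say how EOL or OBIS store these values.

```python
from statistics import mean


def taxon_summary(samples: dict[str, list[float]], taxon: str) -> dict[str, float]:
    """Average one taxon's values and report the range across all taxa.

    `samples` maps a taxon name to its measured values (hypothetical layout).
    """
    taxon_values = samples[taxon]
    all_values = [v for values in samples.values() for v in values]
    return {
        "n_specimens": len(taxon_values),
        "taxon_mean": mean(taxon_values),
        "overall_min": min(all_values),
        "overall_max": max(all_values),
    }
```

Placing the taxon mean next to the overall minimum and maximum is what lets a reader see where one species sits relative to everything else in the data set.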
28. Thank you http://www.eol.org 160+ content partners, 2000 Flickr contributors, 1000s Wikipedia contributors, 43,000 EOL members. Funding: John D. and Catherine T. MacArthur Foundation, Alfred P. Sloan Foundation, Cornerstone Institutions, Private Donors. See Demo and Version 2 sneak peek in Software Bazaar. Leadership: Erick Mata, Bob Corrigan, Mark Westneat, Marie Studer, Tom Garnett, Jim Edwards, David Patterson. Developers: Peter Mangiafico, Jeremy Rice, Dimitri Mozzherin, David Shorthouse, Lisa Whalley and others. Biologists: Tanya Dewey, Audrey Aronowsky, Leo Shapiro
Editor's notes
The conclusion is that there is value in treating all the biodiversity information systems as part of an interconnected ecosystem. We can study the connections, and we can assess the depth of information in the network. I’ll focus on EOL’s role in the system, but I hope to make observations that will be generally useful too.
Objects such as these are essentially chunks of text sorted by topic. They span biology from physiology to ecology to evolution. Each of these credits the source, and can receive comments or ratings, or can be trusted or untrusted by curators.
So, the approach of EOL is rather different from that of many other sites. EOL is a giant mashup that creates pages, which are then available for curators (mostly credentialed scientists) to assess and rate, or for anybody to provide comments or tags. 160+ partner databases; 700 curators, 1000s of contributors, 46,000 members; 2.8 million pages; 600 thousand pages with Creative Commons content; over 2 million data objects and >1 million pages with links to research literature. Traffic in past year: 1.7 million unique users, 6.2 million page views.
Represents about 1600 projects, and 1700 instances of data flow or hyperlinks between them. Size of the vertex, or node, reflects degree, or how many links the node has. We used the Clauset-Newman-Moore algorithm to determine which vertices grouped together, then gave each group a color code. Those nodes with a degree of 15 or higher are labeled, and their edges are shown thicker than the others. These are the hubs of this network, and they are reasonably well connected to each other. (go through and expand the acronyms)
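The degree and hub-labeling steps described above can be sketched in plain Python. This is only an illustration: it assumes the network is given as a list of undirected (project, project) pairs, the function names are hypothetical, and it does not implement the Clauset-Newman-Moore grouping step.

```python
from collections import Counter

HUB_DEGREE = 15  # nodes with this degree or higher are labeled as hubs


def degrees(edges: list[tuple[str, str]]) -> Counter:
    """Count how many links touch each node, treating edges as undirected."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def hubs(edges: list[tuple[str, str]], min_degree: int = HUB_DEGREE) -> list[str]:
    """Return the names of nodes whose degree is at least `min_degree`."""
    return sorted(node for node, d in degrees(edges).items() if d >= min_degree)
```

The same degree counts could drive both the node size and the decision about which labels and thicker edges to draw.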
Daphne Fautin’s Hexacorallians of the World
With this as a baseline, how connected and resilient is the network? Over time we want it to become more connected and resilient, both to enable discovery and recovery in case of catastrophic problems. We can also use this to develop effective mechanisms to annotate data and improve data quality. If the same data appear on different parts of the network, and someone reports an error, the repair of that data needs to propagate effectively. What are the factors that influence data flow quantity and effectiveness…
Brighter green indicates a higher % of descendants with text; the size of each square is the number of descendants, square-root scaled.
Ecologically important – keystone species, indicator species
Inspired by community ecology & measures of species diversity, which of course were originally inspired by information theory, but we haven’t used those measures. Instead we put together these factors in a way that let us assign weights to different factors based on how well they capture “a rich page”. We sampled dozens of pages and had team members assess them for their gestalt “richness” based on their own criteria. Then we compared those scores to those generated by the algorithm, and iteratively changed weights until we achieved a set of weights that appeared to reflect human perception of “richness.” Note that there’s a penalty: unvetted material is only worth about 75% of vetted material. Also there are maximums for many of these input values – having 200 images may not make a page much more rich than having 25 images. We reserve the right to change this to ensure that the index is as useful as possible. Like Google PageRank, we want to ensure that nobody can game the system.
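The penalty and the maximums could look something like this Python sketch. The 0.75 factor comes from the note above; the cap of 25 used in the default argument is only borrowed from the images example and is an assumption, as is the function name.

```python
UNVETTED_WEIGHT = 0.75  # unvetted material is worth about 75% of vetted


def capped_score(vetted: int, unvetted: int, cap: int = 25) -> float:
    """Turn raw object counts into a 0-1 score with a penalty and a maximum.

    `cap` is a hypothetical maximum: counts beyond it add nothing.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    effective = vetted + UNVETTED_WEIGHT * unvetted
    return min(effective, cap) / cap
```

With a cap like this, a page with 200 vetted images scores the same as one with 25, which also makes the index harder to game by bulk uploading.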
Also note that there is an implication that a “rich page” is a “high quality page” – not necessarily true, but often it is. As EOL goes forward with our version 2 we’ll be gathering other inputs that can tell us if a page is successful – ratings of its objects, for example.
Here’s what we are already doing – for the OBIS specimens which have rich environmental data associated with them. We could add similar values from other partners, for example from GenBank, where some samples that are sequenced are collected from known environments, or from ecological studies that aren’t part of the specimen-based system. A user could subscribe to this value and get alerts if new values come in that are outside this range. One could set up a model for this taxon and its relatives, predicting expected values; then if new values are aggregated from any of EOL’s partners that violate the model, the scientist who has published the model gets a notification. It could be that there’s a flaw in the data integration, or some violation of assumptions about the measurement workflow. Or it could be that there’s something we truly didn’t understand before. This means truly leveraging the scientific output of many researchers, better use of resources, and more rapid advances in understanding of biological systems.
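The "alert on values outside the range" idea can be sketched in a few lines of Python. This is a hypothetical illustration, not an EOL feature: it assumes the known range is simply the minimum and maximum of the values seen so far, and the function name is made up.

```python
def out_of_range(known_values: list[float], new_values: list[float]) -> list[float]:
    """Return the new values that fall outside the range of the known values."""
    if not known_values:
        raise ValueError("need at least one known value to define a range")
    low, high = min(known_values), max(known_values)
    return [value for value in new_values if value < low or value > high]
```

A subscriber would be notified whenever this list is non-empty. A model-based version would replace the min/max range with the model's predicted interval.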
Analogous to the study of ecosystems, where we seek to build an understanding of entire systems with many kinds of inputs, both biotic and abiotic.