The document summarizes efforts to automatically mine and extract information from scholarly publications related to plants. It describes how ContentMine downloads publications, normalizes them, classifies them by discipline, and extracts semantics, annotations, diagrams and other content. It provides examples of using ContentMine to search for information about specific plants and phytochemicals, and to automatically annotate and link publications. The goal is to help researchers more efficiently find relevant information from the vast number of publications being produced each day.
High throughput mining of the plant-science literature
1. Mining science from the plant literature
ContentMine
Rothamsted Research,
Harpenden, UK, 2016-09-12
Peter Murray-Rust
[1] University of Cambridge [2] The ContentMine
5,000 scholarly publications every day.
How many relate to plants?
2. Overview
• Scholarly literature
• Automation of downloading, normalization
• Discipline-dependent semantics/ontology
• Classification
• Extraction
• Annotation
• Mining diagrams
• Politics of mining
3. The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
5. Output of scholarly publishing
586,364 Crossref DOIs per month (July 2015) [1]: 8,000 papers/day
2.5 to 3 million (papers + supplemental data) per year
each 3 mm thick
4,500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
35. Automatic extraction of plant species from the literature
Lars Willighagen, ContentMine Fellow 2016, NL
https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html
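As a rough illustration of what species extraction can involve, the following Python sketch matches candidate Latin binomials with a regular expression and keeps those whose genus appears in a lookup list. It is an assumption-laden toy, not the method used in the fellowship project or in ContentMine software: the names KNOWN_GENERA, BINOMIAL and find_species are hypothetical, and a real run would need a full taxonomic dictionary.

```python
import re
from collections import Counter

# Hypothetical genus list for illustration only; a real run would load a
# full taxonomic dictionary instead of four hand-picked genera.
KNOWN_GENERA = {"Arabidopsis", "Triticum", "Zea", "Oryza"}

# A capitalised word followed by a lowercase word of three or more letters,
# e.g. "Zea mays". This also matches ordinary phrases such as "The results",
# which is why the genus filter below is needed.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def find_species(text):
    """Count candidate binomial names whose genus is in KNOWN_GENERA."""
    counts = Counter()
    for genus, epithet in BINOMIAL.findall(text):
        if genus in KNOWN_GENERA:
            counts[f"{genus} {epithet}"] += 1
    return counts


sample = (
    "Arabidopsis thaliana and Triticum aestivum were grown together. "
    "The results for Arabidopsis thaliana differed."
)
print(find_species(sample))
```

The genus filter is the design choice that matters here: the pattern alone over-matches, so precision depends entirely on the quality of the dictionary behind it.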
50. Systematic Reviews
Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?
51. Polly has 20 seconds to read this paper…
…and 10,000 more
52. ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
53. 400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature for “2009-0100068-41”
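A search of this kind can be sketched in Python as an exact-identifier match over a set of paper texts. This is a minimal sketch under stated assumptions: the function papers_mentioning and the three sample texts are hypothetical, and the slides do not say how ContentMine performs the search.

```python
import re


def papers_mentioning(identifier, papers):
    """Return ids of papers whose text contains the identifier verbatim.

    The lookarounds reject matches embedded in a longer token, so a search
    for "2009-0100068-41" does not hit "2009-0100068-410".
    """
    pattern = re.compile(r"(?<![\w-])" + re.escape(identifier) + r"(?![\w-])")
    return [pid for pid, text in papers.items() if pattern.search(text)]


# Hypothetical texts standing in for downloaded, normalized papers.
papers = {
    "paper-a": "Registered as 2009-0100068-41 in a trial registry.",
    "paper-b": "No registration number is reported.",
    "paper-c": "See trial 2009-0100068-410 for details.",
}
print(papers_mentioning("2009-0100068-41", papers))
```

Exact matching keeps false positives low, but it will miss identifiers that papers print with different spacing or punctuation, so a production search would probably need some normalization first.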
54. (2x digital music industry!)
ContentMine.org
Non-profit
Collaborations include:
• University of Cambridge Plant Sciences
• TGAC/Open Plant
• EuropePMC
• Wikimedia
• Some publishers