Big Data and its Role in Biomedical Research
1. Big Data and its Role in
Biomedical Research
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
peb6a@virginia.edu
https://www.slideshare.net/pebourne
10/10/18 ACoP 2018 1
@pebourne
2. Bias
• Can't help but be influenced by my time as Associate
Director for Data Science (ADDS) at NIH
• Now very much engaged in data science across disciplines
– broader but shallower perspective
• Knowing my long-time colleague Prof. Lei Xie and others
will follow me with a deeper perspective
4. Big data and data
science are like
the Internet…
If I asked you to
define them you
would all say
something
different, yet you
use them every
day…
http://vadlo.com/cartoons.php?id=357
5. So what do I mean by big data/data
science?
• Use of the ever increasing amount of open, complex, diverse
digital data
• Finding ways to ask and then answer relevant questions by
combining such diverse data sets
• Arriving at statistically significant conclusions not otherwise
obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that improve the human
condition
7. Machine learning has been around for over 20
years – why the fuss now?
• Amount of data available for training
• Open source: R and Python
• Advances in computing (e.g., GPUs) allow for deeper neural nets (deep
learning)
• Algorithmic efficiency gains (e.g., in backpropagation)
• Success promotes further research
• Commercialization
Pastur-Romay et al. 2016 doi:10.3390/ijms17081313
8. The NIH view
• Big Data
– Total data from NIH-funded research in 2016 estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10
PB in 2016
• Dark Data
– Only 12% of data described in published papers is in recognized
archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data
archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
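The proportions on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (650 PB, 20 PB, 3 PB, 12%); the variable and function names are illustrative and not part of the original talk.

```python
# Figures quoted on the slide, in petabytes (PB).
NIH_TOTAL_PB = 650           # estimated data from NIH-funded research, 2016
NCBI_NLM_PB = 20             # portion held in NCBI/NLM
LIBRARY_OF_CONGRESS_PB = 3   # Library of Congress, 2012
ARCHIVED_PERCENT = 12        # published data found in recognized archives


def share_percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


ncbi_share = share_percent(NCBI_NLM_PB, NIH_TOTAL_PB)
loc_equivalents = NIH_TOTAL_PB / LIBRARY_OF_CONGRESS_PB
dark_percent = 100 - ARCHIVED_PERCENT

print(f"NCBI/NLM share of NIH-funded data: {ncbi_share:.1f}%")
print(f"NIH-funded data in Library-of-Congress units: {loc_equivalents:.0f}")
print(f"Dark data: {dark_percent}%")
```

The NCBI/NLM share works out to roughly 3%, matching the slide, and the 2016 total is on the order of two hundred times the 2012 Library of Congress figure.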
9. NIH strategic plan for data
• Support a Highly Efficient and Effective
Biomedical Research Data
Infrastructure
• Promote Modernization of the Data-
Resources Ecosystem
• Support the Development and
Dissemination of Advanced Data
Management, Analytics, and
Visualization Tools
• Enhance Workforce Development for
Biomedical Data Science
• Enact Appropriate Policies to Promote
Stewardship and Sustainability
https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf
10. A research data infrastructure requires
we move from pipes to platform…
which raises the question ...
Vivien Bonazzi
Bonazzi & Bourne 2017, PLoS Biol. 7;15(4):e2001818.
Will biomedical research become more like Airbnb?
11. I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship between consumer
(renter) and supplier (host)
• The platform focuses on maximizing the exchange of services between supplier and
consumer and maximizing the amount of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night.
– Valuation of US $25bn
Bonazzi & Bourne 2017, PLoS Biol. 7;15(4):e2001818.
13. The pillars of data science operate
within this platform environment
QSP
14. Let's briefly focus on those five pillars
in the context of QSP …
15. Data acquisition
The data production issue (the V’s of Big Data): experimentally
• Estimated (2017) that ≈2.5 quintillion (2.5×10^18) bytes of data are generated daily, with 90%
of all the world’s data having been created in the past two years.
• Plaintext PDB files are typically a few hundred KB each (…but, that’s just the start!)
Mura et al. 2018 Curr Opin Struct Biol. 52:95-102
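To put the daily figure in more familiar units, here is a rough conversion. The 2.5×10^18 bytes/day value comes from the slide; the 300 KB size used for a "typical" PDB file is an assumption chosen only to stand in for "a few hundred KB", and decimal (SI) prefixes are assumed throughout.

```python
# Daily data production quoted on the slide (2017 estimate).
BYTES_PER_DAY = 2.5e18

# Decimal (SI) unit sizes in bytes.
PETABYTE = 1e15
EXABYTE = 1e18

# Assumption for illustration only: one plaintext PDB file is about 300 KB.
ASSUMED_PDB_FILE_BYTES = 300e3

pb_per_day = BYTES_PER_DAY / PETABYTE
eb_per_day = BYTES_PER_DAY / EXABYTE
pdb_files_per_day = BYTES_PER_DAY / ASSUMED_PDB_FILE_BYTES

print(f"{pb_per_day:,.0f} PB per day")
print(f"{eb_per_day:.1f} EB per day")
print(f"about {pdb_files_per_day:.2e} PDB-sized files per day")
```

Under these assumptions the daily volume is about 2,500 PB, which illustrates why individual structure files are only the start of the data production issue.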
16. Data integration and engineering
• Generic
– Ontologies
– Object identifiers
– Indexing schemes
– Common data models
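As a minimal sketch of how these generic pieces fit together, the example below maps records from two hypothetical sources onto one common data model, normalizes diagnosis labels against a toy synonym table (standing in for an ontology), assigns prefixed object identifiers, and builds a simple index by term. All field names, sources, and the synonym table are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """Common data model shared by all sources (hypothetical schema)."""
    identifier: str      # object identifier, prefixed by source
    ontology_term: str   # normalized controlled-vocabulary term
    dose_mg: float       # dose in a single agreed unit
    source: str


# Toy synonym table standing in for a real ontology lookup.
SYNONYM_TO_TERM = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}


def normalize_term(label: str) -> str:
    """Map a free-text label to its preferred term, if known."""
    key = label.strip().lower()
    return SYNONYM_TO_TERM.get(key, key)


def from_source_a(row: dict) -> CommonRecord:
    """Source A (hypothetical) reports doses in milligrams."""
    return CommonRecord(
        identifier=f"A:{row['id']}",
        ontology_term=normalize_term(row["diagnosis"]),
        dose_mg=float(row["dose_mg"]),
        source="A",
    )


def from_source_b(row: dict) -> CommonRecord:
    """Source B (hypothetical) reports doses in grams."""
    return CommonRecord(
        identifier=f"B:{row['subject']}",
        ontology_term=normalize_term(row["dx"]),
        dose_mg=float(row["dose_g"]) * 1000.0,
        source="B",
    )


def build_index(records: list) -> dict:
    """Index records by ontology term so they can be queried together."""
    index = {}
    for record in records:
        index.setdefault(record.ontology_term, []).append(record)
    return index


records = [
    from_source_a({"id": 1, "diagnosis": "Heart attack", "dose_mg": 250}),
    from_source_b({"subject": "x9", "dx": "MI", "dose_g": 0.5}),
]
index = build_index(records)
for record in index["myocardial infarction"]:
    print(record.identifier, record.dose_mg)
```

Once both sources share identifiers, terms, and units, a single lookup retrieves records that originally used different labels and scales.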
19. Ethics, law & policy
• Landmark studies identify
histone mutations as
recurrent driver mutations in
DIPG ~2012
• Almost 3 years later, in
largely the same datasets,
but partially expanded, the
same two groups and 2
others identify ACVR1
mutations as a secondary,
co-occurring mutation
From Adam Resnick
Diffuse Intrinsic Pontine Glioma (DIPG)
20. Conclusion:
Driven by large amounts of open
digital data of different types, and by new
algorithms and approaches, biomedical
researchers are destined to follow the
private sector towards the fourth
paradigm
21. Acknowledgements
The BD2K Team at NIH
My Colleagues at UVA
The 150 folks who have passed through my laboratory
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Zheng Zhao Lei Xie
Model integration in systems pharmacology. Diverse models need to be integrated
across multiple methodologies, multiple heterogeneous data sets, organismal hierarchy, and
species (transportability).
$1.25bn per year to capture all data.
After a significant effort at reduction, intramurally data is spread across > 60 data centers; imagine the extramural situation.
Distribution of kinases and the number of covalent small-molecule kinase inhibitors (CSKIs) for every targeted kinase across the human kinome