Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
The Thinking Behind Big Data at the NIH
1. The Thinking Behind Big Data at the NIH
Philip E. Bourne Ph.D.
Associate Director for Data Science
National Institutes of Health
http://www.slideshare.net/pebourne/
2. Disclaimer: I only started March 3,
2014
…but I and others had been thinking about this
prior to my appointment
3. Let me start with a few examples of
what motivates our thinking …
4. The Story of Meredith
http://fora.tv/2012/04/20/Congress_Unplugged_
Phil_Bourne
Stephen Friend
5. We have Entered An Era of
Deinstitutionalize & Democratization
of Science
Daniel Hulshizer/Associated Press
6. We have Entered An Era of
Deinstitutionalize & Democratization
of Science – NIH Should Support This
Daniel Hulshizer/Associated Press
7. I can’t reproduce research from my
own laboratory?
Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology:
The Case of the Tuberculosis Drugome PLOS ONE 8(11) e80278 .
Can you?
But what does it take and does
it matter?
9. Reproducibility Studies Are On-going
Across the NIH
Expected outcomes:
– Improved accessibility to data and software
– Support for workflows
– Closer relationships with publishers
– Metrics for measuring reproducibility
– Closure of the research lifecycle loop
– Rewards for reproducibility
10. You will notice that so far none of
these issues has to do with “Big
Data” per se
“Big Data” has simply bought more
attention to these issues
11. What Worries Me the Most -
Sustainability
Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
12. We Cant Go On Like This – Some
Options
Introduction of business models
– The 50% model
– Mergers
– Acquisitions associated with best practices
– Centralization
– Public/private partnerships
– Fee for service
– Archiving
Usage metrics / impact ….
13. We don’t know enough
about how current data
are used!
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
Jan. 2008 Jan. 2009 Jan. 2010Jul. 2009Jul. 2008 Jul. 2010
1RUZ: 1918 H1 Hemagglutinin
Structure Summary page activity for
H1N1 Influenza related structures
3B7E: Neuraminidase of A/Brevig Mission/1/1918
H1N1 strain in complex with zanamivir
[Andreas Prlic]
15. And This May Just be the Beginning
Evidence:
– Google car
– 3D printers
– Waze
– Robotics
From: The Second Machine Age: Work, Progress,
and Prosperity in a Time of Brilliant Technologies
by Erik Brynjolfsson & Andrew McAfee
16. Scholarship is broken
I have a paper with 16,000 citations that no one has
ever read
I have papers in PLOS ONE that have more citations
than ones in PNAS
I have data sets I am proud of few places to put
them
I edited a journal but it did not count for much
19. Approach to Solutions
New policies, e.g. data sharing, blanket consent
Funding where it is most needed
– New metrics
– De-identification
– Agile pilots
– Smaller funding for the many, but with appropriate
governance
– Competitions
– Coordination across agencies and countries
Shared infrastructure
Support for new reward systems
21. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
22. Solution: The Power of the Commons
Data
The Long Tail
Core Facilities/HS Centers
Clinical /Patient
The Why:
Data Sharing Plans
The
Commons
Government
The How:
Data
Discovery
Index
Sustainable
Storage
Quality
Scientific
Discovery
Usability
Security/
Privacy
Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment
The End Game:
KnowledgeNIH
Awardees
Private
Sector
Metrics/
Standards
Rest of
Academia
Software Standards
Index
BD2K
Centers
Cloud, Research Objects,
23. What Does the Commons Enable?
Dropbox like storage
The opportunity to apply quality metrics
Bring compute to the data
A place to collaborate
A place to discover
http://100plus.com/wp-content/uploads/Data-Commons-3-
1024x825.png
24. Commons Timeline
Spring/Summer 2014: DS group are gathering
information about activities and needs from ICs (and
outside communities).
– Shared interests in developing cloud-based biomedical
commons.
– Investigating potential models of sustainability.
– Exploring metrics of usefulness and success.
Fall 2014: Develop possible pilots to explore
options in addition to those already being
implemented by some ICs.
25. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
26. TrainingTraining
Summary of Training Workshop and Request for
Information:
– http://bd2k.nih.gov/faqs_trainingFOA.html
– Contact: Michelle Dunn (NCI)
Training Goals:
– develop a sufficient cadre of researchers skilled in the
science of Big Data
– elevate general competencies in data usage and analysis
across the biomedical research workforce.
27. BD2K Training RFAsBD2K Training RFAs
K01s for Mentored Career Development Awards,
RFA-HG-14-007
Provides salary and research support for 3-5 years for
intensive research career development under the guidance
of an experienced mentor in biomedical Big Data Science.
R25s for Courses for Skills Development, RFA-HG-
14-008
Development of creative educational activities with a
primary focus on Courses for Skills Development.
R25 for Open Educational Resources, RFA-HG-14-
009
Development of open educational resources (OER) for use
by large numbers of learners at all career levels, with a
primary focus on Curriculum or Methods Development.
29. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
30. BD2K InnovationBD2K Innovation
Data Discovery Index Coordination
Consortium (U24) (closed)
Metadata standards (under development)
Targeted Software Development
Development of Software and Analysis Methods for
Biomedical Big Data in Targeted Areas of High
Need (U01)
–RFA-HG-14-020
–Application receipt date June 20, 2014
–Topics: data compression/reduction, visualization,
provenance, or wrangling.
–Contact: Jennifer Couch (NCI) and Dave Miller (NCI)
31. BD2K InnovationBD2K Innovation
BISTI PARs
– BISTI: Biomedical Information Science and Technology
Initiative
– Joint BISTI-BD2K effort
– R01s and SBIRs
– Contacts: Peter Lyster (NIGMS) and Jennifer Couch (NCI)
Workshops:
– Software Index (Last week)
• Need to be able to find and cite software, as well as data, to
support reproducible science.
– Cloud Computing (Summer/Fall 2014)
• Biomedical big data are becoming too large to be analyzed on
traditional localized computing systems.
– Contact: Vivien Bonazzi (NHGRI)
32. BD2K Innovation CentersBD2K Innovation Centers
FY14
Investigator-initiated Centers of Excellence for Big
Data Computing in the Biomedical Sciences (U54)
RFA-HG-13-009 (closed)
BD2K-LINCS-Perturbation Data Coordination and
Integration Center (DCIC) (U54) RFA-HG-14-001
(closed)
33. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
34. Some Thoughts About Process
Machine readable data sharing plans?
Open review?
Micro funding?
Standing data committees to explore best practices?
Crowd sourcing?
35. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
37. Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Process
• Cloud – Data &
Compute
• Search
• Security
• Reproducibility
Standards
• App Store
• Coordinate
• Hands-on
• Syllabus
• MOOCs
• Community
• Centers
• Training Grants
• Catalogs
• Standards
• Analysis
• Data
Resource
Support
• Metrics
• Best
Practices
• Evaluation
• Portfolio
Analysis
The Biomedical Research Digital Enterprise
Communication
Collaboration
rogrammatic Theme
Deliverable
Example Features • IC’s
• Researchers
• Federal
Agencies
• International
Partners
• Computer
Scientists
Scientific Data Council External Advisory
Board
* Hires made
38. Components of The Academic Digital
Enterprise
Consists of digital assets
– E.g. datasets, papers, software, lab notes
Each asset is uniquely identified and has provenance,
including access control
– E.g. publishing simply involves changing the access control
Digital assets are interoperable across the enterprise
39. Life in the Academic Digital Enterprise
Jane scores extremely well in parts of her graduate on-line neurology class.
Neurology professors, whose research profiles are on-line and well described, are
automatically notified of Jane’s potential based on a computer analysis of her scores
against the background interests of the neuroscience professors. Consequently,
professor Smith interviews Jane and offers her a research rotation. During the
rotation she enters details of her experiments related to understanding a widespread
neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line
research space – an institutional resource where stakeholders provide metadata,
including access rights and provenance beyond that available in a commercial
offering. According to Jane’s preferences, the underlying computer system may
automatically bring to Jane’s attention Jack, a graduate student in the chemistry
department whose notebook reveals he is working on using bacteria for purposes of
toxic waste cleanup. Why the connection? They reference the same gene a number
of times in their notes, which is of interest to two very different disciplines – neurology
and environmental sciences. In the analog academic health center they would never
have discovered each other, but thanks to the Digital Enterprise, pooled knowledge
can lead to a distinct advantage. The collaboration results in the discovery of a
homologous human gene product as a putative target in treating the
neurodegenerative disorder. A new chemical entity is developed and patented.
Accordingly, by automatically matching details of the innovation with biotech
companies worldwide that might have potential interest, a licensee is found. The
licensee hires Jack to continue working on the project. Jane joins Joe’s laboratory,
and he hires another student using the revenue from the license. The research
continues and leads to a federal grant award. The students are employed, further
research is supported and in time societal benefit arises from the technology.
From What Big Data Means to Me JAMIA 2014 21:194
40. Some Acknowledgements
Eric Green & Mark Guyer (NHGRI)
Jennie Larkin (NHLBI)
Vivien Bonazzi (NHGRI)
Michelle Dunn (NCI)
Mike Huerta (NLM)
David Lipman (NLM)
Jim Ostell (NLM)
Peter Lyster (NIGMS)
All the over 100 folks on the BD2K team
Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328
Can you put more than just one time point on this? How about making the last sub-bullet it’s own bullet (as shown)
Yes – that is what we were thinking and wrestling with.