Presentation made to the incoming bioinformatics and systems biology students at UCSD on how they could get involved in changing scholarly communication. Given February 28, 2011.
Scholarly Communication for Bioinformatics Students
1. The Changing Face of Scholarly Communication and the Opportunities it Affords the Bioinformatics/Systems Biology Student Philip E. Bourne University of California San Diego pbourne@ucsd.edu http://www.sdsc.edu/pb Third UCSD Bioinformatics and Systems Biology Expo – 2/28/2011
3. Observation 2: We Are a Field That Uses/Produces Public On-Line Data Like No Other
4. Observation 3: We Have Shaped the Way Data Are Shared – We Have Had Very Little Impact on Publications
5. Perhaps it is Time We Thought Less About a Publication as a Reward and More About How it Can be Presented to Maximize its Use
6. So What Needs to Happen? We need data and knowledge about that data to interoperate, i.e. we need new kinds of fast, versatile publications and data archives. We need to be more open with both. We need to think more about the tools that analyze, visualize and annotate data to maximize knowledge discovery. Reward systems need to change. We need scientist management tools. We need to be less fixated on the big data problems. We need to unleash the full power of the Internet. (Slide axis: Hard to Easy)
8. Josh Sommer – A Remarkable Young Man. Co-founder & Executive Director of the Chordoma Foundation http://sagecongress.org/Presentations/Sommer.pdf
9. Chordoma A rare form of brain cancer No known drugs Treatment – surgical resection followed by intense radiation therapy http://upload.wikimedia.org/wikipedia/commons/2/2b/Chordoma.JPG
13. "If I have seen further it is only by standing on the shoulders of giants" (Isaac Newton). From Josh’s point of view the climb up just takes too long: > 15 years and > $850M to be more precise. Adapted: http://sagecongress.org/Presentations/Sommer.pdf
17. So We Have Seen What Needs to Change and Why. What About the How?
18. We Need Data and Knowledge About That Data to Interoperate. The Knowledge and Data Cycle: (0) Full text of PLoS papers stored in a database. (1) A link brings up figures from the paper. (2) Clicking the paper figure retrieves data from the PDB, which is analyzed. (3) A composite view of journal and database content results. (4) The composite view has links to pertinent blocks of literature text and back to the PDB. User clicks on content; metadata and web services to data provide an interactive view that can be annotated; selecting features provides a data/knowledge mashup; analysis leads to new content I can share. PLoS Comp. Biol. 2005 1(3) e34
19. We Need Data and Knowledge About That Data to Interoperate – What is Stopping Us? Open Access Governance – publishers vs. database providers Reward Metadata standards for provenance, privacy etc. Exemplars ….
20. A Small Example - The World Wide Protein Data Bank The single worldwide repository for data on the structure of biological macromolecules Vital for drug discovery and the life sciences 39 years old Free to all http://www.wwpdb.org PLoS Comp. Biol. 2005 1(3) e34
21. The World Wide Protein Data Bank – The Best Case Scenario Paper not published unless data are deposited – strong data to literature correspondence Highly structured data conforming to an extensive ontology DOIs assigned to every structure http://www.wwpdb.org PLoS Comp. Biol. 2005 1(3) e34
22. Example Interoperability: The Database View www.rcsb.org/pdb/explore/literature.do?structureId=1TIM BMC Bioinformatics 2010 11:220
23. Example Interoperability: The Literature View http://biolit.ucsd.edu Nucleic Acids Research 2008 36(S2) W385-389
24. ICTP Trieste, December 10, 2007
25. Semantic Tagging & Widgets are a Powerful Tool to Integrate Data and Knowledge of that Data, But as Yet Not Used Much Will Widgets and Semantic Tagging Change Computational Biology? PLoS Comp. Biol. 6(2) e1000673
26. Semantic Tagging of Database Content in The Literature or Elsewhere http://www.rcsb.org/pdb/static.do?p=widgets/widgetShowcase.jsp PLoS Comp. Biol. 6(2) e1000673 Semantic Tagging
29. This is Literature Post-processing: Better to Get the Authors Involved. Authors are the absolute experts on the content. More effective distribution of labor. Add metadata before the article enters the publishing process.
30. Word 2007 Add-in for Authors. Allows authors to add metadata as they write, before they submit the manuscript. Authors are assisted by automated term recognition: OBO ontologies, database IDs. Metadata are embedded directly into the manuscript document via XML tags, OOXML format: open, machine-readable. Open source, Microsoft Public License. http://www.codeplex.com/ucsdbiolit
31. Challenges. Authors' carrot: if one or more publishers fast-tracked a paper that had semantic markup it might catch on. Publishers' carrot: competitive advantage.
32. The Promise – A Hypothetical Example Cardiac Disease Literature Immunology Literature Shared Function
33. High-throughput Biology Requires High-throughput Knowledge Discovery Consider an Example from Our Own Work… Roger Chang Will Give You Another Example
34. The TB-Drugome: (1) Determine the TB structural proteome. (2) Determine all known drug binding sites from the PDB. (3) Determine which of the sites found in 2 exist in 1. (4) Call the result the TB-drugome. Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976 High-throughput Data Requires High-throughput Knowledge
35. 1. Determine the TB Structural Proteome. TB proteome: 3,996 proteins, of which 284 have solved structures, 1,446 have high quality homology models only, and the remaining 2,266 have neither. High quality homology models from ModBase (http://modbase.compbio.ucsf.edu) increase structural coverage from 7.1% to 43.3%. Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
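The coverage figures on this slide follow directly from the three counts. A minimal check, using only the numbers shown above (the variable and function names are illustrative, not from the talk):

```python
# Counts taken from the slide; names are illustrative.
TB_PROTEOME = 3996      # proteins in the TB proteome
SOLVED = 284            # proteins with solved structures in the PDB
HOMOLOGY_MODELS = 1446  # proteins covered only by high quality homology models


def coverage_percent(covered: int, total: int) -> float:
    """Return structural coverage as a percentage rounded to one decimal."""
    return round(100.0 * covered / total, 1)


solved_only = coverage_percent(SOLVED, TB_PROTEOME)
with_models = coverage_percent(SOLVED + HOMOLOGY_MODELS, TB_PROTEOME)
uncovered = TB_PROTEOME - SOLVED - HOMOLOGY_MODELS

print(solved_only, with_models, uncovered)  # 7.1 43.3 2266
```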
36. 2. Determine all Known Drug Binding Sites in the PDB. Searched the PDB for protein crystal structures bound with FDA-approved drugs: 268 drugs bound in a total of 931 binding sites. [Figure: No. of drugs plotted against No. of drug binding sites; labeled drugs: acarbose, darunavir, alitretinoin, conjugated estrogens, chenodiol, methotrexate.] Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
37. Map 2 onto 1 – The TB-Drugome http://funsite.sdsc.edu/drugome/TB/ Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).
38. From a Drug Repositioning Perspective. Similarities between drug binding sites and TB proteins are found for 61/268 drugs. 41 of these drugs could potentially inhibit more than one TB protein. [Figure: No. of drugs plotted against No. of potential TB targets; labeled drugs: conjugated estrogens & methotrexate, chenodiol, levothyroxine, testosterone, raloxifene, alitretinoin, ritonavir.] Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
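Tallies of this kind (drugs with at least one similar TB binding site, and drugs with more than one potential target) can be derived from a list of drug-to-protein similarity hits. The sketch below assumes the hits are already available as (drug, protein) pairs; the function name, the placeholder identifiers and the toy data are invented for illustration and are not results from the TB-drugome.

```python
from collections import defaultdict


def summarize_hits(hits, n_drugs_screened):
    """Tally drugs by the number of distinct TB proteins they may bind."""
    targets = defaultdict(set)
    for drug, protein in hits:
        targets[drug].add(protein)  # repeated pairs collapse into one target
    multi = sum(1 for proteins in targets.values() if len(proteins) > 1)
    return {
        "screened": n_drugs_screened,
        "with_hit": len(targets),
        "multi_target": multi,
    }


# Toy input with placeholder names, invented for illustration only.
toy_hits = [
    ("drug_A", "P1"), ("drug_A", "P2"), ("drug_A", "P1"),
    ("drug_B", "P1"),
    ("drug_C", "P3"), ("drug_C", "P4"),
]
summary = summarize_hits(toy_hits, n_drugs_screened=5)
print(summary)
```

On the real data, `with_hit` and `multi_target` would correspond to the 61 and 41 reported on the slide, out of 268 drugs screened.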
40. We Need Better Ways to Associate Data and Knowledge, and it's More Than Just Text Mining of PubMed Abstracts – it's About Changing the System. Our Future is in Your Hands!
41. Acknowledgements. BioLit Team: Lynn Fink, Parker Williams, Marco Martinez, Rahul Chandran, Greg Quinn. Microsoft Scholarly Communications: Pablo Fernicola, Lee Dirks, Savas Parastatidis, Alex Wade, Tony Hey. RCSB PDB team: Andreas Prlić, Dimitris Dimitropoulos. TB Drugome Team: Lei Xie, Sarah Kinnings, Li Xie. http://funsite.sdsc.edu/drugome/TB/ http://biolit.ucsd.edu http://www.pdb.org http://www.codeplex.com/ucsdbiolit
There are 3,996 proteins in the TB proteome. The PDB holds 749 solved structures, representing a total of 284 proteins (7.1% coverage). ModBase contains homology models for the entire TB proteome. 1,446 ‘high quality’ homology models were added to the data set, increasing structural coverage to 43.3%. Retained only those models with a model score of > 0.7 and a ModPipe quality score of > 1.1 (2818 models). There were multiple models per protein. For each TB protein, chose the model with the best model score, and if they were equal, chose the model with the best ModPipe quality score (1703 models). However, 251 (+6) models were removed since they correspond to TB proteins that already have solved structures (1446 models remained). The model score is a score for the reliability of a model, derived from statistical potentials (F. Melo, R. Sanchez, A. Sali, 2001). A model is predicted to be good when the model score is higher than a pre-specified cutoff (0.7). A reliable model has a probability of the correct fold that is larger than 95%. A fold is correct when at least 30% of its Cα atoms superpose within 3.5 Å of their correct positions. The ModPipe Protein Quality Score (MPQS) is a composite score comprising sequence identity to the template, coverage, and the three individual scores e-value, z-DOPE and GA341. We consider an MPQS of > 1.1 as reliable.
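The selection rules in these notes can be sketched as a three-step filter. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Model` record, its field names, the function name and the toy identifiers are all hypothetical, and only the two cutoffs (0.7 and 1.1) and the tie-breaking order come from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    protein: str        # TB protein identifier (hypothetical field name)
    model_score: float  # ModBase model score
    mpqs: float         # ModPipe Protein Quality Score


def select_models(models, solved_proteins, min_score=0.7, min_mpqs=1.1):
    """Keep one reliable model per protein that lacks a solved structure."""
    best = {}
    for m in models:
        # Step 1: keep only models above both reliability cutoffs.
        if m.model_score <= min_score or m.mpqs <= min_mpqs:
            continue
        # Step 2: per protein, prefer the higher model score, then the higher MPQS.
        current = best.get(m.protein)
        if current is None or (m.model_score, m.mpqs) > (current.model_score, current.mpqs):
            best[m.protein] = m
    # Step 3: drop proteins that already have a solved structure.
    return {p: m for p, m in best.items() if p not in solved_proteins}


# Toy input, invented for illustration only.
models = [
    Model("protein_1", 0.95, 1.4),
    Model("protein_1", 0.95, 1.6),  # ties on model score, wins on MPQS
    Model("protein_2", 0.65, 1.8),  # fails the model score cutoff
    Model("protein_3", 0.90, 1.2),  # passes, but the protein is already solved
    Model("protein_4", 0.80, 1.0),  # fails the MPQS cutoff
]
selected = select_models(models, solved_proteins={"protein_3"})
print(selected)
```

Comparing `(model_score, mpqs)` tuples implements the tie-break in one step, because Python compares tuples element by element.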
(nutraceuticals excluded)
Multi-target therapy may be more effective than single-target therapy to treat infectious diseases. Most of the proteins listed are potential novel drug targets for the development of efficient anti-tuberculosis chemotherapeutics. GSMN-TB: Genome Scale Metabolic Reaction Network of M.tb (http://sysbio/sbs.surrey.ac.uk/tb): 849 reactions, 739 metabolites, 726 genes. Can optimize the model for in vivo growth. Carry out multiple gene inhibition and compute the maximal theoretical growth rate (if close to zero, that combination of genes is essential for growth).
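The multiple gene inhibition screen described here amounts to enumerating gene combinations and flagging those whose maximal growth rate is close to zero. The sketch below shows only that outer loop. The growth function is a toy stand-in, not GSMN-TB or any real metabolic model, and every name in it is hypothetical; in practice `max_growth` would wrap an optimization over the metabolic network.

```python
from itertools import combinations


def essential_combinations(genes, max_growth, size=2, tol=1e-6):
    """Return gene sets of the given size whose joint inhibition abolishes growth."""
    return [
        combo
        for combo in combinations(sorted(genes), size)
        if max_growth(frozenset(combo)) <= tol
    ]


def toy_max_growth(inhibited):
    """Toy stand-in for optimizing a metabolic model with some genes inhibited."""
    # gene_c is always required; gene_a and gene_b are redundant routes to growth.
    if "gene_c" in inhibited:
        return 0.0
    routes = [g for g in ("gene_a", "gene_b") if g not in inhibited]
    return 0.5 * len(routes)


pairs = essential_combinations(
    {"gene_a", "gene_b", "gene_c", "gene_d"}, toy_max_growth
)
print(pairs)
```

In this toy case the pair (gene_a, gene_b) is the interesting one: neither gene is essential alone, but inhibiting both abolishes growth. Pairs that contain gene_c are flagged only because gene_c is essential by itself, so a real screen would typically filter out combinations that include a singly essential gene.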