4. The Life of One Scientist – The Early Years
So That You Might Not Make the Same Mistakes
• My high school teacher Mr. Wilson said I would be a failure at chemistry
• My PhD is in chemistry
• Good friends are forever
• The opportunity to live in different places shaped my life
5/3/12 UCSD BILD 94 4
5. 40+ Years Later
Ten Simple Rules for Starting a Company
PLoS Comp Biol 2012 8(3): e1002439
11. Some Things Stay with You Your Whole Life
12. Senior Scientist HHMI Columbia University New York
• Driven not by career but by wanting to live in New York City
13. ~1990 Got Involved with the Human Genome
• Was only possible by applying computers to problems in biology
• Developed algorithms to support physical and genetic mapping of Chr 13
14. Came to UCSD to Apply Computers to Big Biological Problems
• Possibly the best place in the world to do computational biology
16. The Protein Kinase Family
• A large family important to signal transduction in eukaryotes and many bacteria.
• Phosphotransferases: transfer a phosphate group from ATP to a Ser/Thr or Tyr residue on a target protein, producing a range of downstream signaling effects.
• PKA: an example of a typical protein kinase (TPK) fold, shown in “open book” format
17. Sometimes Ya Got to Just Do It Yourself
18. The Growth of Data is a Major Driver in Biology
[Figure: number of released entries by year]
20. Big Research Questions in the Lab
1. Can we improve how science is disseminated and comprehended?
2. What is the ancestry of the protein structure universe and what can we learn from it?
3. Are there alternative ways to represent proteins from which we can learn something new?
4. What really happens when we take a drug?
5. Can we contribute to the treatment of neglected {tropical} diseases?
August 14, 2009
22. Nature’s Reductionism
There are ~20^300 possible proteins >>>> all the atoms in the Universe
11.2M protein sequences from 10,854 species (source RefSeq)
38,221 protein structures yield 1195 domain folds (SCOP 1.75)
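As a rough check of the scale on this slide, the sketch below compares the number of possible 300-residue sequences over 20 amino acids with the number of atoms in the observable universe. The figure of about 10^80 atoms is a commonly quoted order-of-magnitude estimate and is an assumption here, not a number from the talk.

```python
import math

AMINO_ACIDS = 20               # standard amino acid alphabet
CHAIN_LENGTH = 300             # residues, as implied by 20^300 on the slide
ATOMS_IN_UNIVERSE_LOG10 = 80   # assumed order-of-magnitude estimate


def log10_sequence_space(alphabet: int, length: int) -> float:
    """Return log10 of alphabet**length without building the huge integer."""
    return length * math.log10(alphabet)


sequence_space_log10 = log10_sequence_space(AMINO_ACIDS, CHAIN_LENGTH)
excess_log10 = sequence_space_log10 - ATOMS_IN_UNIVERSE_LOG10

print(f"possible sequences: about 10^{sequence_space_log10:.0f}")
print(f"excess over atoms:  about 10^{excess_log10:.0f}")
```

Working in log10 keeps the comparison readable; Python could also evaluate `20**300` exactly, since its integers are arbitrary precision.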
23. Initial Question:
With the current coverage of proteomes
by structure and assuming we know a
high percentage of all folds, is structure
a useful discriminator of species?
24. Chapter 2 Initial Findings
Russ Doolittle, Professor, Center for Molecular Genetics, UCSD
Song Yang, Post Doc, UC Berkeley; Department of Chemistry and Biochemistry, UCSD
Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
25. To Answer this Question We Only Need to Make Use of Existing Resources
• SCOP – Further catalogs Nature’s reductionism into structural domains, folds, families and superfamilies
• SUPERFAMILY assigns the above to fully sequenced proteomes
26. The SCOP Hierarchy v1.75
Based on 38221 Structures
Classes 7
Folds 1195
Superfamilies 1962
Families 3902
Domains 110800
27. Is Structure a Useful Discriminator of Species? - Maybe…
Distribution among the three kingdoms as taken from SUPERFAMILY
• Superfamily distributions would seem to be related to the complexity of life
• Update of the work of Caetano-Anollés and Caetano-Anollés (2003) Genome Research 13:1563
[Figure: Venn diagram of SCOP folds (765 total) shared among Eukaryota (650), Archaea (416) and Bacteria (564); each region is labeled with counts for any genome / all genomes]
28. Method – Distance Determination
Presence/Absence Data Matrix (rows: SCOP SUPERFAMILY (FSF); columns: organisms)
FSF      C. intestinalis  C. briggsae  F. rubripes
a.1.1    1                1            1
a.1.2    1                1            1
a.10.1   0                0            1
a.100.1  1                1            1
a.101.1  0                0            0
a.102.1  0                1            1
a.102.2  1                1            1
Distance Matrix
                 C. intestinalis  C. briggsae  F. rubripes
C. intestinalis  0                101          109
C. briggsae                       0            144
F. rubripes                                    0
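The slide does not say how the distances are derived from the presence/absence matrix. As one plausible reading, the sketch below counts the FSFs present in exactly one of two proteomes (a Hamming distance). Both the choice of metric and the function name `hamming_distances` are assumptions for illustration; only the seven-row excerpt from the slide is used, so the values are far smaller than the 101, 109 and 144 of the full matrix.

```python
from itertools import combinations

ORGANISMS = ["C. intestinalis", "C. briggsae", "F. rubripes"]

# Excerpt of the presence/absence matrix: FSF -> one flag per organism
PRESENCE = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}


def hamming_distances(organisms, presence):
    """For each pair of organisms, count FSFs present in exactly one of them."""
    distances = {}
    for (i, a), (j, b) in combinations(enumerate(organisms), 2):
        distances[(a, b)] = sum(row[i] != row[j] for row in presence.values())
    return distances


distances = hamming_distances(ORGANISMS, PRESENCE)
for pair, value in distances.items():
    print(pair, value)
```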
29. Is Structure a Useful Discriminator of Species? - Yes
[Figure: tree with species grouped into Archaea, Bacteria and Eukaryota]
The method cleanly placed all species in their correct superkingdoms
30. The Answer Would Appear to be Yes
• It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome
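To illustrate how a distance matrix can be turned into a tree, here is a minimal average-linkage (UPGMA) clustering sketch applied to the three-organism distance matrix from slide 28. UPGMA is used only because it is short; the slides do not say which tree-building method was used, and the function name `upgma` is hypothetical.

```python
from itertools import combinations


def upgma(labels, dist):
    """Average-linkage clustering.

    dist maps frozenset({a, b}) -> distance. Returns a nested tuple tree.
    """
    clusters = {label: 1 for label in labels}  # tree -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


LABELS = ["C. intestinalis", "C. briggsae", "F. rubripes"]
DIST = {
    frozenset(("C. intestinalis", "C. briggsae")): 101,
    frozenset(("C. intestinalis", "F. rubripes")): 109,
    frozenset(("C. briggsae", "F. rubripes")): 144,
}

tree = upgma(LABELS, DIST)
print(tree)
```

With only three organisms the tree simply joins the closest pair first; a real analysis would run over the full set of proteomes.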
31. Environmental Influence
Chris Dupont
Scripps Institution of Oceanography
UCSD
Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827
32. Consider the Distribution of Disulfide Bonds among Folds
• Disulfides are only stable under oxidizing conditions
• Oxygen content gradually accumulated during the Earth’s evolution
• The divergence of the three kingdoms occurred 1.8-2.2 billion years ago
• Oxygen began to accumulate ~2.0 billion years ago
• Logical deduction – disulfides more prevalent in folds (organisms) that evolved later
• This would seem to hold true
• Can we take this further?
[Figure: Venn diagram of SCOP folds (708 total) among Eukaryota, Archaea and Bacteria; regions are labeled with the fraction of folds containing disulfides: 31.9% (43/135), 0% (0/10), 14.4% (17/118), 4.7% (18/387), 0% (0/2), 5.9% (1/17), 16.7% (7/42)]
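The percentages on the Venn diagram can be recomputed from the fold counts printed next to them. The short sketch below does so; the counts are copied from the slide in the order they appear, and no assignment of a count to a particular Venn region is assumed.

```python
# (folds with disulfides, total folds) pairs as printed on the slide
REGION_COUNTS = [(43, 135), (0, 10), (17, 118), (18, 387), (0, 2), (1, 17), (7, 42)]


def percent(with_disulfide: int, total: int) -> float:
    """Percentage of folds containing disulfides, rounded to one decimal."""
    return round(100 * with_disulfide / total, 1)


percentages = [percent(n, total) for n, total in REGION_COUNTS]
print(percentages)
```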
33. Evolution of the Earth
• 4.5 billion years of change
• 300 ± 50 K
• 1-5 atmospheres
• Constant photoenergy
• Chemical and geological changes
• Life has evolved in this time
• The ocean was the “cradle” for 90% of evolution
34. Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History
• Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean: solid lines, euxinic ocean: dashed lines).
• The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom.
[Figure: oxygen, zinc, iron, cobalt and manganese concentrations (O2 in arbitrary units, Zn and Fe in mol L^-1) plotted against billions of years before present (4.5 to 0), with diversification periods for Bacteria, Archaea and Eukarya marked at the top]
Replotted from Saito et al, 2003 Inorganica Chimica Acta 356: 308-318
36. Hypothesis
• Emergence of cyanobacteria changed oxygen concentrations
• Impacted metal concentrations in the ocean
• Organisms used new metals in new ways to evolve new biological processes, e.g., complex signaling
• This in turn further impacted the environment
37. Big Research Questions in the Lab
1. Can we improve how science is disseminated and comprehended?
2. What is the ancestry of the protein structure universe and what can we learn from it?
3. Are there alternative ways to represent proteins from which we can learn something new?
4. What really happens when we take a drug?
5. Can we contribute to the treatment of neglected {tropical} diseases?
August 14, 2009
38. Our Motivation
• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive
Collins and Workman 2006 Nature Chemical Biology 2 689-700
39. Our Broad Approach
• Involves the fields of:
– Structural bioinformatics
– Cheminformatics
– Biophysics
– Systems biology
– Pharmaceutical chemistry
• L. Xie, L. Xie, S.L. Kinnings and P.E. Bourne 2012 Novel Computational Approaches to Polypharmacology as a Means to Define Responses to Individual Drugs, Annual Review of Pharmacology and Toxicology 52: 361-379
• L. Xie, S.L. Kinnings, L. Xie and P.E. Bourne 2012 Predicting the Polypharmacology of Drugs: Identifying New Uses Through Bioinformatics and Cheminformatics Approaches, in Drug Repurposing, M. Barrett and D. Frail (Eds.) Wiley and Sons. (available upon request)
40. Approach - Need to Start with a 3D Drug-Receptor Complex – Either Experimental or Modeled
Name          Other Name          Treatment                            PDB id
Lipitor       Atorvastatin        High cholesterol                     1HWK, 1HW8…
Testosterone  Testosterone        Osteoporosis                         1AFS, 1I9J ..
Taxol         Paclitaxel          Cancer                               1JFF, 2HXF, 2HXH
Viagra        Sildenafil citrate  ED, pulmonary arterial hypertension  1TBF, 1UDT, 1XOS..
Digoxin       Lanoxin             Congestive heart failure             1IGJ
41. A Reverse Engineering Approach to Drug Discovery Across Gene Families
1. Characterize ligand binding site of primary target (Geometric Potential)
2. Identify off-targets by ligand binding site similarity (Sequence order independent profile-profile alignment)
3. Extract known drugs or inhibitors of the primary and/or off-targets
4. Search for similar small molecules …
5. Dock molecules to both primary and off-targets
6. Statistical analysis of docking score correlations
Xie and Bourne 2009 Bioinformatics 25(12) i305-i312
42. Characterization of the Ligand Binding Site - The Geometric Potential
Conceptually similar to hydrophobicity or electrostatic potential that is dependent on both global and local environments
• Initially assign each Cα atom a value that is the distance to the environmental boundary
• Update the value with those of surrounding Cα atoms dependent on distances and orientation – atoms within a 10 Å radius define the neighbors i
$$GP = P + \sum_{\text{neighbors}} \frac{P_i}{D_i + 1.0} \times \frac{\cos(\alpha_i) + 1.0}{2.0}$$
Xie and Bourne 2007 BMC Bioinformatics, 8(Suppl 4):S9
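The geometric potential update can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the formula GP = P + Σ P_i/(D_i + 1.0) × (cos(α_i) + 1.0)/2.0 summed over neighbors within the 10 Å radius, and it takes each neighbor's boundary distance P_i, separation D_i and angle α_i as precomputed inputs. The function name and the example values are hypothetical.

```python
import math
from typing import Iterable, Tuple

NEIGHBOR_RADIUS = 10.0  # Å, cutoff stated on the slide


def geometric_potential(p: float,
                        neighbors: Iterable[Tuple[float, float, float]]) -> float:
    """Return P plus the distance- and orientation-weighted neighbor terms.

    neighbors yields (P_i, D_i, alpha_i): the neighbor's boundary distance,
    its distance from the central atom in Å, and the angle alpha_i in radians.
    """
    total = p
    for p_i, d_i, alpha_i in neighbors:
        if d_i <= NEIGHBOR_RADIUS:  # atoms beyond the cutoff are ignored
            total += p_i / (d_i + 1.0) * (math.cos(alpha_i) + 1.0) / 2.0
    return total


# Hypothetical atom with three neighbors; the third lies outside the cutoff
example = geometric_potential(
    2.0, [(4.0, 1.0, 0.0), (3.0, 2.0, math.pi), (5.0, 12.0, 0.0)]
)
print(example)
```

A neighbor directly aligned with the reference direction (α_i = 0) contributes its full distance-weighted value, while one pointing the opposite way (α_i = π) contributes nothing.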
43. Discrimination Power of the Geometric Potential
• Geometric potential can distinguish binding and non-binding sites
[Figure: distributions of binding site and non-binding site residue clusters over the geometric potential scale (0 to 100)]
44. Local Sequence-order Independent Alignment with Maximum-Weight Sub-Graph Algorithm
Xie and Bourne 2008 PNAS, 105(14) 5441
[Figure: Structure A and Structure B, with the residue fragments LER and VKDL matched between them]
• Build an associated graph from the graph representations of the two structures being compared. Each node is assigned a weight from the similarity matrix
• The maximum-weight clique corresponds to the optimum alignment of the two structures
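The idea of aligning two structures through a maximum-weight clique can be sketched on a toy example. The code below is an illustration under stated assumptions, not the published algorithm: a node pairs residue i of structure A with residue j of structure B, two nodes are connected when the corresponding inter-residue distances agree within a tolerance `tol`, and the clique is found by exhaustive search, which is only practical for very small graphs. The coordinates, residue labels and function names are hypothetical.

```python
import math
from itertools import combinations


def association_graph(coords_a, coords_b, weight, tol=1.0):
    """Nodes pair residue i of A with residue j of B; edges join compatible pairs."""
    nodes = {}
    for i in range(len(coords_a)):
        for j in range(len(coords_b)):
            w = weight(i, j)
            if w > 0:
                nodes[(i, j)] = w
    edges = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # each residue may be matched only once
        gap = abs(math.dist(coords_a[i], coords_a[k])
                  - math.dist(coords_b[j], coords_b[l]))
        if gap <= tol:
            edges[(i, j)].add((k, l))
            edges[(k, l)].add((i, j))
    return nodes, edges


def max_weight_clique(nodes, edges):
    """Exhaustively grow cliques and keep the one with the largest total weight."""
    best = (0.0, [])

    def grow(clique, total, candidates):
        nonlocal best
        if total > best[0]:
            best = (total, clique)
        for idx, n in enumerate(candidates):
            rest = [m for m in candidates[idx + 1:] if m in edges[n]]
            grow(clique + [n], total + nodes[n], rest)

    grow([], 0.0, list(nodes))
    return best


# Same three residues and geometry, listed in opposite sequence order
seq_a, coords_a = "LER", [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
seq_b, coords_b = "REL", [(0, 4, 0), (3, 0, 0), (0, 0, 0)]

nodes, edges = association_graph(
    coords_a, coords_b, lambda i, j: 1.0 if seq_a[i] == seq_b[j] else 0.0
)
score, alignment = max_weight_clique(nodes, edges)
print(score, alignment)
```

Because compatibility depends only on inter-residue distances, the match is found even though the residues appear in reverse order in the second structure, which is the sense in which the alignment is sequence-order independent.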
45. Similarity Matrix of Alignment
Chemical Similarity
• Amino acid grouping: (LVIMC), (AGSTP), (FYW), and (EDNQKRH)
• Amino acid chemical similarity matrix
Evolutionary Correlation
• Amino acid substitution matrix such as BLOSUM45
• Similarity score between two sequence profiles
$$d = \sum_{i=1}^{20} \left( f_a^i S_b^i + f_b^i S_a^i \right)$$
f_a, f_b are the 20 amino acid target frequencies of profile a and b, respectively
S_a, S_b are the PSSM of profile a and b, respectively
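The profile-profile similarity score can be computed directly from the two frequency vectors and the two PSSM columns. The sketch below assumes the score is d = Σ_i (f_a^i S_b^i + f_b^i S_a^i), summed over the amino acid alphabet; the function name is hypothetical, and the toy example uses a two-letter alphabet purely to keep the arithmetic visible.

```python
from typing import Sequence


def profile_similarity(f_a: Sequence[float], s_a: Sequence[float],
                       f_b: Sequence[float], s_b: Sequence[float]) -> float:
    """Score two profile columns: sum over i of f_a[i]*s_b[i] + f_b[i]*s_a[i]."""
    if not (len(f_a) == len(s_a) == len(f_b) == len(s_b)):
        raise ValueError("all four vectors must cover the same alphabet")
    return sum(fa * sb + fb * sa
               for fa, sa, fb, sb in zip(f_a, s_a, f_b, s_b))


# Toy two-letter alphabet; real profiles have 20 entries per column
d = profile_similarity(
    f_a=[0.5, 0.5], s_a=[1.0, -1.0],
    f_b=[1.0, 0.0], s_b=[2.0, -2.0],
)
print(d)
```

The score is symmetric in a and b: each profile's frequencies are scored against the other profile's PSSM.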
46. The Problem with Tuberculosis
• One third of global population infected
• 1.7 million deaths per year
• 95% of deaths in developing countries
• Anti-TB drugs hardly changed in 40 years
• MDR-TB and XDR-TB pose a threat to
human health worldwide
• Development of novel, effective and
inexpensive drugs is an urgent priority
47. The TB-Drugome
1. Determine the TB structural proteome
2. Determine all known drug binding sites
from the PDB
3. Determine which of the sites found in 2
exist in 1
4. Call the result the TB-drugome
Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
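48. 1. Determine the TB Structural Proteome
[Figure: of the 3,996 proteins in the TB proteome, 284 have solved structures, 1,446 have high quality homology models and the remaining 2,266 have neither]
• High quality homology models from ModBase (http://modbase.compbio.ucsf.edu) increase structural coverage from 7.1% to 43.3%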
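The coverage figures follow from the counts on this slide, as this short check shows. The variable names are illustrative.

```python
PROTEOME_SIZE = 3996   # proteins in the TB proteome
SOLVED = 284           # proteins with solved structures in the PDB
MODELED = 1446         # further proteins with high quality homology models


def coverage(covered: int, total: int) -> float:
    """Structural coverage as a percentage, rounded to one decimal."""
    return round(100 * covered / total, 1)


before = coverage(SOLVED, PROTEOME_SIZE)
after = coverage(SOLVED + MODELED, PROTEOME_SIZE)
uncovered = PROTEOME_SIZE - SOLVED - MODELED

print(before, after, uncovered)
```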
49. 2. Determine all Known Drug Binding Sites in the PDB
• Searched the PDB for protein crystal structures bound with FDA-approved drugs
• 268 drugs bound in a total of 931 binding sites
[Figure: histogram of No. of drugs (0 to 140) against No. of drug binding sites (1 to 37); labeled drugs include Acarbose, Darunavir, Alitretinoin, Conjugated estrogens, Chenodiol and Methotrexate]
50. Map 2 onto 1 – The TB-Drugome
http://funsite.sdsc.edu/drugome/TB/
Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).
Tuberculosis, which is caused by the bacterial pathogen Mycobacterium tuberculosis, is a leading cause of mortality among the infectious diseases. The World Health Organization (WHO) estimates that almost one-third of the world's population, around 2 billion people, is infected. Every year, more than 8 million people develop an active form of the disease, which claims the lives of nearly 2 million. This translates to over 4,900 deaths per day, and more than 95% of these are in developing countries.

Despite the current global situation, antitubercular drugs have remained largely unchanged over the last four decades. The widespread use of these agents has provided a strong selective pressure for M.tuberculosis, thus encouraging the emergence of resistant strains. Multidrug resistant (MDR) tuberculosis is defined as resistance to the first-line drugs isoniazid and rifampin. The effective treatment of MDR tuberculosis necessitates long-term use of second-line drug combinations, an unfortunate consequence of which is the emergence of further drug resistance. Enter extensively drug resistant (XDR) tuberculosis: M.tuberculosis strains that are resistant to both isoniazid and rifampin, as well as key second-line drugs. Since the only remaining drug classes exhibit such low potency and high toxicity, XDR tuberculosis is extremely difficult to treat. The rise of XDR tuberculosis around the world poses a great threat to human health, making the development of new antitubercular agents an urgent priority.

Very few Mtb proteins explored as drug targets
• 3,996 proteins in TB proteome
• 749 solved structures in the PDB, representing a total of 284 proteins (7.1% coverage)
• ModBase contains homology models for entire TB proteome
• 1,446 ‘high quality’ homology models were added to the data set
• Structural coverage increased to 43.3%
• Retained only those models with a model score of > 0.7 and a ModPipe quality score of > 1.1 (2818 models).
• There were multiple models per protein. For each TB protein, chose the model with the best model score, and if they were equal, chose the model with the best ModPipe quality score (1703 models).
• However, 251 (+6) models were removed since they correspond to TB proteins that already have solved structures (1446 models remained).
• Model score: a score for the reliability of a model, derived from statistical potentials (F. Melo, R. Sanchez, A. Sali, 2001). A model is predicted to be good when the model score is higher than a pre-specified cutoff (0.7). A reliable model has a probability of the correct fold that is larger than 95%. A fold is correct when at least 30% of its Cα atoms superpose within 3.5 Å of their correct positions.
• The ModPipe Protein Quality Score (MPQS) is a composite score comprising sequence identity to the template, coverage, and the three individual scores e-value, z-DOPE and GA341. We consider an MPQS of > 1.1 as reliable.
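The model selection procedure described in these notes can be sketched as a small filter. This is an illustration only: the cutoffs (model score > 0.7, MPQS > 1.1) and the tie-breaking rule come from the notes, while the `Model` record, the function name `select_models` and the protein identifiers are hypothetical and do not reflect ModBase's actual data format.

```python
from typing import Dict, List, NamedTuple, Set


class Model(NamedTuple):
    protein: str        # TB protein identifier (hypothetical IDs below)
    model_score: float  # reliability score, cutoff 0.7
    mpqs: float         # ModPipe Protein Quality Score, cutoff 1.1


def select_models(models: List[Model], solved: Set[str],
                  min_score: float = 0.7,
                  min_mpqs: float = 1.1) -> Dict[str, Model]:
    """Keep one reliable model per protein that lacks a solved structure."""
    best: Dict[str, Model] = {}
    for m in models:
        if m.model_score <= min_score or m.mpqs <= min_mpqs:
            continue  # fails at least one quality cutoff
        if m.protein in solved:
            continue  # a solved structure already covers this protein
        current = best.get(m.protein)
        # Prefer the higher model score; break ties on MPQS
        if current is None or (m.model_score, m.mpqs) > (current.model_score,
                                                         current.mpqs):
            best[m.protein] = m
    return best


candidates = [
    Model("P1", 0.9, 1.2), Model("P1", 1.0, 1.15),  # higher model score wins
    Model("P2", 0.6, 2.0),                          # model score too low
    Model("P3", 0.95, 1.3),                         # protein already solved
    Model("P4", 0.8, 1.2), Model("P4", 0.8, 1.4),   # tie broken on MPQS
]
selected = select_models(candidates, solved={"P3"})
print(selected)
```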