Protein structure prediction methods for drug design

Thomas Lengauer and Ralf Zimmer
Date received (in revised form): 4th July 2000

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000

Abstract
Along the long path from genomic data to a new drug, the knowledge of three-dimensional protein structure can be of significant help in several places. This paper points out such places, discusses the virtues of protein structure knowledge and reviews bioinformatics methods for gaining such knowledge.

Keywords: protein structure prediction, protein target, protein–ligand docking

Thomas Lengauer was a professor of Computer Science at the University of Paderborn, before he joined GMD, the German National Research Centre for Information Technology, in 1992 as Director of the Institute for Algorithms and Scientific Computing. Jointly, he is Professor of Computer Science at the University of Bonn. His research interests include computational biology and bioinformatics, computational chemistry and combinatorial optimisation problems in technological applications.

Ralf Zimmer is a research scientist at GMD. He directs the research group on algorithmic structural genomics. His research interests include algorithms and statistical methods for genomics, proteomics, protein sequence and structure analysis, and target finding, as well as connections between molecular biology and computing (DNA computing).

Thomas Lengauer,
Institute for Algorithms and Scientific Computing (SCAI),
GMD – National Research Center for Information Technology,
Sankt Augustin,
Germany D53754.
Tel: +49 2241 14 2776/2777
Fax: +49 2241 14 2656
E-mail: lengauer@gmd.de

INTRODUCTION
The long path from genomic data to a new drug can conceptually be divided into two parts (see left side of Figure 1). The first task is to select a target protein whose molecular function is to be moderated, in many cases blocked, by a drug molecule binding to it. Given the target protein, the second task is to select a suitable drug that binds to the protein tightly, is easy to synthesise, is bio-accessible and has no adverse effects such as toxicity. The knowledge of the three-dimensional structure of a protein can be of significant help in both phases. The steric and physicochemical complementarity of the binding site of the protein and the drug molecule is an important, if not the dominating, feature of strong binding. Thus, in many cases, the knowledge of the protein structure affords well-founded hypotheses of the function of the protein. If the structure of the relevant binding site of the protein is known in detail, we can even start to employ structure-based methods in order to develop a drug binding tightly to the protein.

In this paper, bioinformatics methods for predicting aspects of protein structure are described and their use towards the goal of drug design is discussed. The possibilities and limitations of using protein structure knowledge towards the goal of developing new drug therapies are also discussed.

NOTIONS OF PROTEIN FUNCTION
The increased accessibility of genomic data and, especially, that of large-scale expression data has opened new possibilities for the search for target proteins. This development has prompted large-scale investments into the new technology by many pharmaceutical companies. The respective screening experiments rely critically on appropriate bioinformatics support for interpreting the generated data. Specifically, methods are required to identify interesting differentially expressed genes and to predict the function and structure of putative target proteins from differential expression data generated in an appropriate screening experiment.

Protein function is a colourful notion whose meaning can range over several levels:

• a very general classification (globular, enzyme, hormone, structural protein, viral capsid protein, transmembrane protein, etc.);

• biochemical function (biochemical reaction, enzyme specificity, binding partners, cofactors);

• classification via broad cellular function (interaction with DNA and other proteins, cellular localisation);
• broad phenotypic function (changes observed for organisms with deleted or mutated genes);

• identification of detailed physiological function such as the localisation in a metabolic or regulatory pathway and the associated cellular role of the protein;

• identification of molecular binding partners and their mode of interaction with the protein.

[Figure 1: Diagram of the path from genome/organism/disease to a drug. Stages on the left side: SEARCH (target protein search), IDENTIFY (target protein function), MODEL (target protein structure), DESIGN (drug lead search), ending in a target lead structure / drug. Labelled information sources: structure families, para-/analogs, evolutionary information, expression, phenotype, genotype, SNPs, linkage, mutations; structure, sequence, fusion, co-evolution, co-expression, motifs; assay/screening; rational drug design, ligand design, computer docking, HTS, combinatorial libraries, trial and error.]

The derivation of protein function from protein sequence by theoretical means is commonly performed by transferring functional information from related proteins (eg from other organisms). Usually the transfer is from proteins whose function has been established with experimental evidence. The establishment of the relevant protein relationship based on sequence is complicated by some subtleties of evolutionary processes. Though it is often true that organisms share related proteins with similar sequence, similar structure and the same function simply because they originate from a common ancestor and still fulfil their role within the cellular processes, mutations occur independently after speciation events. Depending on the extent of the evolutionary changes, the recognition of homology or orthology among proteins can be difficult, but even in these cases consistent evidence for relatedness should be expected on the sequence, structure and function levels.

Sometimes, the situation is complicated by gene duplications within a species, leading to paralogous copies of the same gene. These paralogous copies are subject to evolutionary changes, and the evolutionary pressure on structure or function is much relaxed for all but one copy, which still serves the original purpose, such that greater deviations in sequence, structure and function occur for the other copies. As considerable, ie significantly more than random, sequence similarity can still be observed among paralogous proteins, this confuses the picture and leads to erroneous transfer of functions to proteins that are already functionally disabled or functionally completely different.
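The notion of sequence similarity used above can be made concrete with a small calculation. The following is a minimal Python sketch, not a method from this paper: it assumes that the two sequences have already been aligned by some alignment program, that gaps are written as `-`, and it uses percent identity as one simple, illustrative similarity measure. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where both sequences have a residue.

    Assumption: the inputs are two rows of an existing pairwise alignment,
    so they must have the same length. Columns containing a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # compared columns where the residues match
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Hypothetical toy alignment: 6 comparable columns, 5 of them identical.
    print(round(percent_identity("ACDE-FG", "ACDQYFG"), 1))
```

A value computed this way is only a number on the sequence level. Whether two proteins with a given identity are orthologous, paralogous or unrelated still has to be judged with additional evidence, as discussed in the text.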
Therefore, in the following, we have to distinguish between three notions. Similarity is a quantitative measure on the sequence, structure or function level. Homology is used when there is a clear established or potential (assumed, predicted) evolutionary relationship between proteins. The term orthology, in addition, indicates homologous proteins with (established or potential) the same or at least similar function. The notion of paralogy, in contrast, is used when homologous proteins are expected to have evolved enough to expect changes in function (with or without a change in 3D structure).

For drug design, we need to know more of the function of the protein than follows from just a general classification. It would be best both to know natural binding partners and to have a detailed structural model of the binding sites of the protein.

METHODS OF PREDICTING PROTEIN FUNCTION
There are a number of ways to predict protein function from sequence. Most of them are based on sequence similarity. A large database of protein sequences is screened for ‘model sequences’ that exhibit a high level of similarity to the query protein sequence. Sequence alignment tools such as BLAST [1] and PSI-BLAST [2] are the work-horses of such analyses. If one or more model sequences are found that exhibit a sufficiently high level of similarity to the query sequence and about whose function we have some knowledge, then conclusions may be possible on the function of the query sequence. If the sequence similarity is above, say, 40 per cent and functionally important motifs are conserved, then we can hypothesise that the query sequence has a function that is quite similar to that of the model sequence. As the level of similarity decreases, the conclusions on function that can be drawn from sequence similarity become less and less reliable. Classifications of proteins into families that form clusters of structurally or functionally related proteins are helpful in the prediction of protein function in these cases. There are several protein classifications available on the internet that can serve for this purpose:

• COGs [3,4]

• ProDom [5]

• Pfam [6]

• SMART [7,8]

• PRINTS [9]

• Blocks+ [10]

• ProtoMap [11,12]

A number of these databases (Pfam, PROSITE, PRINTS, ProDom, SWISSPROT+TREMBL) are currently being united in the InterPro [13] database. Since protein function is basically tied to protein domains, protein domain analysis is an integral part of the methodology that leads to protein family databases [14–22].

Since only 20–40 per cent of the protein sequences in genomes such as those of Mycoplasma genitalium, Methanococcus jannaschii and Mycobacterium tuberculosis have significant sequence similarity to proteins of known function [23,24], we need to be able to draw conclusions on the function of proteins that exhibit no significant sequence similarity to suitable model proteins. As the similarity between query sequence and model sequence decreases below a threshold of, say, 25 per cent, safe conclusions on a common evolutionary origin of the query sequence and the model sequence can no longer be made. However, it turns out that, in many cases, the protein fold can still be reliably predicted, and in several cases even detailed structural models of protein binding sites can be generated. Thus, especially in this similarity range, protein structure prediction – again together with the identification of conserved sequence
or spatial motifs – can help to ascertain aspects of protein function.

Other sources of information beside sequence similarity have been explored in order to gain insight into protein function. These methods are represented by five arrows pointing downwards in the top right part of Figure 1. The following comments on these methods apply in the order from left to right:

• Sequence alignment has long been used for ascertaining protein function. This is the standard method and we commented on it above. This approach is only reliable if the sequence similarity is high enough to argue that the proteins are orthologous, given that we know the function of one of the proteins.

• Recently, the Rosetta stone method has been introduced. This method uses over 20 completely sequenced genomes and analyses evolutionary correlations of two domains being fused into one protein in one species and occurring in separate proteins in another species. From these classifications the method establishes pairwise links between functionally related proteins [25] and elicits putative protein–protein interactions [26].

• For the same purpose, the phylogenetic profile method analyses the co-occurrence of genes in the genomes of different organisms [27].

• The analysis of change of phenotype based on mutated genes (eg by knock-out experiments) yields important information on aspects of protein function [28–30].

• In the future, the analysis of genetic variations [31] among individuals, eg single nucleotide polymorphisms (SNPs) [32–34], will be helpful in ascertaining protein function beyond mere disease linkage or association (right arrow in Figure 1) [35–38].

None of these methods looks at protein structures, and thus we do not discuss them in more detail here. While these methods are reported to generate significant insight into protein function on a higher level and to point to putative target proteins [39], in the end, drug design can be expected to necessitate structural knowledge of either the target protein or its binding partners.

METHODS FOR PREDICTING PROTEIN STRUCTURE
In the authors’ view, computational methods for predicting protein structure from sequence alone are still well out of reach, although there are recent methodical advances – sometimes called mini-threading – that are based on the assembly of fragments (see eg ROSETTA [40]). In contrast, modelling protein structures after folds that have been seen before has become quite a powerful method for protein structure prediction. Here, the query sequence is aligned (threaded) to a model sequence whose three-dimensional structure is known (the template protein). All proteins in a given protein structure database (usually an appropriate representative set of structures) are tried, and each template is ranked using heuristic scoring functions. The score reflects the likelihood that the query sequence assumes the template structure. The approach of modelling a protein structure after a known template is called homology-based modelling, and the selection of a suitable template protein is often done via protein threading.

Protein threading has three major objectives: first, to provide orthogonal evidence of possible homology for distantly related protein sequences; second, to detect possible homology in cases where sequence methods fail; and third, to improve structural models for the query sequence via structurally more accurate alignments.

There are several successful protein threading methods, including:

• methods based on hidden Markov models [41–48];
• dynamic programming methods based on profiles [49–51];

• environment compatibility (ie contact capacity potentials as used in the protein threader 123D) [52].

These programs are very fast. A mid-size protein sequence can be threaded against a database of about 1,500 protein structures in a few minutes on a PC or workstation. However, the underlying methods assume that the assignment of chemical properties to spatial regions in the protein is the same in the query protein and the template protein. This is not the case, in practice, especially if one compares proteins with partly different folds or different functions. Extensions of the homology-based modelling approach to proteins with very similar protein structures but different chemical make-up require the solution of algorithmically provably hard problems and thus necessitate much more computing time. There are:

• heuristic approaches based on distance-based pair potentials of mean force [53–56];

• optimal or approximate combinatorial tree search techniques [57–59].

Such approaches need hours to thread a protein through a database of 1,500 templates. However, they can yield more accurate alignments and models of binding pockets of proteins.

The process of protein threading selects a suitable template protein for a protein query sequence and computes an alignment of the backbone of the two proteins that is the starting point for generating a structural model for the query protein based on the structure of the template protein. What is left is to place the side chains of the query protein and to model the loops of the query protein that are not modelled by the template structure. These two tasks are performed by homology-based modelling tools such as:

• Modeller [60–64] and ModBase [65];

• Swiss-Model [66,67];

• or commercial versions included in Quanta (MSI) or Sybyl (Tripos, Inc.).

For protein side-chain modelling there are two contrasting approaches based on knowledge deduced from structural databases and methods such as energy minimisation and molecular dynamics [68], respectively. Methods based on side-chain rotamer libraries that have been created via the analysis of the protein structure database are usually employed to get a first model. Energy minimisation or molecular dynamics [69] is often used to refine the model. Such methods have been in use for crystallography/nuclear magnetic resonance (NMR) for many years and are available in several program packages and tools (Charmm [70], GROMOS/GROMACS [71,72] and many others [73,74]). In general these methods are quite computer-intensive and can only be exercised on one or a few proteins. Generally, the backbone alignment is an input to homology-based modelling tools and the quality of the derived models is highly sensitive to the accuracy of the provided alignments.

Loops are modelled by a related host of methods. Loops that involve more than about five residues are still hard to model [75–78].

The evaluation of the accuracy of assigning a protein fold (general protein architecture) to a query sequence is commonly based on generally accepted fold classifications such as SCOP [79] or CATH [80]. The quality of backbone alignments is much harder to rate, and no generally accepted scheme is available, as of today [81–84]. Rating the quality of protein structure models is generally based on the root mean square (rms) deviation of the model and the actual structure on a selected set of residues. The problem here is that the model must be superposed with the actual structure. There are several tools that perform this
task – DALI/FSSP [85,86], SSAP [87], VAST [88], PROSUP [89] or SARF [90] – and they can yield different results. Thus, there is no accepted gold standard for protein structure superposition. However, for the purpose of rating the structures of target proteins, the available superposition methods are sufficient.

PERFORMANCE OF PROTEIN STRUCTURE PREDICTION METHODS
There are strong efforts to render the quality of protein structure prediction methods more transparent and easier to evaluate. The centre of these efforts is the biennial CASP experiment, which rates protein structure prediction methods on blind predictions and aims at developing standardised and generally agreed-upon assessment procedures for fold identification, for the evaluation of alignment accuracy and for homology models. A blind prediction is a prediction of the three-dimensional structure for a protein sequence made at a time when the actual structure of the protein is not (yet) known. After the structure has been resolved, the prediction is compared with the actual structure. There have been three issues of the CASP experiment [91]; the fourth one follows this year. The CASP experiment has been a significant help in providing a more solid basis for assessing the power of different protein structure prediction methods.

For fold recognition, detectable progress has been observed from CASP1 to CASP2. In CASP3, performance similar to that in CASP2 was achieved on more difficult targets. There appears to be a certain limit of current fold recognition methods, which is still well below the limit of detectable structural similarity (via structural comparisons). In addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab initio target fragments.

In CASP3, of 43 protein targets, 15 could be classified as comparative homology modelling targets, ie related folds and accompanying alignments could be derived beyond doubt. For more than half of the 21 more difficult cases, reasonable models could be predicted by at least one of the participating prediction teams. In addition, the CAFASP subsection of the assessment has demonstrated that 10 out of 19 folds could be solved via completely automatic application of the best threading methods without any manual intervention.

Methods for refining rough structural models towards the true native structure of the query protein are also not straightforward. This is an active area of research [92].

A combination of protein threading followed by homology-based modelling cannot create genuinely novel protein structures. But it turns out to be quite sensitive in creating structure models based on known folds. Models that have been reasonably accurate (eg down to 1.4 Å for some 60 amino acids of the active site of herpes virus thymidine kinase [93]) have been reported in blind studies of proteins with a sequence identity to the template protein of as low as 10 per cent. Correct folds can be assigned in many cases, even if the query sequence and the suitable template exhibit a very low level of sequence similarity (down to 5 per cent, ie far below the level of random sequence similarity of 17–18 per cent in optimal alignments).

STRUCTURAL GENOMICS
The goal of structural genomics projects is to solve experimental structures of all major classes of protein folds systematically, independent of any functional interest in the proteins [94,95]. The aim is to chart the protein structure space efficiently; functional annotations and/or assignments are made afterwards. This calls for a thoroughly thought-out strategy of mixing experimental protein structure determination, eg via X-ray, with computer-based protein structure prediction. The experiments have to yield novel protein structures. The proteins to be resolved experimentally are again
selected by computer. The computer part deduces the remaining structures based on homology-based modelling and protein threading. One goal of the overall structural genomics endeavour is to have an experimentally resolved protein structure within a certain structural distance to any possible protein sequence, which allows for computing reliable models for all protein sequences.

Once a map of the protein structure space is available, this knowledge should provide additional insights on what the function of the protein in the cell is and with what other partners it might interact. Such information should add to information gained from high-throughput screening and biological assays. So far, glimpses of what will be possible could be obtained by analysing complete genomes or large sets of proteins from expression experiments with the structural knowledge available today, ie more or less complete representative sets and a quite coarse coverage of structure space [63,96,97].

METHODS FOR PREDICTING PROTEIN FUNCTION FROM PROTEIN STRUCTURE
Aspects of protein structure that are useful for drug design studies typically have to involve three-dimensional structure. Predicting the secondary structure of the protein is not sufficient. Even the similarity of the three-dimensional structures of two proteins cannot be taken as an indication for a similar function of these proteins. The reason is that protein structure is conserved much more than protein function. Indeed, protein folds such as the TIM barrel (triose-phosphate isomerase) are quite ubiquitous and can be considered as general scaffolds that lend molecular stability to the protein and are not directly tied to its function. In contrast, the molecular function of the protein is tied to local structural characteristics pertaining to binding pockets on the protein surface. These characteristics are imprinted onto the protein structure by specific patterns of amino acid side chains that make up the binding pocket. The conservation of these amino acids is what makes two proteins have the same function. Since nature varies sequence quite flexibly, this level of conservation is only maintained among orthologous proteins that exhibit a high level of sequence similarity.

Thus, if the template protein from which we predict protein structure is not orthologous to the query protein, other methods of function prediction have to come to bear. It is quite natural to consider conservation patterns in the protein sequence here, such as exhibited in databases containing functional sequence motifs such as PROSITE. An alternative that has been investigated more recently is to analyse conservation in 3D space [98]. Experience shows that such ‘structural’ motifs provide more information than motifs derived purely from sequence, even if the sequence motifs are distributed over several regions (BLOCKS+, PRINTS). Recently, the notion of an approximate structural motif has been introduced – sometimes called fuzzy functional form (FFF) [99]. Using a library of approximate structural motifs enhances the range of applicability of motif search at the price of reduced sensitivity and specificity. Such approaches are supported by the fact that, often, binding sites of proteins are much more conserved than the overall protein structure (eg bacterial and eukaryotic serine proteases), such that an inexact model can have an accurately modelled part responsible for function. As the structural genomics projects produce a more and more complete picture of the protein structure space, comprehensive libraries of highly discriminative structural motifs can be expected.

The relationship between structure and function is a true many-to-many relation. Recent studies have shown that particular functions could be mounted onto several different protein folds [100] and, conversely, several protein fold classes can
perform a wide range of functions [101]. This limits our potential of deducing function from structure. But knowledge on which folds support a given function and which functions are based on a given fold can still help in predicting function from structure. In addition, local structural templates such as FFFs indicative for a particular function can identify similar sites and the associated function despite a globally different fold. Such 3D patterns can also discriminate among globally similar folds with respect to containing particular conserved 3D functional motifs in order to classify them into different functional categories.

Though it is not easy to derive functions from resolved protein structures, the availability of structural information improves the chances compared with relying on sequence methods alone.

METHODS FOR DEVELOPING DRUGS BASED ON PROTEIN STRUCTURE
The object of drug design is to find or develop a drug molecule, in most cases a small one, that binds tightly to the target protein, moderating (often blocking) its function or competing with natural substrates of the protein. Such a drug can be best found on the basis of knowledge of the protein structure. If the spatial shape of the site of the protein to which the drug is supposed to bind is known, then docking methods can be applied to select suitable lead compounds that have the potential of being refined to drugs. The speed of a docking method determines whether the method can be employed for screening compound databases in the search for drug leads. A docking method that takes a minute per instance can be used to screen up to thousands of compounds on a PC or hundreds of thousands of drugs on a suitable parallel computer. Docking methods that take the better part of an hour cannot be suitably employed for such large-scale screening purposes. In order to screen really large drug databases with several hundred thousand compounds, docking methods that can handle single protein/drug pairs within seconds are needed.

The high conformational flexibility of small molecules as well as the subtle structural changes in the protein binding pocket upon docking (induced fit) are major complications in docking. Furthermore, docking necessitates careful analysis of the binding energy. The energy model is cast into the so-called scoring function that rates the protein–ligand complex energetically. Challenges in the energy model include the handling of entropic contributions and solvation effects, and the computation of long-range forces in fast docking methods.

The state of the art in docking can be summarised as follows (see also Table 1). Handling the structural flexibility of the drug molecule can be done within the regime up to about a minute per molecular complex on a PC (see, eg, Kramer et al. [102]). A suitable analysis of the structural changes in the protein still necessitates more computing time. Today, tools that are able to dock a molecule to a protein within seconds are still based on rigid-body docking (the conformational flexibility of both the protein and the ligand is omitted).

Table 1: Taxonomy of docking methods

Runtime on a PC                          Fraction of a second   About a minute   An hour or longer
Flexibility of the drug molecule         -                      X                X
Flexibility of the protein binding site  -                      -                X
Energy model                             None                   Short-range      Force field

Recently, fast docking tools have been adapted to screening combinatorial drug
libraries (see, eg, Rarey and Lengauer¹⁰³). Such libraries provide a carefully selected set of molecular building blocks together with a small set of chemical reactions that link the modules. In this way, a combinatorial library can theoretically provide a diversity of up to billions of molecules from a small set of reactants.

The accuracy of docking predictions lies between 50 and 80 per cent ‘correct’ predictions, depending on the evaluation measure and the method. Docking methods are thus far from perfectly accurate. Nevertheless, they are very useful in pharmaceutical practice. The major benefit of docking is that a large drug library can be ranked with respect to the potential that its molecules have for being a useful lead compound for the target protein in question. The quality of a method in this context can be measured by an enrichment factor. Roughly, this is the number of active compounds (drugs that bind tightly to the protein) in a top fraction (say the top 1 per cent) of the ranked drug database, divided by the same figure in the randomly arranged drug database. State-of-the-art docking methods in the middle regime (minutes per molecular pair), eg FlexX,¹⁰⁴ achieve enrichment factors of up to about 15. Fast methods (seconds per pair), eg FeatureTrees,¹⁰⁵ achieve similar enrichment factors, but deliver molecules similar to known binding ligands and do not detect as diverse a range of binding molecules.

Even if the structure of the protein binding site is not known, computer-based methods can be used to select promising lead compounds. Such methods compare the structure of a molecule with that of a ligand that is known to bind to the protein, for instance, its natural substrate.

Alternatives to docking for lead finding include high-throughput screening (HTS). This laboratory method allows the binding affinity of several thousand compounds or more to the same target protein to be tested in a day. In comparison, this method has the advantage that it does not have to deal with insufficiently powerful computer models, at the expense of high laboratory cost and the absence of structural knowledge on ‘why’ a compound binds to the protein.

CONCLUSION
In summary, the field is still in an early stage of development. Ab initio protein structure prediction continues to be a grand challenge for which no comprehensive solution is in sight. The quality of fold prediction based on homology is rising, and the tools have reached the stage where one can generate confident predictions for soluble proteins that, in a substantial fraction (about half) of the cases, provide significant threading hits in the structure database. Protein threading and homology-based prediction become especially helpful in an environment where the methods can be used in concert with experimental techniques for structure and function determination. Here, the prediction methods can exercise their strengths, which lie in being used interactively by experts and making suggestions that can be followed up by succeeding experimentation, rather than being required to provide proven fact. The process of going from structure to function is far from being automated. In a scenario that combines structure prediction methods with experimentation, the step from structure to function can be performed in a customised manner.

Protein structure prediction by homology is definitely not yet a turn-key technology. But we can expect it to enter the ‘production’ stage through the activities in structural genomics. Still, the field of protein structure prediction is very busy, generating the tools and processes for raising the number of confident structure predictions and the accompanying estimates of significance. Problems for applying these results in drug design are not only that the models may not be sufficiently accurate but also that the structures of many interesting
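The enrichment factor described in the docking discussion on this page can be made concrete with a small calculation. The following Python sketch is illustrative only and is not taken from FlexX or FeatureTrees: the function name `enrichment_factor`, its arguments and the toy screening result are assumptions introduced here. It takes a list of activity labels ordered by docking score, best first, and compares the number of actives found in the top fraction with the number expected there if the database were randomly arranged.

```python
from typing import Sequence


def enrichment_factor(ranked_is_active: Sequence[bool],
                      top_fraction: float = 0.01) -> float:
    """Enrichment factor of a ranked compound database (hypothetical helper).

    ranked_is_active[i] is True if the compound at rank i (best score first)
    is a known active, ie binds tightly to the target protein.
    """
    n_total = len(ranked_is_active)
    if n_total == 0:
        raise ValueError("the ranked database is empty")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must lie in (0, 1]")

    actives_total = sum(ranked_is_active)
    if actives_total == 0:
        raise ValueError("the database contains no active compounds")

    # Size of the top fraction, at least one compound.
    n_top = max(1, round(n_total * top_fraction))
    actives_top = sum(ranked_is_active[:n_top])

    # In a randomly arranged database the top fraction is expected to hold
    # actives_total * n_top / n_total actives; the enrichment factor is the
    # observed count divided by this expectation.
    return (actives_top * n_total) / (actives_total * n_top)


if __name__ == "__main__":
    # Toy data (assumed): 1,000 compounds, 20 actives, 3 of them ranked
    # within the top 1 per cent (the first 10 positions).
    ranked = [True] * 3 + [False] * 7 + [True] * 17 + [False] * 973
    print(enrichment_factor(ranked))  # 15.0
```

In this toy example, 3 of the 20 actives fall into the top 1 per cent of a 1,000-compound database, where 0.2 would be expected by chance, which gives an enrichment factor of 15. A value of 1 corresponds to a ranking that is no better than random.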
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000 283
target proteins will not be accessible by homology-based modelling at all for some time to come. This includes the therapeutically particularly interesting class of membrane proteins, for which essentially no structures have been resolved.

Docking is used frequently in structure-based drug design. To the authors’ knowledge, the first drug developed with structure-based techniques was the carbonic anhydrase inhibitor Dorzolamide. In the past few years structural considerations have begun to pervade the design of new drugs. A case in point is that of the neuraminidase inhibitors for influenza. Such studies mostly involve experimentally resolved protein structures. However, even models can serve to guide drug development. Based on the experimentally resolved structure of the membrane protein bacteriorhodopsin, several groups are attempting to model binding sites of G-protein coupled receptors that are believed to be structurally similar. Nevertheless, the authors are not aware of any instance where the whole process line from the protein sequence to the lead structure has been exercised in an integrated manner and with significant help of computer predictions. The field has not reached this level of maturity yet. While structural aspects – even as predicted by the computer – can be expected to invade the search for target proteins and the development of new drugs, experimental data, where they are accessible, will always be highly welcome and often be indispensable in this process.

Acknowledgements
We thank Matthias Rarey for helpful comments on this paper and Gerhard Barnickel and Gerhard Klebe for information on the state of drugs developed by structure-based techniques.

References
1. Altschul, S. F., Gish, W., Miller, W. et al. (1990), ‘Basic local alignment search tool’, J. Mol. Biol., Vol. 215(3), pp. 403–410. http://ncbi.nlm.nih.gov/BLAST/
2. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25(17), pp. 3389–3402. http://ncbi.nlm.nih.gov/blast/psiblast.cgi
3. Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000), ‘The COG database: a tool for genome-scale analysis of protein functions and evolution’, Nucleic Acids Res., Vol. 28(1), pp. 33–36.
4. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), ‘A genomic perspective on protein families’, Science, Vol. 278(5338), pp. 631–637.
5. Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000), ‘ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons’, Nucleic Acids Res., Vol. 28(1), pp. 267–269.
6. Bateman, A., Birney, E., Durbin, R. et al. (2000), ‘The Pfam protein families database’, Nucleic Acids Res., Vol. 28(1), pp. 263–266.
7. Schultz, J., Milpetz, F., Bork, P. and Ponting, C. P. (1998), ‘SMART, a simple modular architecture research tool: identification of signaling domains’, Proc. Natl Acad. Sci. USA, Vol. 95(11), pp. 5857–5864.
8. Schultz, J., Copley, R. R., Doerks, T. et al. (2000), ‘SMART: a web-based tool for the study of genetically mobile domains’, Nucleic Acids Res., Vol. 28(1), pp. 231–234.
9. Attwood, T. K., Croning, M. D., Flower, D. R. et al. (2000), ‘PRINTS-S: the database formerly known as PRINTS’, Nucleic Acids Res., Vol. 28(1), pp. 225–227.
10. Henikoff, S., Henikoff, J. G. and Pietrokovski, S. (1999), ‘Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations’, Bioinformatics, Vol. 15(6), pp. 471–479.
11. Yona, G., Linial, N. and Linial, M. (2000), ‘ProtoMap: automatic classification of protein sequences and hierarchy of protein families’, Nucleic Acids Res., Vol. 28(1), pp. 49–55.
12. Yona, G., Linial, N. and Linial, M. (1999), ‘ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space’, Proteins, Vol. 37(3), pp. 360–378.
13. http://www.ebi.ac.uk/interpro/
14. Rose, G. D. (1979), ‘Hierarchic organization of domains in globular proteins’, J. Mol. Biol., Vol. 134(3), pp. 447–470.
15. Nichols, W. L., Rose, G. D., Ten Eyck, L. F. and Zimm, B. H. (1995), ‘Rigid domains in proteins: an algorithmic approach to
their identification’, Proteins, Vol. 23(1), pp. 38–48.
16. Gracy, J. and Argos, P. (1998), ‘Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities’, Bioinformatics, Vol. 14(2), pp. 174–187.
17. Gracy, J. and Argos, P. (1998), ‘DOMO: a new database of aligned protein domains’, Trends Biochem. Sci., Vol. 23(12), pp. 495–497.
18. Sowdhamini, R., Rufino, S. D. and Blundell, T. L. (1996), ‘A database of globular protein structural domains: clustering of representative family members into similar folds’, Fold Des., Vol. 1(3), pp. 209–220.
19. Jones, S., Stewart, M., Michie, A. et al. (1998), ‘Domain assignment for protein structures using a consensus approach: characterization and analysis’, Protein Sci., Vol. 7(2), pp. 233–242.
20. Orengo, C. A., Martin, A. M., Hutchinson, G. et al. (1998), ‘Classifying a protein in the CATH database of domain structures’, Acta Crystallogr. D Biol. Crystallogr., Vol. 54(1(Pt 6)), pp. 1155–1167.
21. Murzin, A. G. (1996), ‘Structural classification of proteins: new superfamilies’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 386–394.
22. Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995), ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247(4), pp. 536–540.
23. Fischer, D. and Eisenberg, D. (1999), ‘Predicting structures for genome proteins’, Curr. Opin. Struct. Biol., Vol. 9(2), pp. 208–211.
24. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), ‘Homology-based fold predictions for Mycoplasma genitalium proteins’, J. Mol. Biol., Vol. 280(3), pp. 323–326.
25. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), ‘A combined algorithm for genome-wide prediction of protein function’, Nature, Vol. 402(6757), pp. 83–86.
26. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W. et al. (1999), ‘Detecting protein function and protein-protein interactions from genome sequences’, Science, Vol. 285(5428), pp. 751–753.
27. Pellegrini, M., Marcotte, E. M., Thompson, M. J. et al. (1999), ‘Assigning protein functions by comparative genome analysis: protein phylogenetic profiles’, Proc. Natl Acad. Sci. USA, Vol. 96(8), pp. 4285–4288.
28. Bork, P., Dandekar, T., Diaz-Lazcoz, Y. et al. (1998), ‘Predicting function: from genes to genomes and back’, J. Mol. Biol., Vol. 283(4), pp. 707–725.
29. Roemer, K., Johnson, P. A. and Friedmann, T. (1991), ‘Knock-in and knock-out: Transgenes, Development and Disease: A Keystone Symposium sponsored by Genentech and Immunex, Tamarron, CO, USA, January 12–18 1991’, New Biol., Vol. 3(4), pp. 331–335.
30. Sato, T. N. (1999), ‘Gene trap, gene knockout, gene knock-in, and transgenics in vascular development’, Thromb. Haemost., Vol. 82(2), pp. 865–869.
31. Collins, F. S., Guyer, M. S. and Chakravarti, A. (1997), ‘Variations on a theme: cataloging human DNA sequence variation’, Science, Vol. 278(5343), pp. 1580–1581.
32. Brookes, A. J. (1999), ‘The essence of SNPs’, Gene, Vol. 234(2), pp. 177–186.
33. Kuska, B. (1999), ‘Snipping “SNPs”: a new tool for mining gene variations’, J. Natl Cancer Inst., Vol. 91(13), p. 1110.
34. Vilain, E. (1998), ‘CYPs, SNPs, and molecular diagnosis in the postgenomic era’, Clin. Chem., Vol. 44(12), pp. 2403–2404.
35. Collins, F. S. (1999), ‘Shattuck lecture – medical and societal consequences of the Human Genome Project’, N. Engl. J. Med., Vol. 341(1), pp. 28–37.
36. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research II. Issues in study design and gene mapping’, Ann. Epidemiol., Vol. 9(2), pp. 75–90.
37. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research III. Bioinformatics and statistical genetic methods’, Ann. Epidemiol., Vol. 9(4), pp. 207–224.
38. Terwilliger, J. D. and Ott, J. (1994), ‘Handbook of Human Genetic Linkage’, Johns Hopkins University Press, Baltimore.
39. Drews, J. (1996), ‘Genomic sciences and the medicine of tomorrow’, Nat. Biotechnol., Vol. 14(11), pp. 1516–1518.
40. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), ‘Ab initio protein structure prediction of CASP III targets using ROSETTA’, Proteins, Vol. 37(S3), pp. 171–176.
41. Karchin, R. and Hughey, R. (1998), ‘Weighting hidden Markov models for maximum discrimination’, Bioinformatics, Vol. 14(9), pp. 772–782.
42. Bateman, A., Birney, E., Durbin, R. et al. (1999), ‘Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins’, Nucleic Acids Res., Vol. 27(1), pp. 260–262.
43. Park, J., Karplus, K., Barrett, C. et al. (1998), ‘Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods’, J. Mol. Biol., Vol. 284(4), pp. 1201–1210.
44. Barrett, C., Hughey, R. and Karplus, K. (1997), ‘Scoring hidden Markov models’, Comput. Appl. Biosci., Vol. 13(2), pp. 191–199.
45. McClure, M. A., Smith, C. and Elton, P. (1996), ‘Parameterization studies for the SAM and HMMER methods of hidden Markov model generation’, ‘Proc. 4th International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 155–164.
46. Eddy, S. R. (1998), ‘Profile hidden Markov models’, Bioinformatics, Vol. 14(9), pp. 755–763.
47. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998), ‘Pfam: multiple sequence alignments and HMM-profiles of protein domains’, Nucleic Acids Res., Vol. 26(1), pp. 320–322.
48. Eddy, S. R. (1996), ‘Hidden Markov models’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 361–365. (1995), ‘Proc. 3rd International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 114–120.
49. Bowie, J. U., Luthy, R. and Eisenberg, D. (1991), ‘A method to identify protein sequences that fold into a known three-dimensional structure’, Science, Vol. 253(5016), pp. 164–170.
50. Luthy, R., Bowie, J. U. and Eisenberg, D. (1992), ‘Assessment of protein models with three-dimensional profiles’, Nature, Vol. 356(6364), pp. 83–85.
51. Luthy, R., Xenarios, I. and Bucher, P. (1994), ‘Improving the sensitivity of the sequence profile method’, Protein Sci., Vol. 3(1), pp. 139–146.
52. Alexandrov, N. N., Nussinov, R. and Zimmer, R. M. (1996), ‘Fast protein fold recognition via sequence to structure alignment and contact capacity potentials’, Pacific Symposium on Biocomputing, pp. 53–72.
53. Sippl, M. J. (1990), ‘Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins’, J. Mol. Biol., Vol. 213(4), pp. 859–883.
54. Hendlich, M., Lackner, P., Weitckus, S. et al. (1990), ‘Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force’, J. Mol. Biol., Vol. 216(1), pp. 167–180.
55. Sippl, M. J. (1995), ‘Knowledge-based potentials for proteins’, Curr. Opin. Struct. Biol., Vol. 5(2), pp. 229–235.
56. Sippl, M. J. and Flockner, H. (1996), ‘Threading thrills and threats’, Structure, Vol. 4(1), pp. 15–19.
57. Lathrop, R. H. and Smith, T. F. (1996), ‘Global optimum protein threading with gapped alignment and empirical pair score functions’, J. Mol. Biol., Vol. 255(4), pp. 641–665.
58. Thiele, R., Zimmer, R. and Lengauer, T. (1999), ‘Protein threading by recursive dynamic programming’, J. Mol. Biol., Vol. 290(3), pp. 757–779.
59. Xu, Y., Xu, D. and Uberbacher, E. C. (1998), ‘An efficient computational method for globally optimal threading’, J. Comput. Biol., Vol. 5(3), pp. 597–614.
60. Sali, A. (1995), ‘Modeling mutations and homologous proteins’, Curr. Opin. Biotechnol., Vol. 6(4), pp. 437–451.
61. Sali, A., Potterton, L., Yuan, F. et al. (1995), ‘Evaluation of comparative protein modeling by MODELLER’, Proteins, Vol. 23(3), pp. 318–326.
62. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.
63. Sanchez, R. and Sali, A. (1998), ‘Large-scale protein structure modeling of the Saccharomyces cerevisiae genome’, Proc. Natl Acad. Sci. USA, Vol. 95(23), pp. 13597–13602.
64. Sanchez, R. and Sali, A. (1997), ‘Evaluation of comparative protein structure modeling by MODELLER-3’, Proteins, Suppl 1, pp. 50–58.
65. Sanchez, R., Pieper, U., Mirkovic, N. et al. (2000), ‘MODBASE, a database of annotated comparative protein structure models’, Nucleic Acids Res., Vol. 28(1), pp. 250–253.
66. Guex, N., Diemand, A. and Peitsch, M. C. (1999), ‘Protein modelling for all’, Trends Biochem. Sci., Vol. 24(9), pp. 364–367.
67. Guex, N. and Peitsch, M. C. (1997), ‘SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling’, Electrophoresis, Vol. 18(15), pp. 2714–2723.
68. Petrella, R. J., Lazaridis, T. and Karplus, M. (1998), ‘Protein sidechain conformer
prediction: a test of the energy function’, Fold Des., Vol. 3(5), pp. 353–377.
69. Karplus, M. and Petsko, G. A. (1990), ‘Molecular dynamics simulations in biology’, Nature, Vol. 347(6294), pp. 631–639.
70. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D. et al. (1983), ‘CHARMM: A program for macromolecular energy, minimization, and dynamics calculation’, J. Comp. Chem., Vol. 4, pp. 187–213.
71. Van Gunsteren, W. F. and Berendsen, H. J. (1982), ‘Molecular dynamics: perspective for complex systems’, Biochem. Soc. Trans., Vol. 10(5), pp. 301–305.
72. Van Gunsteren, W. F. and Berendsen, H. J. (1990), ‘Moleküldynamik-Computersimulationen: Methodik, Anwendungen und Perspektiven in der Chemie’, Angew. Chem., Vol. 102, pp. 1020–1055.
73. Levitt, M. (1983), ‘Protein folding by restrained energy minimization and molecular dynamics’, J. Mol. Biol., Vol. 170(3), pp. 723–764.
74. Novotny, J., Bruccoleri, R. and Karplus, M. (1984), ‘An analysis of incorrectly folded protein models. Implications for structure predictions’, J. Mol. Biol., Vol. 177(4), pp. 787–818.
75. van Vlijmen, H. W. and Karplus, M. (1997), ‘PDB-based protein loop prediction: parameters for selection and methods for optimization’, J. Mol. Biol., Vol. 267(4), pp. 975–1001.
76. Lessel, U. and Schomburg, D. (1997), ‘Creation and characterization of a new, non-redundant fragment data bank’, Protein Eng., Vol. 10(6), pp. 659–664.
77. Lessel, U. and Schomburg, D. (1999), ‘Importance of anchor group positioning in protein loop prediction’, Proteins, Vol. 37(1), pp. 56–64.
78. Fechteler, T., Dengler, U. and Schomburg, D. (1995), ‘Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria’, J. Mol. Biol., Vol. 253(1), pp. 114–131.
79. Lo Conte, L., Ailey, B., Hubbard, T. J. et al. (2000), ‘SCOP: a structural classification of proteins database’, Nucleic Acids Res., Vol. 28(1), pp. 257–259.
80. Orengo, C. A., Michie, A. D., Jones, S. et al. (1997), ‘CATH – a hierarchic classification of protein domain structures’, Structure, Vol. 5(8), pp. 1093–1108.
81. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘Measures of threading specificity and accuracy’, Proteins, Suppl 1, pp. 74–82.
82. Marchler-Bauer, A., Levitt, M. and Bryant, S. H. (1997), ‘A retrospective analysis of CASP2 threading predictions’, Proteins, Suppl 1, pp. 83–91.
83. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘A measure of success in fold recognition’, Trends Biochem. Sci., Vol. 22(7), pp. 236–240.
84. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
85. Holm, L. and Sander, C. (1998), ‘Dictionary of recurrent domains in protein structures’, Proteins, Vol. 33(1), pp. 88–96.
86. Holm, L. and Sander, C. (1998), ‘Touring protein fold space with Dali/FSSP’, Nucleic Acids Res., Vol. 26(1), pp. 316–319.
87. Orengo, C. A. and Taylor, W. R. (1996), ‘SSAP: sequential structure alignment program for protein structure comparison’, Methods Enzymol., Vol. 266, pp. 617–635.
88. Gibrat, J. F., Madej, T. and Bryant, S. H. (1996), ‘Surprising similarities in structure comparison’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 377–385.
89. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
90. Alexandrov, N. N. (1996), ‘SARFing the PDB’, Protein Eng., Vol. 9(9), pp. 727–732.
91. Lattman, E. E. (ed.) (1999), ‘Third Meeting on the Critical Assessment of Techniques for Protein Structure Prediction’, Proteins, Vol. 37, Suppl. 3.
92. Kolinski, A., Rotkiewicz, P., Ilkowski, B. and Skolnick, J. (1999), ‘A method for the improvement of threading-based protein models’, Proteins, Vol. 37(4), pp. 592–610.
93. Zimmer, R. and Thiele, R. (1997), ‘Fast protein fold recognition and accurate sequence–structure alignment’, in ‘German Conference on Bioinformatics, GCB ’96’, Hofestädt, R., Lengauer, T., Löffler, M. and Schomburg, D. Eds, Springer, Berlin, pp. 137–148.
94. Kim, S. H. (1998), ‘Shining a light on structural genomics’, Nat. Struct. Biol., Vol. 5 Suppl., pp. 643–645.
95. Montelione, G. T. and Anderson, S. (1999), ‘Structural genomics: keystone for a Human Proteome Project’, Nat. Struct. Biol., Vol. 6(1), pp. 11–12.
96. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.