Protein structure prediction methods for drug design




Thomas Lengauer and Ralf Zimmer
Date received (in revised form): 4th July 2000

Abstract
Along the long path from genomic data to a new drug, the knowledge of three-dimensional protein structure can be of significant help in several places. This paper points out such places, discusses the virtues of protein structure knowledge and reviews bioinformatics methods for gaining such knowledge.

Keywords: protein structure prediction, protein target, protein–ligand docking

Thomas Lengauer was a professor of Computer Science at the University of Paderborn before he joined GMD, the German National Research Centre for Information Technology, in 1992 as Director of the Institute for Algorithms and Scientific Computing. He is also Professor of Computer Science at the University of Bonn. His research interests include computational biology and bioinformatics, computational chemistry and combinatorial optimisation problems in technological applications.

Ralf Zimmer is a research scientist at GMD. He directs the research group on algorithmic structural genomics. His research interests include algorithms and statistical methods for genomics, proteomics, protein sequence and structure analysis, and target finding, as well as connections between molecular biology and computing (DNA computing).

Thomas Lengauer,
Institute for Algorithms and Scientific Computing (SCAI),
GMD – National Research Center for Information Technology,
Sankt Augustin,
Germany D53754.
Tel: +49 2241 14 2776/2777
Fax: +49 2241 14 2656
E-mail: lengauer@gmd.de

INTRODUCTION
The long path from genomic data to a new drug can conceptually be divided into two parts (see left side of Figure 1). The first task is to select a target protein whose molecular function is to be moderated, in many cases blocked, by a drug molecule binding to it. Given the target protein, the second task is to select a suitable drug that binds to the protein tightly, is easy to synthesise, is bio-accessible and has no adverse effects such as toxicity. The knowledge of the three-dimensional structure of a protein can be of significant help in both phases. The steric and physicochemical complementarity of the binding site of the protein and the drug molecule is an important, if not the dominating, feature of strong binding. Thus, in many cases, the knowledge of the protein structure affords well-founded hypotheses of the function of the protein. If the structure of the relevant binding site of the protein is known in detail, we can even start to employ structure-based methods in order to develop a drug binding tightly to the protein.
In this paper bioinformatics methods for predicting aspects of the protein structure are described and their use towards the goal of drug design is discussed. The possibilities and limitations of using protein structure knowledge towards the goal of developing new drug therapies are also discussed.

NOTIONS OF PROTEIN FUNCTION
The increased accessibility of genomic data and, especially, that of large-scale expression data has opened new possibilities for the search for target proteins. This development has prompted large-scale investments into the new technology by many pharmaceutical companies. The respective screening experiments rely critically on appropriate bioinformatics support for interpreting the generated data. Specifically, methods are required to identify interesting differentially expressed genes and to predict the function and structure of putative target proteins from differential expression data generated in an appropriate screening experiment.
Protein function is a colourful notion whose meaning can range over several levels:
• a very general classification (globular, enzyme, hormone, structural protein, viral capsid protein, transmembrane protein, etc.);
• biochemical function (biochemical reaction, enzyme specificity, binding partners, cofactors);
• classification via broad cellular function (interaction with DNA and other proteins, cellular localisation);

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000







[Figure 1: flow chart leading from Genome/Organism/Disease to Target Lead Structure / Drug. The left side is divided into Target Protein Search and Drug Lead Search. SEARCH (Structure; Families, Para-/Analogs; Evolutionary Information; Expression; Phenotype, Genotype; SNPs, Linkage, Mutations) leads to the Target Protein. IDENTIFY (Structure, Sequence, Fusion, Co-Evolution, Co-Expression, Motifs) leads to the Target Protein Function. MODEL (Structure) leads to the Target Protein Structure, with Assay/Screening alongside. DESIGN (Rational Drug Design: Ligand Design, Computer HTS, Docking, Combinatorial Libraries; also HTS and Trial & Error) leads to the Target Lead Structure / Drug.]


• broad phenotypic function (changes observed for organisms with deleted or mutated genes);
• identification of detailed physiological function such as the localisation in a metabolic or regulatory pathway and the associated cellular role of the protein;
• identification of molecular binding partners and their mode of interaction with the protein.

The derivation of protein function from protein sequence by theoretical means is commonly performed by transferring functional information from related proteins (eg from other organisms). Usually the transfer is from proteins whose function has been established with experimental evidence. The establishment of the relevant protein relationship based on sequence is complicated by some subtleties of evolutionary processes.
Though it is often true that organisms share related proteins with similar sequence, similar structure and the same function simply because they originate from a common ancestor and they still fulfil their role within the cellular processes, mutations occur independently after speciation events. Depending on the extent of the evolutionary changes, the recognition of homology or orthology among proteins can be difficult, but even in these cases consistent evidence for relatedness should be expected on the sequence, structure and function levels.
Sometimes, the situation is complicated by gene duplications within a species, which lead to paralogous copies of the same gene. These paralogous copies are subject to evolutionary changes, and the evolutionary pressure on structure or function is much relaxed for all but one copy, which still serves the original purpose, so that greater deviations in sequence, structure and function occur for these copies. Because considerable sequence similarity (ie significantly more than random) can still be observed among paralogous proteins, this can lead to the erroneous transfer of functions to proteins that are already functionally disabled or that have a completely different function.
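The sequence similarity referred to above can be quantified in several ways. The following is a minimal sketch, assuming the two sequences have already been aligned (for example by an alignment tool) and taking percent identity over ungapped alignment columns as the measure. The function name and the choice of denominator are illustrative assumptions; other conventions for the denominator are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0   # columns without a gap in either sequence
    identical = 0  # compared columns with the same residue
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical toy alignment: 8 ungapped columns, 7 of them identical.
print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```

A high value on its own does not distinguish orthologous from paralogous proteins, which is exactly the difficulty described in the text.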




Therefore, in the following, we have to distinguish between three notions. Similarity is a quantitative measure on the sequence, structure or function level. Homology is used when there is a clear established or potential (assumed, predicted) evolutionary relationship between proteins. The term orthology, in addition, indicates homologous proteins with the same or at least similar (established or potential) function. The notion of paralogy, in contrast, is used when homologous proteins are expected to have evolved enough to expect changes in function (with or without a change in 3D structure).
For drug design, we need to know more of the function of the protein than follows from just a general classification. It would be best both to know natural binding partners and to have a detailed structural model of the binding sites of the protein.

METHODS OF PREDICTING PROTEIN FUNCTION
There are a number of ways to predict protein function from sequence. Most of them are based on sequence similarity. A large database of protein sequences is screened for ‘model sequences’ that exhibit a high level of similarity to the query protein sequence. Sequence alignment tools such as BLAST [1] and PSI-BLAST [2] are the work-horses of such analyses. If one or more model sequences are found that exhibit a sufficiently high level of similarity to the query sequence and about whose function we have some knowledge, then conclusions may be possible on the function of the query sequence. If the similarity is above, say, 40 per cent and functionally important motifs are conserved, then we can hypothesise that the query sequence has a function that is quite similar to that of the model sequence. As the level of similarity decreases, the conclusions on function that can be drawn from sequence similarity become less and less reliable. Classifications of proteins into families that form clusters of structurally or functionally related proteins are helpful in the prediction of protein function in these cases. There are several protein classifications available on the internet that can serve for this purpose:
• COGs [3,4]
• ProDom [5]
• Pfam [6]
• SMART [7,8]
• PRINTS [9]
• Blocks+ [10]
• ProtoMap [11,12]

A number of these databases (Pfam, PROSITE, PRINTS, ProDom, SWISS-PROT+TrEMBL) are currently being united in the InterPro [13] database. Since protein function is basically tied to protein domains, protein domain analysis is an integral part of the methodology that leads to protein family databases [14–22].
Since only 20–40 per cent of the protein sequences in genomes such as those of Mycoplasma genitalium, Methanococcus jannaschii and Mycobacterium tuberculosis have significant sequence similarity to proteins of known function [23,24], we need to be able to make conclusions on the function of proteins that exhibit no significant sequence similarity to suitable model proteins. As the similarity between query sequence and model sequence decreases below a threshold of, say, 25 per cent, safe conclusions on a common evolutionary origin of the query sequence and the model sequence can no longer be made. However, it turns out that, in many cases, the protein fold can still be reliably predicted, and in several cases even detailed structural models of protein binding sites can be generated. Thus, especially in this similarity range, protein structure prediction – again together with the identification of conserved sequence





or spatial motifs – can help to ascertain aspects of protein function.
Other sources of information beside sequence similarity have been explored in order to gain insight into protein function. These methods are represented by five arrows pointing downwards in the top right part of Figure 1. The following comments on these methods apply in the order from left to right:
• Sequence alignment has long been used for ascertaining protein function. This is the standard method and we commented on it above. This approach is only reliable if there is high sequence similarity such that we can argue about orthologous proteins, since we know the function of one of the proteins.
• Recently, the Rosetta stone method has been introduced. This method uses over 20 completely sequenced genomes and analyses evolutionary correlations of two domains being fused into one protein in one species and occurring in separate proteins in another species. From these classifications the method establishes pairwise links between functionally related proteins [25] and elicits putative protein–protein interactions [26].
• For the same purpose, the phylogenetic profile method analyses the co-occurrence of genes in the genomes of different organisms [27].
• The analysis of change of phenotype based on mutated genes (eg by knock-out experiments) yields important information on aspects of protein function [28–30].
• In the future, the analysis of genetic variations [31] among individuals, eg single nucleotide polymorphisms (SNPs) [32–34], will be helpful in ascertaining protein function beyond mere disease linkage or association (right arrow in Figure 1) [35–38].

None of these methods looks at protein structures, and thus we do not discuss them in more detail here. While these methods are reported to generate significant insight into protein function on a higher level and to point to putative target proteins [39], in the end, drug design can be expected to necessitate structural knowledge of either the target protein or its binding partners.

METHODS FOR PREDICTING PROTEIN STRUCTURE
In the authors’ view, computational methods for predicting protein structure from sequence alone are still well out of range, although there are recent methodological advances, sometimes called mini-threading, that are based on the assembly of fragments (see eg ROSETTA [40]). In contrast, modelling protein structures after folds that have been seen before has become quite a powerful method for protein structure prediction. Here, the query sequence is aligned (threaded) to a model sequence whose three-dimensional structure is known (the template protein). All proteins in a given protein structure database (usually, an appropriate representative set of structures) are tried, and each template is ranked using heuristic scoring functions. The score reflects the likelihood that the query sequence assumes the template structure. The approach of modelling a protein structure after a known template is called homology-based modelling and the selection of a suitable template protein is often done via protein threading.
Protein threading has three major objectives: first, to provide orthogonal evidence of possible homology for distantly related protein sequences; second, to detect possible homology in cases where sequence methods fail; and third, to improve structural models for the query sequence via structurally more accurate alignments.
There are several successful protein threading methods, including:
• methods based on hidden Markov models [41–48];





• dynamic programming methods based on profiles [49–51];
• environment compatibility (ie contact capacity potentials as used in the protein threader 123D) [52].

These programs are very fast. A mid-size protein sequence can be threaded against a database of about 1,500 protein structures in a few minutes on a PC or workstation. However, the underlying methods assume that the assignment of chemical properties to spatial regions in the protein is the same in the query protein and the template protein. This is not the case in practice, especially if one compares proteins with partly different folds or different functions. Extensions of the homology-based modelling approach to proteins with very similar protein structures but different chemical make-up require the solution of algorithmically provably hard problems and thus necessitate much more computing time. There are:
• heuristic approaches based on distance-based pair potentials of mean force [53–56];
• optimal or approximate combinatorial tree search techniques [57–59].

Such approaches need hours to thread a protein through a database of 1,500 templates. However, they can yield more accurate alignments and models of binding pockets of proteins.
The process of protein threading selects a suitable template protein for a protein query sequence and computes an alignment of the backbone of the two proteins that is the starting point for generating a structural model for the query protein based on the structure of the template protein. What is left is to place the side chains of the query protein and to model the loops of the query protein that are not modelled by the
• Modeller [60–64] and ModBase [65];
• Swiss-Model [66,67];
• or commercial versions included in Quanta (MSI) or Sybyl (Tripos, Inc.).

For protein side-chain modelling there are two contrasting approaches, based respectively on knowledge deduced from structural databases and on methods such as energy minimisation and molecular dynamics [68]. Methods based on side-chain rotamer libraries that have been created via the analysis of the protein structure database are usually employed to get a first model. Energy minimisation or molecular dynamics [69] is often used to refine the model. Such methods have been in use for crystallography/nuclear magnetic resonance (NMR) for many years and are available in several program packages and tools (Charmm [70], GROMOS/GROMACS [71,72] and many others [73,74]). In general these methods are quite computer-intensive and can only be exercised on one or a few proteins. Generally, the backbone alignment is an input to homology-based modelling tools and the quality of the derived models is highly sensitive to the accuracy of the provided alignments.
Loops are modelled by a related host of methods. Loops that involve more than about five residues are still hard to model [75–78].
The evaluation of the accuracy of assigning a protein fold (general protein architecture) to a query sequence is commonly based on generally accepted fold classifications such as SCOP [79] or CATH [80]. The quality of backbone alignments is much harder to rate, and no generally accepted scheme is available, as of today [81–84]. Rating the quality of protein structure models is generally based on the root mean square (rms) deviation of the model and the actual structure on a selected set of residues.
                                         template structure. These two tasks are                 The problem here is that the model must
                                         performed by homology-based                             be superposed with the actual structure.
                                         modelling tools such as:                                There are several tools that perform this


                          © HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000         279







task (DALI/FSSP,85,86 SSAP,87 VAST,88 PROSUP89 or SARF90), and they can yield different
results. Thus, there is no accepted gold standard for protein structure superposition.
However, for the purpose of rating the structures of target proteins, the available
superposition methods are sufficient.

PERFORMANCE OF PROTEIN STRUCTURE PREDICTION METHODS
There are strong efforts to render the quality of protein structure prediction methods
more transparent and easier to evaluate. The centre of these efforts is the biennial CASP
experiment, which rates protein structure prediction methods on blind predictions and
aims at developing standardised and generally agreed-upon assessment procedures for fold
identification, for the evaluation of alignment accuracy and for homology models. A blind
prediction is a prediction of the three-dimensional structure for a protein sequence made
at a time when the actual structure of the protein is not yet known. After the structure
has been resolved, the prediction is compared with the actual structure. There have been
three rounds of the CASP experiment;91 the fourth one follows this year. The CASP
experiment has been a significant help in providing a more solid basis for assessing the
power of different protein structure prediction methods.

For fold recognition, detectable progress has been observed from CASP1 to CASP2. In
CASP3, performance similar to that in CASP2 was achieved on more difficult targets. There
appears to be a certain limit of current fold recognition methods, which is still well
below the limit of detectable structural similarity (via structural comparisons). In
addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab
initio target fragments.

In CASP3, of 43 protein targets, 15 could be classified as comparative homology modelling
targets, ie related folds and accompanying alignments could be derived beyond doubt. For
more than half of the 21 more difficult cases, reasonable models could be predicted by at
least one of the participating prediction teams. In addition, the CAFASP subsection of
the assessment has demonstrated that 10 out of 19 folds could be solved via completely
automatic application of the best threading methods without any manual intervention.

Methods for refining rough structural models towards the true native structure of the
query protein are also not straightforward. This is an active area of research.92

A combination of protein threading followed by homology-based modelling cannot create
genuinely novel protein structures. But it turns out to be quite sensitive in creating
structure models based on known folds. Models that have been reasonably accurate (eg down
to 1.4 Å for some 60 amino acids of the active site of herpes virus thymidine kinase93)
have been reported in blind studies of proteins with a sequence identity to the template
protein of as low as 10 per cent. Correct folds can be assigned in many cases, even if
the query sequence and the suitable template exhibit a very low level of sequence
similarity (down to 5 per cent, ie far below the level of random sequence similarity of
17–18 per cent in optimal alignments).

STRUCTURAL GENOMICS
The goal of structural genomics projects is to solve experimental structures of all major
classes of protein folds systematically, independent of any functional interest in the
proteins.94,95 The aim is to chart the protein structure space efficiently; functional
annotations and/or assignment are made afterwards. This requires a thoroughly thought-out
strategy of mixing experimental protein structure determination, eg via X-ray, with
computer-based protein structure prediction. The experiments have to yield novel protein
structures. The proteins to be resolved experimentally are again




selected by computer. The computer part deduces the remaining structures based on
homology-based modelling and protein threading. One goal of the overall structural
genomics endeavour is to have an experimentally resolved protein structure within a
certain structural distance to any possible protein sequence, which allows for computing
reliable models for all protein sequences.

Once a map of the protein structure space is available, this knowledge should provide
additional insights into what the function of the protein in the cell is and with what
other partners it might interact. Such information should add to information gained from
high-throughput screening and biological assays. So far, glimpses of what will be
possible could be obtained by analysing complete genomes or large sets of proteins from
expression experiments with the structural knowledge available today, ie more or less
complete representative sets and a quite coarse coverage of structure space.63,96,97

METHODS FOR PREDICTING PROTEIN FUNCTION FROM PROTEIN STRUCTURE
Aspects of protein structure that are useful for drug design studies typically have to
involve three-dimensional structure. Predicting the secondary structure of the protein is
not sufficient. Even the similarity of the three-dimensional structures of two proteins
cannot be taken as an indication of a similar function of these proteins. The reason is
that protein structure is conserved much more than protein function. Indeed, protein
folds such as the TIM barrel (triose-phosphate isomerase) are quite ubiquitous and can be
considered as general scaffolds that lend molecular stability to the protein and are not
directly tied to its function. In contrast, the molecular function of the protein is tied
to local structural characteristics pertaining to binding pockets on the protein surface.
These characteristics are imprinted onto the protein structure by specific patterns of
amino acid side chains that make up the binding pocket. The conservation of these amino
acids is what makes two proteins have the same function. Since nature varies sequence
quite flexibly, this level of conservation is only maintained among orthologous proteins
that exhibit a high level of sequence similarity.

Thus, if the template protein from which we predict protein structure is not orthologous
to the query protein, other methods of function prediction have to be brought to bear. It
is quite natural to consider conservation patterns in the protein sequence here, such as
those exhibited in databases of functional sequence motifs such as PROSITE. An
alternative that has been investigated more recently is to analyse conservation in 3D
space.98 Experience shows that such ‘structural’ motifs provide more information than
motifs derived purely from sequence, even if the sequence motifs are distributed over
several regions (BLOCKS+, PRINTS). Recently, the notion of an approximate structural
motif, sometimes called fuzzy functional form (FFF),99 has been introduced. Using a
library of approximate structural motifs enhances the range of applicability of motif
search at the price of reduced sensitivity and specificity. Such approaches are supported
by the fact that, often, binding sites of proteins are much more conserved than the
overall protein structure (eg bacterial and eukaryotic serine proteases), such that an
inexact model can have an accurately modelled part responsible for function. As the
structural genomics projects produce a more and more complete picture of the protein
structure space, comprehensive libraries of highly discriminative structural motifs can
be expected.

The relationship between structure and function is a true many-to-many relation. Recent
studies have shown that particular functions could be mounted onto several different
protein folds100 and, conversely, several protein fold classes can




perform a wide range of functions.101 This limits our potential for deducing function
from structure. But knowledge of which folds support a given function and which functions
are based on a given fold can still help in predicting function from structure. In
addition, local structural templates such as FFFs indicative of a particular function can
identify similar sites and the associated function despite a globally different fold.
Such 3D patterns can also discriminate among globally similar folds with respect to
containing particular conserved 3D functional motifs in order to classify them into
different functional categories.

Though it is not easy to derive functions from resolved protein structures, the
availability of structural information improves the chances compared with relying on
sequence methods alone.

METHODS FOR DEVELOPING DRUGS BASED ON PROTEIN STRUCTURE
The object of drug design is to find or develop a (mostly small) drug molecule that
tightly binds to the target protein, moderating (often blocking) its function or
competing with natural substrates of the protein. Such a drug can be best found on the
basis of knowledge of the protein structure. If the spatial shape of the site of the
protein to which the drug is supposed to bind is known, then docking methods can be
applied to select suitable lead compounds that have the potential of being refined to
drugs. The speed of a docking method determines whether the method can be employed for
screening compound databases in the search for drug leads. A docking method that takes a
minute per instance can be used to screen up to thousands of compounds on a PC or
hundreds of thousands of drugs on a suitable parallel computer. Docking methods that take
the better part of an hour cannot be suitably employed for such large-scale screening
purposes. In order to screen really large drug databases with several hundred thousand
compounds, docking methods that can handle single protein/drug pairs within seconds are
needed.

The high conformational flexibility of small molecules as well as the subtle structural
changes in the protein binding pocket upon docking (induced fit) are major complications
in docking. Furthermore, docking necessitates careful analysis of the binding energy. The
energy model is cast into the so-called scoring function that rates the protein–ligand
complex energetically. Challenges in the energy model include the handling of entropic
contributions and solvation effects, and the computation of long-range forces in fast
docking methods.

The state of the art in docking can be summarised as follows (see also Table 1). Handling
the structural flexibility of the drug molecule can be done within the regime up to about
a minute per molecular complex on a PC (see, eg, Kramer et al.102). A suitable analysis
of the structural changes in the protein still necessitates more computing time.

Today, tools that are able to dock a molecule to a protein within seconds are still based
on rigid-body docking (the conformational flexibility of both the protein and the ligand
is omitted).

Recently, fast docking tools have been adapted to screening combinatorial drug


Table 1: Taxonomy of docking methods

Runtime on a PC                            Fraction of a second    About a minute    An hour or longer
Flexibility of the drug molecule           -                       X                 X
Flexibility of the protein binding site    -                       -                 X
Energy model                               None                    Short-range       Force field
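
The runtime regimes of Table 1 translate directly into how large a compound library can be screened. The following minimal sketch estimates the wall-clock time of a screen from the docking time per compound. The function name, the representative runtimes of 1, 60 and 3,600 seconds, the library size and the assumption of perfect parallel scaling are illustrative choices, not properties of any particular docking tool.

```python
def screening_days(n_compounds: int, seconds_per_compound: float,
                   n_processors: int = 1) -> float:
    """Wall-clock days needed to dock every compound once.

    Assumes (hypothetically) that the work splits perfectly across
    processors and that every compound takes the same time.
    """
    if n_compounds < 0 or seconds_per_compound < 0 or n_processors < 1:
        raise ValueError("counts and runtimes must be non-negative, processors >= 1")
    total_seconds = n_compounds * seconds_per_compound / n_processors
    return total_seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Representative runtimes for the three columns of Table 1 (assumed values).
    regimes = {
        "rigid body, about a second": 1.0,
        "flexible ligand, about a minute": 60.0,
        "flexible binding site, about an hour": 3_600.0,
    }
    library_size = 100_000  # illustrative library size
    for label, seconds in regimes.items():
        days = screening_days(library_size, seconds)
        print(f"{label}: {days:,.1f} days on one processor")
```

Under these assumptions, a method in the one-minute regime handles 1,440 compounds per processor per day, which is why a PC reaches thousands of compounds and a parallel machine is needed for hundreds of thousands.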







libraries (see, eg, Rarey and Lengauer103). Such libraries provide a carefully selected
set of molecular building blocks together with a small set of chemical reactions that
link the modules. In this way, a combinatorial library can theoretically provide a
diversity of up to billions of molecules from a small set of reactants.

The accuracy of docking predictions lies within 50–80 per cent ‘correct’ predictions,
depending on the evaluation measure and the method. That means that docking methods are
far from perfectly accurate. Nevertheless, they are very useful in pharmaceutical
practice. The major benefit of docking is that a large drug library can be ranked with
respect to the potential that its molecules have for being a useful lead compound for the
target protein in question. The quality of a method in this context can be measured by an
enrichment factor. Roughly, this is the ratio of the number of active compounds (drugs
that bind tightly to the protein) in a top fraction (say the top 1 per cent) of the
ranked drug database to the same figure in the randomly arranged drug database.
State-of-the-art docking methods in the middle regime (minutes per molecular pair), eg
FlexX,104 achieve enrichment factors of up to about 15. Fast methods (seconds per pair),
eg FeatureTrees,105 achieve similar enrichment factors, but deliver molecules similar to
known binding ligands and do not detect as diverse a range of binding molecules.

Even if the structure of the protein binding site is not known, computer-based methods
can be used to select promising lead compounds. Such methods compare the structure of a
molecule with that of a ligand that is known to bind to the protein, for instance, its
natural substrate.

Alternatives to docking for lead finding include high-throughput screening (HTS). This
laboratory method allows for testing the binding affinity of several thousand or more
compounds to the same target protein in a day. In comparison, this method has the
advantage that it does not have to deal with insufficiently powerful computer models, at
the expense of high laboratory cost and the absence of structural knowledge on ‘why’ a
compound binds to the protein.

CONCLUSION
In summary, the field is still in an early stage of development. Ab initio protein
structure prediction continues to be a grand challenge for which no comprehensive
solution is in sight. The quality of fold prediction based on homology is rising, and
tools have reached the stage where one can generate confident predictions for soluble
proteins that, in a substantial fraction (about half) of the cases, provide significant
threading hits in the structure database. Protein threading and homology-based prediction
become especially helpful in an environment where the methods can be used in concert with
experimental techniques for structure and function determination. Here, the prediction
methods can exercise their strengths, which lie in being used interactively by experts
and making suggestions that can be followed up by succeeding experimentation, rather than
being required to provide proven fact. The process of going from structure to function is
far from being automated. In a scenario that combines structure prediction methods with
experimentation, the step from structure to function can be performed in a customised
manner.

Protein structure prediction by homology is definitely not yet a turn-key technology. But
we can expect it to enter the ‘production’ stage through the activities in structural
genomics. Still, the field of protein structure prediction is very busy, generating the
tools and processes for raising the number of confident structure predictions and the
accompanying estimates of significance. Problems for applying these results in drug
design are not only that the models may not be sufficiently accurate but also that the
structures of many interesting





target proteins will not be accessible by homology-based modelling at all for some time to come. This includes the therapeutically particularly interesting class of membrane proteins, for which essentially no structures have been resolved.

Docking is used frequently in structure-based drug design. To the authors’ knowledge, the first drug developed with structure-based techniques was the carbonic anhydrase inhibitor Dorzolamide. In the past few years structural considerations have begun to pervade the design of new drugs. A case in point is that of the neuraminidase inhibitors for influenza. Such studies mostly involve experimentally resolved protein structures. However, even models can serve to guide drug development. Based on the experimentally resolved structure of the membrane protein bacteriorhodopsin, several groups are attempting to model binding sites of G-protein coupled receptors that are believed to be structurally similar. Nevertheless, the authors are not aware of any instance where the whole process chain from the protein sequence to the lead structure has been exercised in an integrated manner and with significant help from computer predictions. The field has not yet reached this level of maturity. While structural aspects, even as predicted by the computer, can be expected to invade the search for target proteins and the development of new drugs, experimental data, where they are accessible, will always be highly welcome and will often be indispensable in this process.

Acknowledgements
We thank Matthias Rarey for helpful comments on this paper and Gerhard Barnickel and Gerhard Klebe for information on the state of drugs developed by structure-based techniques.

References
1. Altschul, S. F., Gish, W., Miller, W. et al. (1990), ‘Basic local alignment search tool’, J. Mol. Biol., Vol. 215(3), pp. 403–410. http://ncbi.nlm.nih.gov/BLAST/
2. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25(17), pp. 3389–3402. http://ncbi.nlm.nih.gov/blast/psiblast.cgi
3. Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000), ‘The COG database: a tool for genome-scale analysis of protein functions and evolution’, Nucleic Acids Res., Vol. 28(1), pp. 33–36.
4. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), ‘A genomic perspective on protein families’, Science, Vol. 278(5338), pp. 631–637.
5. Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000), ‘ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons’, Nucleic Acids Res., Vol. 28(1), pp. 267–269.
6. Bateman, A., Birney, E., Durbin, R. et al. (2000), ‘The Pfam protein families database’, Nucleic Acids Res., Vol. 28(1), pp. 263–266.
7. Schultz, J., Milpetz, F., Bork, P. and Ponting, C. P. (1998), ‘SMART, a simple modular architecture research tool: identification of signaling domains’, Proc. Natl Acad. Sci. USA, Vol. 95(11), pp. 5857–5864.
8. Schultz, J., Copley, R. R., Doerks, T. et al. (2000), ‘SMART: a web-based tool for the study of genetically mobile domains’, Nucleic Acids Res., Vol. 28(1), pp. 231–234.
9. Attwood, T. K., Croning, M. D., Flower, D. R. et al. (2000), ‘PRINTS-S: the database formerly known as PRINTS’, Nucleic Acids Res., Vol. 28(1), pp. 225–227.
10. Henikoff, S., Henikoff, J. G. and Pietrokovski, S. (1999), ‘Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations’, Bioinformatics, Vol. 15(6), pp. 471–479.
11. Yona, G., Linial, N. and Linial, M. (2000), ‘ProtoMap: automatic classification of protein sequences and hierarchy of protein families’, Nucleic Acids Res., Vol. 28(1), pp. 49–55.
12. Yona, G., Linial, N. and Linial, M. (1999), ‘ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space’, Proteins, Vol. 37(3), pp. 360–378.
13. http://www.ebi.ac.uk/interpro/
14. Rose, G. D. (1979), ‘Hierarchic organization of domains in globular proteins’, J. Mol. Biol., Vol. 134(3), pp. 447–470.
15. Nichols, W. L., Rose, G. D., Ten Eyck, L. F. and Zimm, B. H. (1995), ‘Rigid domains in proteins: an algorithmic approach to
their identification’, Proteins, Vol. 23(1), pp. 38–48.
16. Gracy, J. and Argos, P. (1998), ‘Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities’, Bioinformatics, Vol. 14(2), pp. 174–187.
17. Gracy, J. and Argos, P. (1998), ‘DOMO: a new database of aligned protein domains’, Trends Biochem. Sci., Vol. 23(12), pp. 495–497.
18. Sowdhamini, R., Rufino, S. D. and Blundell, T. L. (1996), ‘A database of globular protein structural domains: clustering of representative family members into similar folds’, Fold Des., Vol. 1(3), pp. 209–220.
19. Jones, S., Stewart, M., Michie, A. et al. (1998), ‘Domain assignment for protein structures using a consensus approach: characterization and analysis’, Protein Sci., Vol. 7(2), pp. 233–242.
20. Orengo, C. A., Martin, A. M., Hutchinson, G. et al. (1998), ‘Classifying a protein in the CATH database of domain structures’, Acta Crystallogr. D Biol. Crystallogr., Vol. 54(1(Pt 6)), pp. 1155–1167.
21. Murzin, A. G. (1996), ‘Structural classification of proteins: new superfamilies’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 386–394.
22. Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995), ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247(4), pp. 536–540.
23. Fischer, D. and Eisenberg, D. (1999), ‘Predicting structures for genome proteins’, Curr. Opin. Struct. Biol., Vol. 9(2), pp. 208–211.
24. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), ‘Homology-based fold predictions for Mycoplasma genitalium proteins’, J. Mol. Biol., Vol. 280(3), pp. 323–326.
25. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), ‘A combined algorithm for genome-wide prediction of protein function’, Nature, Vol. 402(6757), pp. 83–86.
26. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W. et al. (1999), ‘Detecting protein function and protein-protein interactions from genome sequences’, Science, Vol. 285(5428), pp. 751–753.
27. Pellegrini, M., Marcotte, E. M., Thompson, M. J. et al. (1999), ‘Assigning protein functions by comparative genome analysis: protein phylogenetic profiles’, Proc. Natl Acad. Sci. USA, Vol. 96(8), pp. 4285–4288.
28. Bork, P., Dandekar, T., Diaz-Lazcoz, Y. et al. (1998), ‘Predicting function: from genes to genomes and back’, J. Mol. Biol., Vol. 283(4), pp. 707–725.
29. Roemer, K., Johnson, P. A. and Friedmann, T. (1991), ‘Knock-in and knock-out: Transgenes, Development and Disease: A Keystone Symposium sponsored by Genentech and Immunex, Tamarron, CO, USA, January 12–18 1991’, New Biol., Vol. 3(4), pp. 331–335.
30. Sato, T. N. (1999), ‘Gene trap, gene knockout, gene knock-in, and transgenics in vascular development’, Thromb. Haemost., Vol. 82(2), pp. 865–869.
31. Collins, F. S., Guyer, M. S. and Chakravarti, A. (1997), ‘Variations on a theme: cataloging human DNA sequence variation’, Science, Vol. 278(5343), pp. 1580–1581.
32. Brookes, A. J. (1999), ‘The essence of SNPs’, Gene, Vol. 234(2), pp. 177–186.
33. Kuska, B. (1999), ‘Snipping “SNPs”: a new tool for mining gene variations’, J. Natl Cancer Inst., Vol. 91(13), p. 1110.
34. Vilain, E. (1998), ‘CYPs, SNPs, and molecular diagnosis in the postgenomic era’, Clin. Chem., Vol. 44(12), pp. 2403–2404.
35. Collins, F. S. (1999), ‘Shattuck lecture – medical and societal consequences of the Human Genome Project’, N. Engl. J. Med., Vol. 341(1), pp. 28–37.
36. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research II. Issues in study design and gene mapping’, Ann. Epidemiol., Vol. 9(2), pp. 75–90.
37. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research III. Bioinformatics and statistical genetic methods’, Ann. Epidemiol., Vol. 9(4), pp. 207–224.
38. Terwilliger, J. D. and Ott, J. (1994), ‘Handbook of Human Genetic Linkage’, Johns Hopkins University Press, Baltimore.
39. Drews, J. (1996), ‘Genomic sciences and the medicine of tomorrow’, Nat. Biotechnol., Vol. 14(11), pp. 1516–1518.
40. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), ‘Ab initio protein structure prediction of CASP III targets using ROSETTA’, Proteins, Vol. 37(S3), pp. 171–176.
41. Karchin, R. and Hughey, R. (1998), ‘Weighting hidden Markov models for maximum discrimination’, Bioinformatics, Vol. 14(9), pp. 772–782.
42. Bateman, A., Birney, E., Durbin, R. et al. (1999), ‘Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins’, Nucleic Acids Res., Vol. 27(1), pp. 260–262.
43. Park, J., Karplus, K., Barrett, C. et al. (1998), ‘Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods’, J. Mol. Biol., Vol. 284(4), pp. 1201–1210.
44. Barrett, C., Hughey, R. and Karplus, K. (1997), ‘Scoring hidden Markov models’, Comput. Appl. Biosci., Vol. 13(2), pp. 191–199.
45. McClure, M. A., Smith, C. and Elton, P. (1996), ‘Parameterization studies for the SAM and HMMER methods of hidden Markov model generation’, ‘Proc. 4th International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 155–164.
46. Eddy, S. R. (1998), ‘Profile hidden Markov models’, Bioinformatics, Vol. 14(9), pp. 755–763.
47. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998), ‘Pfam: multiple sequence alignments and HMM-profiles of protein domains’, Nucleic Acids Res., Vol. 26(1), pp. 320–322.
48. Eddy, S. R. (1996), ‘Hidden Markov models’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 361–365. (1995), ‘Proc. 3rd International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 114–120.
49. Bowie, J. U., Luthy, R. and Eisenberg, D. (1991), ‘A method to identify protein sequences that fold into a known three-dimensional structure’, Science, Vol. 253(5016), pp. 164–170.
50. Luthy, R., Bowie, J. U. and Eisenberg, D. (1992), ‘Assessment of protein models with three-dimensional profiles’, Nature, Vol. 356(6364), pp. 83–85.
51. Luthy, R., Xenarios, I. and Bucher, P. (1994), ‘Improving the sensitivity of the sequence profile method’, Protein Sci., Vol. 3(1), pp. 139–146.
52. Alexandrov, N. N., Nussinov, R. and Zimmer, R. M. (1996), ‘Fast protein fold recognition via sequence to structure alignment and contact capacity potentials’, Pacific Symposium on Biocomputing, pp. 53–72.
53. Sippl, M. J. (1990), ‘Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins’, J. Mol. Biol., Vol. 213(4), pp. 859–883.
54. Hendlich, M., Lackner, P., Weitckus, S. et al. (1990), ‘Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force’, J. Mol. Biol., Vol. 216(1), pp. 167–180.
55. Sippl, M. J. (1995), ‘Knowledge-based potentials for proteins’, Curr. Opin. Struct. Biol., Vol. 5(2), pp. 229–235.
56. Sippl, M. J. and Flockner, H. (1996), ‘Threading thrills and threats’, Structure, Vol. 4(1), pp. 15–19.
57. Lathrop, R. H. and Smith, T. F. (1996), ‘Global optimum protein threading with gapped alignment and empirical pair score functions’, J. Mol. Biol., Vol. 255(4), pp. 641–665.
58. Thiele, R., Zimmer, R. and Lengauer, T. (1999), ‘Protein threading by recursive dynamic programming’, J. Mol. Biol., Vol. 290(3), pp. 757–779.
59. Xu, Y., Xu, D. and Uberbacher, E. C. (1998), ‘An efficient computational method for globally optimal threading’, J. Comput. Biol., Vol. 5(3), pp. 597–614.
60. Sali, A. (1995), ‘Modeling mutations and homologous proteins’, Curr. Opin. Biotechnol., Vol. 6(4), pp. 437–451.
61. Sali, A., Potterton, L., Yuan, F. et al. (1995), ‘Evaluation of comparative protein modeling by MODELLER’, Proteins, Vol. 23(3), pp. 318–326.
62. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.
63. Sanchez, R. and Sali, A. (1998), ‘Large-scale protein structure modeling of the Saccharomyces cerevisiae genome’, Proc. Natl Acad. Sci. USA, Vol. 95(23), pp. 13597–13602.
64. Sanchez, R. and Sali, A. (1997), ‘Evaluation of comparative protein structure modeling by MODELLER-3’, Proteins, Suppl 1, pp. 50–58.
65. Sanchez, R., Pieper, U., Mirkovic, N. et al. (2000), ‘MODBASE, a database of annotated comparative protein structure models’, Nucleic Acids Res., Vol. 28(1), pp. 250–253.
66. Guex, N., Diemand, A. and Peitsch, M. C. (1999), ‘Protein modelling for all’, Trends Biochem. Sci., Vol. 24(9), pp. 364–367.
67. Guex, N. and Peitsch, M. C. (1997), ‘SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling’, Electrophoresis, Vol. 18(15), pp. 2714–2723.
68. Petrella, R. J., Lazaridis, T. and Karplus, M. (1998), ‘Protein sidechain conformer
prediction: a test of the energy function’, Fold Des., Vol. 3(5), pp. 353–377.
69. Karplus, M. and Petsko, G. A. (1990), ‘Molecular dynamics simulations in biology’, Nature, Vol. 347(6294), pp. 631–639.
70. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D. et al. (1983), ‘CHARMM: A program for macromolecular energy, minimization, and dynamics calculation’, J. Comp. Chem., Vol. 4, pp. 187–213.
71. Van Gunsteren, W. F. and Berendsen, H. J. (1982), ‘Molecular dynamics: perspective for complex systems’, Biochem. Soc. Trans., Vol. 10(5), pp. 301–305.
72. Van Gunsteren, W. F. and Berendsen, H. J. (1990), ‘Moleküldynamik-Computersimulationen: Methodik, Anwendungen und Perspektiven in der Chemie’, Angew. Chem., Vol. 102, pp. 1020–1055.
73. Levitt, M. (1983), ‘Protein folding by restrained energy minimization and molecular dynamics’, J. Mol. Biol., Vol. 170(3), pp. 723–764.
74. Novotny, J., Bruccoleri, R. and Karplus, M. (1984), ‘An analysis of incorrectly folded protein models. Implications for structure predictions’, J. Mol. Biol., Vol. 177(4), pp. 787–818.
75. van Vlijmen, H. W. and Karplus, M. (1997), ‘PDB-based protein loop prediction: parameters for selection and methods for optimization’, J. Mol. Biol., Vol. 267(4), pp. 975–1001.
76. Lessel, U. and Schomburg, D. (1997), ‘Creation and characterization of a new, non-redundant fragment data bank’, Protein Eng., Vol. 10(6), pp. 659–664.
77. Lessel, U. and Schomburg, D. (1999), ‘Importance of anchor group positioning in protein loop prediction’, Proteins, Vol. 37(1), pp. 56–64.
78. Fechteler, T., Dengler, U. and Schomburg, D. (1995), ‘Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria’, J. Mol. Biol., Vol. 253(1), pp. 114–131.
82. Marchler-Bauer, A., Levitt, M. and Bryant, S. H. (1997), ‘A retrospective analysis of CASP2 threading predictions’, Proteins, Suppl 1, pp. 83–91.
83. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘A measure of success in fold recognition’, Trends Biochem. Sci., Vol. 22(7), pp. 236–240.
84. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
85. Holm, L. and Sander, C. (1998), ‘Dictionary of recurrent domains in protein structures’, Proteins, Vol. 33(1), pp. 88–96.
86. Holm, L. and Sander, C. (1998), ‘Touring protein fold space with Dali/FSSP’, Nucleic Acids Res., Vol. 26(1), pp. 316–319.
87. Orengo, C. A. and Taylor, W. R. (1996), ‘SSAP: sequential structure alignment program for protein structure comparison’, Methods Enzymol., Vol. 266, pp. 617–635.
88. Gibrat, J. F., Madej, T. and Bryant, S. H. (1996), ‘Surprising similarities in structure comparison’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 377–385.
89. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
90. Alexandrov, N. N. (1996), ‘SARFing the PDB’, Protein Eng., Vol. 9(9), pp. 727–732.
91. Lattman, E. E. (ed.) (1999), ‘Third Meeting on the Critical Assessment of Techniques for Protein Structure Prediction’, Proteins, Vol. 37, Suppl. 3.
92. Kolinski, A., Rotkiewicz, P., Ilkowski, B. and Skolnick, J. (1999), ‘A method for the improvement of threading-based protein models’, Proteins, Vol. 37(4), pp. 592–610.
93. Zimmer, R. and Thiele, R. (1997), ‘Fast protein fold recognition and accurate sequence–structure alignment’, in ‘German Conference on Bioinformatics, GCB ’96’, Hofestädt, R., Lengauer, T., Löffler, M. and Schomburg, D. Eds, Springer, Berlin, pp. 137–148.
                                79. Lo Conte, L., Ailey, B., Hubbard, T. J. et al.
                                    (2000), ‘SCOP: a structural classification of        94. Kim, S. H. (1998), ‘Shining a light on
                                    proteins database’, Nucleic Acids Res., Vol.             structural genomics’, Nat. Struct. Biol., Vol.
                                    28(1), pp. 257–259.                                      5 Suppl., pp. 643–645.
                                80. Orengo, C. A., Michie, A. D., Jones, S. et al.       95. Montelione, G. T. and Anderson, S. (1999),
                                    (1997), ‘CATH – a hierarchic classification              ‘Structural genomics: keystone for a
                                    of protein domain structures’, Structure, Vol.           Human Proteome Project’, Nat. Struct.
                                    5(8), pp. 1093–1108.                                     Biol., Vol. 6(1), pp. 11–12.
                                81. Marchler-Bauer, A. and Bryant, S. H.                 96. Sali, A. (1998), ‘100,000 protein structures
                                    (1997), ‘Measures of threading specificity               for the biologist’, Nat. Struct. Biol.,
                                    and accuracy’, Proteins, Suppl 1, pp. 74–82.             Vol. 5(12), pp. 1029–1032.


  • 1. Protein structure prediction methods for drug design Thomas Lengauer was a professor of Computer Protein structure prediction Science at the University of Paderborn, before he joined GMD, the German National methods for drug design Research Centre for Thomas Lengauer and Ralf Zimmer Information Technology, in Date received (in revised form): 4th July 2000 1992 as Director of the Institute for Algorithms and Scientific Computing. Jointly, Abstract he is Professor of Computer Along the long path from genomic data to a new drug, the knowledge of three-dimensional Science at the University of Bonn. His research interests protein structure can be of significant help in several places. This paper points out such places, include computational biology discusses the virtues of protein structure knowledge and reviews bioinformatics methods for and bioinformatics, gaining such knowledge on the protein structure. computational chemistry and combinatorial optimisation problems in technological applications. INTRODUCTION NOTIONS OF PROTEIN Ralf Zimmer FUNCTION The long path from genomic data to a is a research scientist at GMD. He directs the research new drug can conceptually be divided The increased accessibility of genomic group on algorithmic into two parts (see left side of Figure 1). data and, especially, that of large-scale structural genomics. His The first task is to select a target protein expression data has opened new research interests include whose molecular function is to be possibilities for the search for target algorithms and statistical methods for genomics, moderated, in many cases blocked, by a proteins. This development has proteomics, protein sequence drug molecule binding to it. Given the prompted large-scale investments into and structure analysis, and target protein, the second task is to the new technology by many target finding, as well as select a suitable drug that binds to the pharmaceutical companies. 
The connections between molecular biology and protein tightly, is easy to synthesise, is respective screening experiments rely computing (DNA computing). bio-accessible and has no adverse effects critically on appropriate bioinformatics such as toxicity. The knowledge of the support for interpreting the generated three-dimensional structure of a protein data. Specifically, methods are required can be of significant help in both phases. to identify interesting differentially Keywords: protein structure The steric and physicochemical expressed genes and to predict the prediction, protein target, protein–ligand docking complementarity of the binding site of function and structure of putative target the protein and the drug molecule is an proteins from differential expression data important, if not the dominating, feature generated in an appropriate screening of strong binding. Thus, in many cases, experiment. the knowledge of the protein structure Protein function is a colourful notion affords well-founded hypotheses of the whose meaning can range over several function of the protein. If the structure levels: of the relevant binding site of the protein is known in detail, we can even q a very general classification (globular, start to employ structure-based methods enzyme, hormone, structural protein, in order to develop a drug binding viral capsid protein, transmembrane Thomas Lengauer, tightly to the protein. protein, etc.); Institute for Algorithms and Scientific Computing (SCAI), In this paper bioinformatics methods GMD – National Research for prediction aspects of the protein q biochemical function (biochemical Center for Information structure are described and their use reaction, enzyme specificity, binding Technology, Sankt Augustin, towards the goal of drug design is partners, cofactors); Germany D53754. discussed. 
The possibilities and limitations of using protein structure knowledge q classification via broad cellular function Tel: +49 2241 14 2776/2777 Fax: +49 2241 14 2656 towards the goal of developing new drug (interaction with DNA and other E-mail: lengauer@gmd.de therapies are also discussed. proteins, cellular localisation); © HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000 275 08-lengauer.p65 275 9/19/00, 1:49 PM
  • 2. Lengauer and Zimmer Genome/Organism/Disease Target Protein Search Structure Families Evolutionary Expression Phenotyp SNPs, Linkage SEARCH Para-/Analogs Information Genotyp Mutations Target Protein IDENTIFY Structure Sequence Fusion Co-Evolution Co-Expression Motifs Target Protein Function MODEL Structure Assay/ Target Protein Structure Screening Drug Lead DESIGN Rational Drug Design Search Ligand Computer Docking Combinatorial Design HTS Libraries HTS Trial&Error Target Lead Structure / Drug Figure 1 q broad phenotypic function (changes function simply because they originate observed for organisms with deleted or from a common ancestor and they still mutated genes); fulfil their role within the cellular processes, mutations occur independently q identification of detailed physiological after speciation events. Depending on the function such as the localisation in a extent of the evolutionary changes, the metabolic or regulatory pathway and recognition of homology or orthology the associated cellular role of the among proteins can be difficult, but still protein; in these cases consistent evidence for relatedness should be expected on the q identification of molecular binding sequence, structure and function levels. partners and their mode of interaction Sometimes, the situation is complicated with the protein. because of gene duplications within a species leading to paralogous copies of the The derivation of protein function from same gene. These paralogous copies are protein sequence by theoretical means is subject to evolutionary changes and the commonly performed by transferring evolutionary pressure on structure or functional information from related function is much relaxed for all but one proteins (eg from other organisms). copy, which still serves the original Usually the transfer is from proteins purpose, such that greater deviations in whose function has been established with sequence, structure and function occur for experimental evidence. 
The establishment these copies. As still considerable, ie of the relevant protein relationship based significantly more than random, sequence on sequence is complicated by some similarity among paralogous proteins can subtleties of evolutionary processes. be observed, this messes up the situation, Though it is often true that organisms leading to erroneous transfer of functions share related proteins with similar to already functionally disabled or sequence, similar structure and the same functionally completely different proteins. 276 © HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000 08-lengauer.p65 276 9/19/00, 1:49 PM
  • 3. Protein structure prediction methods for drug design Therefore, in the following, we have to families that form clusters of structurally distinguish between three notions. or functionally related proteins are helpful similarity Similarity is a quantitative measure on the in the prediction of protein function in sequence, structure or function level. these cases. There are several protein homology Homology is used when there is a clear classifications available on the internet established or potential (assumed, that can serve for this purpose predicted) evolutionary relationship orthology between proteins. The term orthology, in q COGS3,4 addition, indicates homologous proteins with (established or potential) the same q ProDom5 or at least similar function. The notion of paralogy paralogy, in contrast, is used, when q PFAM6 homologous proteins are expected to have evolved enough to expect changes q SMART7,8 in function (with or without a change in 3D structure). q PRINTS9 For drug design, we need to know more of the function of the protein than q Blocks+10 follows from just a general classification. It would be best both to know natural q ProtoMap11,12 binding partners and to have a detailed structural model of the binding sites of A number of these databases (Pfam, the protein. PROSITE, PRINTS, ProDom, SWISSPROT+TREMBL) are currently METHODS OF being united in the InterPro13 database. PREDICTING PROTEIN Since protein function is basically tied to FUNCTION protein domains, protein domain analysis There are a number of ways to predict is an integral part of the methodology protein function from sequence. Most of that leads to protein family databases.14–22 them are based on sequence similarity. A Since only 20–40 per cent of the large database of protein sequences is protein sequences in a genome such as screened for ‘model sequences’ that Mycoplasma genitalium, M. janaschii and M. 
exhibit a high level of similarity to the tuberculosis have significant sequence query protein sequence. Sequence similarity to proteins of known BLAST alignment tools such as BLAST1 and function,23,24 we need to be able to make PSI-BLAST PSI-BLAST2 are the work-horses of such conclusions on the function of proteins analyses. If one or more model sequences that exhibit no significant sequence are found that exhibit a sufficiently high similarity to suitable model proteins. As level of similarity to the query sequence the similarity between query sequence and about whose function we have some and model sequence decreases below a knowledge, then conclusions may be threshold of, say, 25 per cent, safe possible on the function of the query conclusions on a common evolutionary sequence. If the homology is above, say, origin of the query sequence and the 40 per cent and functionally important model sequence can no longer be made. motifs are conserved then we can However, it turns out that, in many cases, hypothesise that the query sequence has a the protein fold can still be reliably function that is quite similar to that of predicted, and in several cases even the model sequence. As the level of detailed structural models of protein similarity decreases, the conclusions on binding sites can be generated. Thus, function that can be drawn from especially in this similarity range, protein sequence similarity become less and less structure prediction – again together with protein classifications reliable. Classifications of proteins into the identification of conserved sequence © HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000 277 08-lengauer.p65 277 9/19/00, 1:49 PM
  • 4. Lengauer and Zimmer or spatial motifs – can help to ascertain them in more detail here. While these aspects of protein function. methods are reported to generate Other sources of information beside significant insight into protein function on sequence similarity have been explored in a higher level and to point to putative order to gain insight into protein target proteins,39 in the end, drug design function. These methods are represented can be expected to necessitate structural by five arrows pointing downwards in the knowledge of either the target protein or top right part of Figure 1. The following its binding partners. comments on these methods apply in the order from left to right: METHODS FOR PREDICTING PROTEIN sequence alignment q Sequence alignment has long been used STRUCTURE for ascertaining protein function. This is In the authors’ view, computational the standard method and we methods for predicting protein structure commented on it above. This approach from sequence alone are still well out of is only reliable if there is high sequence range, although, there are recent similarity such that we can argue about methodical advances – sometimes called orthologous proteins, since we know mini-threading – that are based on the the function of one of the proteins. assembly of fragments (see eg ROSETTA40). In contrast, modelling q Recently, the Rosetta stone method has protein structures after folds that have been been introduced. This method uses over seen before has become quite a powerful 20 completely sequenced genomes and method for protein structure prediction. analyses evolutionary correlations of two Here, the query sequence is aligned domains being fused into one protein in (threaded) to a model sequence whose one species and occurring in separate three-dimensional structure is known (the proteins in another species. From these template protein). 
All proteins in a given classifications the method establishes protein structure database – usually, an pairwise links between functionally appropriate representative set of structures related proteins25 and elicits putative are tried — and each template is ranked protein–protein interactions.26 using heuristic scoring functions. The score reflects the likelihood that the query q For the same purpose, the phylogenetic sequence assumes the template structure. profile method analyses the co- The approach of modelling a protein occurrence of genes in the genomes of structure after a known template is called homology-based different organisms.27 homology-based modelling and the selection modelling of a suitable template protein is often done protein threading q The analysis of change of phenotype via protein threading. based on mutated genes (eg by knock- Protein threading has three major out experiments) yields important objectives: first, to provide orthogonal information on aspects of protein evidence of possible homology for function.28–30 distantly related protein sequences; second, to detect possible homology in q In the future, the analysis of genetic cases where sequence methods fail; and variations31 among individuals, eg single third, to improve structural models for nucleotide polymorphisms (SNPs),32–34 the query sequence via structurally more will be helpful in ascertaining protein accurate alignments. function beyond mere disease linkage or There are several successful protein association (right arrow in Figure 1).35–38 threading methods, including: None of these methods looks at protein q methods based on hidden Markov structures, and thus we do not discuss models;41–48 278 © HENRY STEWART PUBLIC ATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 1. NO 3. 275–288. SEPTEMBER 2000 08-lengauer.p65 278 9/19/00, 1:49 PM
- dynamic programming methods based on profiles [49–51];

- environment compatibility (ie contact capacity potentials as used in the protein threader 123D) [52].

These programs are very fast. A mid-size protein sequence can be threaded against a database of about 1,500 protein structures in a few minutes on a PC or workstation. However, the underlying methods assume that the assignment of chemical properties to spatial regions in the protein is the same in the query protein and the template protein. This is not the case in practice, especially if one compares proteins with partly different folds or different functions. Extensions of the homology-based modelling approach to proteins with very similar protein structures but different chemical make-up require the solution of algorithmically provably hard problems and thus necessitate much more computing time. There are:

- heuristic approaches based on distance-based pair potentials of mean force [53–56];

- optimal or approximate combinatorial tree search techniques [57–59].

Such approaches need hours to thread a protein through a database of 1,500 templates. However, they can yield more accurate alignments and models of binding pockets of proteins.

The process of protein threading selects a suitable template protein for a protein query sequence and computes an alignment of the backbone of the two proteins that is the starting point for generating a structural model for the query protein based on the structure of the template protein. What is left is to place the side chains of the query protein and to model the loops of the query protein that are not modelled by the template structure. These two tasks are performed by homology-based modelling tools such as:

- Modeller [60–64] and ModBase [65];

- Swiss-Model [66,67];

- or commercial versions included in Quanta (MSI) or Sybyl (Tripos, Inc.).

For protein side-chain modelling there are two contrasting approaches, based on knowledge deduced from structural databases and on methods such as energy minimisation and molecular dynamics [68], respectively. Methods based on side-chain rotamer libraries that have been created via the analysis of the protein structure database are usually employed to get a first model. Energy minimisation or molecular dynamics [69] is often used to refine the model. Such methods have been in use for crystallography/nuclear magnetic resonance (NMR) for many years and are available in several program packages and tools (CHARMM [70], GROMOS/GROMACS [71,72] and many others [73,74]). In general these methods are quite computer-intensive and can only be exercised on one or a few proteins.

Generally, the backbone alignment is an input to homology-based modelling tools, and the quality of the derived models is highly sensitive to the accuracy of the provided alignments.

Loops are modelled by a related host of methods. Loops that involve more than about five residues are still hard to model [75–78].

The evaluation of the accuracy of assigning a protein fold (general protein architecture) to a query sequence is commonly based on generally accepted fold classifications such as SCOP [79] or CATH [80]. The quality of backbone alignments is much harder to rate, and no generally accepted scheme is available as of today [81–84]. Rating the quality of protein structure models is generally based on the root mean square (rms) deviation of the model and the actual structure on a selected set of residues. The problem here is that the model must be superposed with the actual structure. There are several tools that perform this
task – DALI/FSSP [85,86], SSAP [87], VAST [88], PROSUP [89] or SARF [90] – and they can yield different results. Thus, there is no accepted gold standard for protein structure superposition. However, for the purpose of rating the structures of target proteins, the available superposition methods are sufficient.

PERFORMANCE OF PROTEIN STRUCTURE PREDICTION METHODS

There are strong efforts to render the quality of protein structure prediction methods more transparent and easier to evaluate. The centre of these efforts is the biennial CASP experiment, which rates protein structure prediction methods on blind predictions and aims at developing standardised and generally agreed upon assessment procedures both for fold identification and the evaluation of alignment accuracy as well as homology models. A blind prediction is a prediction of the three-dimensional structure for a protein sequence at a time at which the actual structure of the protein is not (yet) known. After the structure has been resolved, the prediction is compared with the actual structure. There have been three issues of the CASP experiment [91]; the fourth one follows this year. The CASP experiment has been a significant help in providing a more solid basis for assessing the power of different protein structure prediction methods.

For fold recognition, detectable progress has been observed from CASP1 to CASP2. In CASP3, similar performance as in CASP2 was achieved on more difficult targets. There appears to be a certain limit of current fold recognition methods, which is still well below the limit of detectable structural similarity (via structural comparisons). In addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab initio target fragments.

In CASP3, of 43 protein targets, 15 could be classified as comparative homology modelling targets, ie related folds and accompanying alignments could be derived beyond doubt. For more than half of the 21 more difficult cases reasonable models could be predicted by at least one of the participating prediction teams. In addition, the CAFASP subsection of the assessment has demonstrated that 10 out of 19 folds could be solved via completely automatic application of the best threading methods without any manual intervention.

Methods for refining rough structural models towards the true native structure of the query protein are also not straightforward. This is an active area of research [92].

A combination of protein threading followed by homology-based modelling cannot create genuinely novel protein structures. But it turns out to be quite sensitive in creating structure models based on known folds. Models that have been reasonably accurate (eg down to 1.4 Å for some 60 amino acids of the active site of herpes virus thymidine kinase [93]) have been reported in blind studies of proteins with a sequence identity to the template protein of as low as 10 per cent. Correct folds can be assigned in many cases, even if the query sequence and the suitable template exhibit a very low level of sequence similarity (down to 5 per cent, ie far below the level of random sequence similarity of 17–18 per cent in optimal alignments).

STRUCTURAL GENOMICS

The goal of structural genomics projects is to solve experimental structures of all major classes of protein folds systematically, independent of some functional interest in the proteins [94,95]. The aim is to chart the protein structure space efficiently; functional annotations and/or assignment are made afterwards. This affords a thoroughly thought-out strategy of mixing experimental protein structure determination, eg via X-ray, with computer-based protein structure prediction. The experiments have to yield novel protein structures. The proteins to be resolved experimentally are again
selected by computer. The computer part deduces the remaining structures based on homology-based modelling and protein threading. One goal of the overall structural genomics endeavour is to have an experimentally resolved protein structure within a certain structural distance to any possible protein sequence, which allows for computing reliable models for all protein sequences.

Once a map of the protein structure space is available, this knowledge should provide additional insights on what the function of the protein in the cell is and with what other partners it might interact. Such information should add to the information gained from high-throughput screening and biological assays. So far, glimpses of what will be possible could be obtained by analysing complete genomes or large sets of proteins from expression experiments with the structural knowledge available today, ie more or less complete representative sets and a quite coarse coverage of structure space [63,96,97].

METHODS FOR PREDICTING PROTEIN FUNCTION FROM PROTEIN STRUCTURE

Aspects of protein structure that are useful for drug design studies typically have to involve three-dimensional structure. Predicting the secondary structure of the protein is not sufficient. Even the similarity of the three-dimensional structures of two proteins cannot be taken as an indication for a similar function of these proteins. The reason is that protein structure is conserved much more than protein function. Indeed, protein folds such as the TIM barrel (triose-phosphate isomerase) are quite ubiquitous and can be considered as general scaffolds that lend molecular stability to the protein and are not directly tied to its function. In contrast, the molecular function of the protein is tied to local structural characteristics pertaining to binding pockets on the protein surface. These characteristics are imprinted onto the protein structure by specific patterns of amino acid side chains that make up the binding pocket. The conservation of these amino acids is what makes two proteins have the same function. Since nature varies sequence quite flexibly, this level of conservation is only maintained among orthologous proteins that exhibit a high level of sequence similarity.

Thus, if the template protein from which we predict protein structure is not orthologous to the query protein, other methods of function prediction have to come to bear. It is quite natural to consider conservation patterns in the protein sequence here, such as exhibited in databases containing functional sequence motifs such as PROSITE. An alternative that has been investigated more recently is to analyse conservation in 3D space [98]. Experience shows that such ‘structural’ motifs provide more information than motifs derived purely from sequence, even if the sequence motifs are distributed over several regions (BLOCKS+, PRINTS). Recently, the notion of an approximate structural motif has been introduced – sometimes called fuzzy functional form (FFF) [99]. Using a library of approximate structural motifs enhances the range of applicability of motif search at the price of reduced sensitivity and specificity. Such approaches are supported by the fact that, often, binding sites of proteins are much more conserved than the overall protein structure (eg bacterial and eukaryotic serine proteases), such that an inexact model can have an accurately modelled part responsible for function. As the structural genomics projects produce a more and more complete picture of the protein structure space, comprehensive libraries of highly discriminative structural motifs can be expected.

The relationship between structure and function is a true many-to-many relation. Recent studies have shown that particular functions could be mounted onto several different protein folds [100] and, conversely, several protein fold classes can
perform a wide range of functions [101]. This limits our potential of deducing function from structure. But knowledge on which folds support a given function and which functions are based on a given fold can still help in predicting function from structure. In addition, local structural templates such as FFFs indicative for a particular function can identify similar sites and the associated function despite a globally different fold. Such 3D patterns can also discriminate among globally similar folds with respect to containing particular conserved 3D functional motifs in order to classify them into different functional categories.

Though it is not easy to derive functions from resolved protein structures, the availability of structural information improves the chances compared with relying on sequence methods alone.

METHODS FOR DEVELOPING DRUGS BASED ON PROTEIN STRUCTURE

The object of drug design is to find or develop a, mostly small, drug molecule that tightly binds to the target protein, moderating (often blocking) its function or competing with natural substrates of the protein. Such a drug can be best found on the basis of knowledge of the protein structure. If the spatial shape of the site of the protein to which the drug is supposed to bind is known, then docking methods can be applied to select suitable lead compounds that have the potential of being refined to drugs. The speed of a docking method determines whether the method can be employed for screening compound databases in the search for drug leads. A docking method that takes a minute per instance can be used to screen up to thousands of compounds on a PC or hundreds of thousands of drugs on a suitable parallel computer. Docking methods that take the better part of an hour cannot be suitably employed for such large-scale screening purposes. In order to screen really large drug databases with several hundred thousand compounds, docking methods that can handle single protein/drug pairs within seconds are needed.

The high conformational flexibility of small molecules as well as the subtle structural changes in the protein binding pocket upon docking (induced fit) are major complications in docking. Furthermore, docking necessitates careful analysis of the binding energy. The energy model is cast into the so-called scoring function that rates the protein–ligand complex energetically. Challenges in the energy model include the handling of entropic contributions and solvation effects, and the computation of long-range forces in fast docking methods.

The state of the art in docking can be summarised as follows (see also Table 1). Handling the structural flexibility of the drug molecule can be done within the regime up to about a minute per molecular complex on a PC (see, eg, Kramer et al. [102]). A suitable analysis of the structural changes in the protein still necessitates more computing time. Today, tools that are able to dock a molecule to a protein within seconds are still based on rigid-body docking (both the protein and ligand conformational flexibility is omitted).

Recently, fast docking tools have been adapted to screening combinatorial drug

Table 1: Taxonomy of docking methods

Runtime on a PC                           Fraction of a second   About a minute   An hour or longer
Flexibility of the drug molecule          -                      X                X
Flexibility of the protein binding site   -                      -                X
Energy model                              None                   Short-range      Force field
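The runtime regimes of Table 1 translate into screening throughput by simple arithmetic. The following Python sketch is only an illustration of that arithmetic, not part of any docking tool: the function names, the assumption of perfect parallel scaling, and the example values (0.5 s per pair, 100 processors, a library of 300,000 compounds) are hypothetical choices made here.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def compounds_per_day(seconds_per_pair: float, processors: int = 1) -> int:
    """Protein/ligand pairs docked in 24 hours.

    Assumes (hypothetically) perfect parallel scaling over `processors`.
    """
    if seconds_per_pair <= 0 or processors < 1:
        raise ValueError("runtime must be positive and processors >= 1")
    return int(SECONDS_PER_DAY * processors // seconds_per_pair)


def days_to_screen(library_size: int, seconds_per_pair: float,
                   processors: int = 1) -> float:
    """Wall-clock days needed to dock every compound of a library once."""
    if seconds_per_pair <= 0 or processors < 1:
        raise ValueError("runtime must be positive and processors >= 1")
    return library_size * seconds_per_pair / (SECONDS_PER_DAY * processors)


if __name__ == "__main__":
    # Illustrative runtimes for the three regimes of Table 1 (assumed values).
    regimes = [
        ("fraction of a second", 0.5),
        ("about a minute", 60.0),
        ("an hour", 3600.0),
    ]
    library = 300_000  # hypothetical "several hundred thousand compounds"
    for label, secs in regimes:
        per_day_pc = compounds_per_day(secs)
        per_day_parallel = compounds_per_day(secs, processors=100)
        days = days_to_screen(library, secs)
        print(f"{label:>22}: {per_day_pc:>7} per day on one CPU, "
              f"{per_day_parallel:>9} per day on 100 CPUs, "
              f"{days:9.1f} days for {library} compounds on one CPU")
```

Under these assumptions a one-minute method docks 1,440 compounds per day on a single processor, which matches the "thousands of compounds on a PC" regime over a few days, while an hour-long method manages only 24 per day.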
libraries (see, eg, Rarey and Lengauer [103]). Such libraries provide a carefully selected set of molecular building blocks together with a small set of chemical reactions that link the modules. In this way, a combinatorial library can theoretically provide a diversity of up to billions of molecules from a small set of reactants.

The accuracy of docking predictions lies within 50–80 per cent ‘correct’ predictions depending on the evaluation measure and the method. That means that docking methods are far from perfectly accurate. Nevertheless, they are very useful in pharmaceutical practice. The major benefit of docking is that a large drug library can be ranked with respect to the potential that its molecules have for being a useful lead compound for the target protein in question. The quality of a method in this context can be measured by an enrichment factor. Roughly, this is the number of active compounds (drugs that bind tightly to the protein) in a top fraction (say the top 1 per cent) of the ranked drug database divided by the same figure in the randomly arranged drug database. State-of-the-art docking methods in the middle regime (minutes per molecular pair), eg FlexX [104], achieve enrichment factors of up to about 15. Fast methods (seconds per pair), eg FeatureTrees [105], achieve similar enrichment factors, but deliver molecules similar to known binding ligands and do not detect as diverse a range of binding molecules.

Even if the structure of the protein binding site is not known, computer-based methods can be used to select promising lead compounds. Such methods compare the structure of a molecule with that of a ligand that is known to bind to the protein, for instance, its natural substrate.

Alternatives to docking for lead finding include high-throughput screening (HTS). This laboratory method allows for testing the binding affinity of more than several thousand compounds to the same target protein in a day. In comparison this method has the advantage that it does not have to deal with insufficiently powerful computer models, at the expense of high laboratory cost and the absence of structural knowledge on ‘why’ a compound binds to the protein.

CONCLUSION

In summary, the field is still in an early stage of development. Ab initio protein structure prediction continues to be a grand challenge for which no comprehensive solution is in sight. The quality of fold prediction based on homology is rising, and tools have reached the stage where one can generate confident predictions for soluble proteins that, in a substantial fraction (about half) of the cases, provide significant threading hits in the structure database. Protein threading and homology-based prediction become especially helpful in an environment where the methods can be used in concert with experimental techniques for structure and function determination. Here, the prediction methods can exercise their strengths, which lie in being used interactively by experts and making suggestions that can be followed up by succeeding experimentation, rather than being required to provide proven fact. The process of going from structure to function is far from being automated. In a scenario that combines structure prediction methods with experimentation, the step from structure to function can be performed in a customised manner.

Protein structure prediction by homology is definitely not yet a turn-key technology. But we can expect it to enter the ‘production’ stage through the activities in structural genomics. Still, the field of protein structure prediction is very busy, generating the tools and processes for raising the number of confident structure predictions and the accompanying estimates of significance. Problems for applying these results in drug design are not only that the models may not be sufficiently accurate but also that the structures of many interesting
target proteins will not be accessible by homology-based modelling at all for some time to come. This includes the therapeutically particularly interesting class of membrane proteins, for which essentially no structures have been resolved.

Docking is used frequently in structure-based drug design. To the authors’ knowledge, the first drug developed with structure-based techniques was the carbonic anhydrase inhibitor Dorzolamide. In the past few years structural considerations have begun to pervade the design of new drugs. A case in point is that of the neuraminidase inhibitors for influenza. Such studies mostly involve experimentally resolved protein structures. However, even models can serve to guide drug development. Based on the experimentally resolved structure of the membrane protein bacteriorhodopsin, several groups are attempting to model binding sites of G-protein coupled receptors that are believed to be structurally similar. Nevertheless, the authors are not aware of any instance where the whole process line from the protein sequence to the lead structure has been exercised in an integrated manner and with significant help of computer predictions. The field has not reached this level of maturity yet. While structural aspects – even as predicted by the computer – can be expected to invade the search for target proteins and the development of new drugs, experimental data, where they are accessible, will always be highly welcome and often be indispensable in this process.

Acknowledgements

We thank Matthias Rarey for helpful comments on this paper and Gerhard Barnickel and Gerhard Klebe for information on the state of drugs developed by structure-based techniques.

References

1. Altschul, S. F., Gish, W., Miller, W. et al. (1990), ‘Basic local alignment search tool’, J. Mol. Biol., Vol. 215(3), pp. 403–410. http://ncbi.nlm.nih.gov/BLAST/
2. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25(17), pp. 3389–3402. http://ncbi.nlm.nih.gov/blast/psiblast.cgi
3. Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000), ‘The COG database: a tool for genome-scale analysis of protein functions and evolution’, Nucleic Acids Res., Vol. 28(1), pp. 33–36.
4. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), ‘A genomic perspective on protein families’, Science, Vol. 278(5338), pp. 631–637.
5. Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000), ‘ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons’, Nucleic Acids Res., Vol. 28(1), pp. 267–269.
6. Bateman, A., Birney, E., Durbin, R. et al. (2000), ‘The Pfam protein families database’, Nucleic Acids Res., Vol. 28(1), pp. 263–266.
7. Schultz, J., Milpetz, F., Bork, P. and Ponting, C. P. (1998), ‘SMART, a simple modular architecture research tool: identification of signaling domains’, Proc. Natl Acad. Sci. USA, Vol. 95(11), pp. 5857–5864.
8. Schultz, J., Copley, R. R., Doerks, T. et al. (2000), ‘SMART: a web-based tool for the study of genetically mobile domains’, Nucleic Acids Res., Vol. 28(1), pp. 231–234.
9. Attwood, T. K., Croning, M. D., Flower, D. R. et al. (2000), ‘PRINTS-S: the database formerly known as PRINTS’, Nucleic Acids Res., Vol. 28(1), pp. 225–227.
10. Henikoff, S., Henikoff, J. G. and Pietrokovski, S. (1999), ‘Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations’, Bioinformatics, Vol. 15(6), pp. 471–479.
11. Yona, G., Linial, N. and Linial, M. (2000), ‘ProtoMap: automatic classification of protein sequences and hierarchy of protein families’, Nucleic Acids Res., Vol. 28(1), pp. 49–55.
12. Yona, G., Linial, N. and Linial, M. (1999), ‘ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space’, Proteins, Vol. 37(3), pp. 360–378.
13. http://www.ebi.ac.uk/interpro/
14. Rose, G. D. (1979), ‘Hierarchic organization of domains in globular proteins’, J. Mol. Biol., Vol. 134(3), pp. 447–470.
15. Nichols, W. L., Rose, G. D., Ten Eyck, L. F. and Zimm, B. H. (1995), ‘Rigid domains in proteins: an algorithmic approach to
their identification’, Proteins, Vol. 23(1), pp. 38–48.
16. Gracy, J. and Argos, P. (1998), ‘Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities’, Bioinformatics, Vol. 14(2), pp. 174–187.
17. Gracy, J. and Argos, P. (1998), ‘DOMO: a new database of aligned protein domains’, Trends Biochem. Sci., Vol. 23(12), pp. 495–497.
18. Sowdhamini, R., Rufino, S. D. and Blundell, T. L. (1996), ‘A database of globular protein structural domains: clustering of representative family members into similar folds’, Fold Des., Vol. 1(3), pp. 209–220.
19. Jones, S., Stewart, M., Michie, A. et al. (1998), ‘Domain assignment for protein structures using a consensus approach: characterization and analysis’, Protein Sci., Vol. 7(2), pp. 233–242.
20. Orengo, C. A., Martin, A. M., Hutchinson, G. et al. (1998), ‘Classifying a protein in the CATH database of domain structures’, Acta Crystallogr. D Biol. Crystallogr., Vol. 54(1(Pt 6)), pp. 1155–1167.
21. Murzin, A. G. (1996), ‘Structural classification of proteins: new superfamilies’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 386–394.
22. Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995), ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247(4), pp. 536–540.
23. Fischer, D. and Eisenberg, D. (1999), ‘Predicting structures for genome proteins’, Curr. Opin. Struct. Biol., Vol. 9(2), pp. 208–211.
24. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), ‘Homology-based fold predictions for Mycoplasma genitalium proteins’, J. Mol. Biol., Vol. 280(3), pp. 323–326.
25. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), ‘A combined algorithm for genome-wide prediction of protein function’, Nature, Vol. 402(6757), pp. 83–86.
26. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W. et al. (1999), ‘Detecting protein function and protein-protein interactions from genome sequences’, Science, Vol. 285(5428), pp. 751–753.
27. Pellegrini, M., Marcotte, E. M., Thompson, M. J. et al. (1999), ‘Assigning protein functions by comparative genome analysis: protein phylogenetic profiles’, Proc. Natl Acad. Sci. USA, Vol. 96(8), pp. 4285–4288.
28. Bork, P., Dandekar, T., Diaz-Lazcoz, Y. et al. (1998), ‘Predicting function: from genes to genomes and back’, J. Mol. Biol., Vol. 283(4), pp. 707–725.
29. Roemer, K., Johnson, P. A. and Friedmann, T. (1991), ‘Knock-in and knock-out: Transgenes, Development and Disease: A Keystone Symposium sponsored by Genentech and Immunex, Tamarron, CO, USA, January 12–18 1991’, New Biol., Vol. 3(4), pp. 331–335.
30. Sato, T. N. (1999), ‘Gene trap, gene knockout, gene knock-in, and transgenics in vascular development’, Thromb. Haemost., Vol. 82(2), pp. 865–869.
31. Collins, F. S., Guyer, M. S. and Chakravarti, A. (1997), ‘Variations on a theme: cataloging human DNA sequence variation’, Science, Vol. 278(5343), pp. 1580–1581.
32. Brookes, A. J. (1999), ‘The essence of SNPs’, Gene, Vol. 234(2), pp. 177–186.
33. Kuska, B. (1999), ‘Snipping “SNPs”: a new tool for mining gene variations’, J. Natl Cancer Inst., Vol. 91(13), p. 1110.
34. Vilain, E. (1998), ‘CYPs, SNPs, and molecular diagnosis in the postgenomic era’, Clin. Chem., Vol. 44(12), pp. 2403–2404.
35. Collins, F. S. (1999), ‘Shattuck lecture – medical and societal consequences of the Human Genome Project’, N. Engl. J. Med., Vol. 341(1), pp. 28–37.
36. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research II. Issues in study design and gene mapping’, Ann. Epidemiol., Vol. 9(2), pp. 75–90.
37. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research III. Bioinformatics and statistical genetic methods’, Ann. Epidemiol., Vol. 9(4), pp. 207–224.
38. Terwilliger, J. D. and Ott, J. (1994), ‘Handbook of Human Genetic Linkage’, Johns Hopkins University Press, Baltimore.
39. Drews, J. (1996), ‘Genomic sciences and the medicine of tomorrow’, Nat. Biotechnol., Vol. 14(11), pp. 1516–1518.
40. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), ‘Ab initio protein structure prediction of CASP III targets using ROSETTA’, Proteins, Vol. 37(S3), pp. 171–176.
41. Karchin, R. and Hughey, R. (1998), ‘Weighting hidden Markov models for maximum discrimination’, Bioinformatics, Vol. 14(9), pp. 772–782.
42. Bateman, A., Birney, E., Durbin, R. et al. (1999), ‘Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins’, Nucleic Acids Res., Vol. 27(1), pp. 260–262.
43. Park, J., Karplus, K., Barrett, C. et al. (1998), ‘Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods’, J. Mol. Biol., Vol. 284(4), pp. 1201–1210.
44. Barrett, C., Hughey, R. and Karplus, K. (1997), ‘Scoring hidden Markov models’, Comput. Appl. Biosci., Vol. 13(2), pp. 191–199.
45. McClure, M. A., Smith, C. and Elton, P. (1996), ‘Parameterization studies for the SAM and HMMER methods of hidden Markov model generation’, ‘Proc. 4th International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 155–164.
46. Eddy, S. R. (1998), ‘Profile hidden Markov models’, Bioinformatics, Vol. 14(9), pp. 755–763.
47. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998), ‘Pfam: multiple sequence alignments and HMM-profiles of protein domains’, Nucleic Acids Res., Vol. 26(1), pp. 320–322.
48. Eddy, S. R. (1996), ‘Hidden Markov models’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 361–365. (1995), ‘Proc. 3rd International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 114–120.
49. Bowie, J. U., Luthy, R. and Eisenberg, D. (1991), ‘A method to identify protein sequences that fold into a known three-dimensional structure’, Science, Vol. 253(5016), pp. 164–170.
50. Luthy, R., Bowie, J. U. and Eisenberg, D. (1992), ‘Assessment of protein models with three-dimensional profiles’, Nature, Vol. 356(6364), pp. 83–85.
51. Luthy, R., Xenarios, I. and Bucher, P. (1994), ‘Improving the sensitivity of the sequence profile method’, Protein Sci., Vol. 3(1), pp. 139–146.
52. Alexandrov, N. N., Nussinov, R. and Zimmer, R. M. (1996), ‘Fast protein fold recognition via sequence to structure alignment and contact capacity potentials’, Pacific Symposium on Biocomputing, pp. 53–72.
53. Sippl, M. J. (1990), ‘Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins’, J. Mol. Biol., Vol. 213(4), pp. 859–883.
54. Hendlich, M., Lackner, P., Weitckus, S. et al. (1990), ‘Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force’, J. Mol. Biol., Vol. 216(1), pp. 167–180.
55. Sippl, M. J. (1995), ‘Knowledge-based potentials for proteins’, Curr. Opin. Struct. Biol., Vol. 5(2), pp. 229–235.
56. Sippl, M. J. and Flockner, H. (1996), ‘Threading thrills and threats’, Structure, Vol. 4(1), pp. 15–19.
57. Lathrop, R. H. and Smith, T. F. (1996), ‘Global optimum protein threading with gapped alignment and empirical pair score functions’, J. Mol. Biol., Vol. 255(4), pp. 641–665.
58. Thiele, R., Zimmer, R. and Lengauer, T. (1999), ‘Protein threading by recursive dynamic programming’, J. Mol. Biol., Vol. 290(3), pp. 757–779.
59. Xu, Y., Xu, D. and Uberbacher, E. C. (1998), ‘An efficient computational method for globally optimal threading’, J. Comput. Biol., Vol. 5(3), pp. 597–614.
60. Sali, A. (1995), ‘Modeling mutations and homologous proteins’, Curr. Opin. Biotechnol., Vol. 6(4), pp. 437–451.
61. Sali, A., Potterton, L., Yuan, F. et al. (1995), ‘Evaluation of comparative protein modeling by MODELLER’, Proteins, Vol. 23(3), pp. 318–326.
62. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.
63. Sanchez, R. and Sali, A. (1998), ‘Large-scale protein structure modeling of the Saccharomyces cerevisiae genome’, Proc. Natl Acad. Sci. USA, Vol. 95(23), pp. 13597–13602.
64. Sanchez, R. and Sali, A. (1997), ‘Evaluation of comparative protein structure modeling by MODELLER-3’, Proteins, Suppl 1, pp. 50–58.
65. Sanchez, R., Pieper, U., Mirkovic, N. et al. (2000), ‘MODBASE, a database of annotated comparative protein structure models’, Nucleic Acids Res., Vol. 28(1), pp. 250–253.
66. Guex, N., Diemand, A. and Peitsch, M. C. (1999), ‘Protein modelling for all’, Trends Biochem. Sci., Vol. 24(9), pp. 364–367.
67. Guex, N. and Peitsch, M. C. (1997), ‘SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling’, Electrophoresis, Vol. 18(15), pp. 2714–2723.
68. Petrella, R. J., Lazaridis, T. and Karplus, M. (1998), ‘Protein sidechain conformer
prediction: a test of the energy function’, Fold Des., Vol. 3(5), pp. 353–377.
69. Karplus, M. and Petsko, G. A. (1990), ‘Molecular dynamics simulations in biology’, Nature, Vol. 347(6294), pp. 631–639.
70. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D. et al. (1983), ‘CHARMM: A program for macromolecular energy, minimization, and dynamics calculation’, J. Comp. Chem., Vol. 4, pp. 187–213.
71. Van Gunsteren, W. F. and Berendsen, H. J. (1982), ‘Molecular dynamics: perspective for complex systems’, Biochem. Soc. Trans., Vol. 10(5), pp. 301–305.
72. Van Gunsteren, W. F. and Berendsen, H. J. (1990), ‘Moleküldynamik-Computersimulationen: Methodik, Anwendungen und Perspektiven in der Chemie’, Angew. Chem., Vol. 102, pp. 1020–1055.
73. Levitt, M. (1983), ‘Protein folding by restrained energy minimization and molecular dynamics’, J. Mol. Biol., Vol. 170(3), pp. 723–764.
74. Novotny, J., Bruccoleri, R. and Karplus, M. (1984), ‘An analysis of incorrectly folded protein models. Implications for structure predictions’, J. Mol. Biol., Vol. 177(4), pp. 787–818.
75. van Vlijmen, H. W. and Karplus, M. (1997), ‘PDB-based protein loop prediction: parameters for selection and methods for optimization’, J. Mol. Biol., Vol. 267(4), pp. 975–1001.
76. Lessel, U. and Schomburg, D. (1997), ‘Creation and characterization of a new, non-redundant fragment data bank’, Protein Eng., Vol. 10(6), pp. 659–664.
77. Lessel, U. and Schomburg, D. (1999), ‘Importance of anchor group positioning in protein loop prediction’, Proteins, Vol. 37(1), pp. 56–64.
78. Fechteler, T., Dengler, U. and Schomburg, D. (1995), ‘Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria’, J. Mol. Biol., Vol. 253(1), pp. 114–131.
79. Lo Conte, L., Ailey, B., Hubbard, T. J. et al. (2000), ‘SCOP: a structural classification of proteins database’, Nucleic Acids Res., Vol. 28(1), pp. 257–259.
80. Orengo, C. A., Michie, A. D., Jones, S. et al. (1997), ‘CATH – a hierarchic classification of protein domain structures’, Structure, Vol. 5(8), pp. 1093–1108.
81. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘Measures of threading specificity and accuracy’, Proteins, Suppl 1, pp. 74–82.
82. Marchler-Bauer, A., Levitt, M. and Bryant, S. H. (1997), ‘A retrospective analysis of CASP2 threading predictions’, Proteins, Suppl 1, pp. 83–91.
83. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘A measure of success in fold recognition’, Trends Biochem. Sci., Vol. 22(7), pp. 236–240.
84. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
85. Holm, L. and Sander, C. (1998), ‘Dictionary of recurrent domains in protein structures’, Proteins, Vol. 33(1), pp. 88–96.
86. Holm, L. and Sander, C. (1998), ‘Touring protein fold space with Dali/FSSP’, Nucleic Acids Res., Vol. 26(1), pp. 316–319.
87. Orengo, C. A. and Taylor, W. R. (1996), ‘SSAP: sequential structure alignment program for protein structure comparison’, Methods Enzymol., Vol. 266, pp. 617–635.
88. Gibrat, J. F., Madej, T. and Bryant, S. H. (1996), ‘Surprising similarities in structure comparison’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 377–385.
89. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
90. Alexandrov, N. N. (1996), ‘SARFing the PDB’, Protein Eng., Vol. 9(9), pp. 727–732.
91. Lattman, E. E. (ed.) (1999), ‘Third Meeting on the Critical Assessment of Techniques for Protein Structure Prediction’, Proteins, Vol. 37, Suppl. 3.
92. Kolinski, A., Rotkiewicz, P., Ilkowski, B. and Skolnick, J. (1999), ‘A method for the improvement of threading-based protein models’, Proteins, Vol. 37(4), pp. 592–610.
93. Zimmer, R. and Thiele, R. (1997), ‘Fast protein fold recognition and accurate sequence–structure alignment’, in ‘German Conference on Bioinformatics, GCB ’96’, Hofestädt, R., Lengauer, T., Löffler, M. and Schomburg, D. Eds, Springer, Berlin, pp. 137–148.
94. Kim, S. H. (1998), ‘Shining a light on structural genomics’, Nat. Struct. Biol., Vol. 5 Suppl., pp. 643–645.
95. Montelione, G. T. and Anderson, S. (1999), ‘Structural genomics: keystone for a Human Proteome Project’, Nat. Struct. Biol., Vol. 6(1), pp. 11–12.
96. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.