The Protein Model Portal

The Protein Model Portal (PMP) is part of the Protein Structure Initiative Knowledgebase (PSI KB). The goal of the portal is to give unified access to the various models that can be leveraged from PSI targets and other experimental protein structures. Models are provided by the PSI centers (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM), and by independent modeling groups.

Table of contents

Model Types
Target-Template Alignments
Model Quality
1) Model quality and applications of models
2) Determinants of model accuracy
a) Sequence identity between target and template of known structure
b) Actuality of template selection
c) Variability among available templates
3) Validation of protein models
a) By analysing the template situation
b) By analysing the model ensemble
c) By employing scoring functions to analyse the geometry of protein structures
References
Structure Comparison
Sequence Annotation
Model Preview Images & Visualization
PSI Partner Sites
Contact

Model Types

Homology (or comparative) modeling methods make use of experimental protein structures to build models for evolutionary related proteins. Experimental structural biology and homology modeling thereby complement each other in the exploration of the protein structure space. For every structure determined, hundreds of models can be derived using a variety of established methods. Sequence-Centric Models (SC) are generated by searching the best available template structures to build a model for a given protein target sequence, while Template-Centric Models (TC) result from using a specific solved structure as a template to build a number of models for a series of target protein sequences.

Target-Template Alignments

The target-template alignments provided on the model info pages are generated dynamically by structural superposition of the model and template structures using MAMMOTH (Lupyan D., Leo-Macias A., Ortiz AR. (2005) Bioinformatics, 21, 3255-3263).

Model Quality

Protein structure models are theoretical models which may contain large errors and need to be treated with caution. Their quality must therefore be analysed carefully.

1) Model quality and applications of models
Generally, protein structure models can support the design of experiments and may help explain experimental observations, but they have only limited predictive value. The quality of a model determines its suitability for a particular application (see schematic representation below).

Applications of protein models have been extensively discussed in a recent workshop on "Applications of Protein Models in Biomedical Research" at the UCSF in July 2008. Examples of applications of computational models with a significant impact on various areas of life science research are described in the workshop White Paper (Schwede et al. 2009).



Schematic representation of possible sources of errors in modeling and important areas of applications of theoretical protein structure models.


Knowledge of the expected accuracy of a protein structure model is of crucial importance for a biologist intending to use the model. The importance of quality estimation in modeling has been underlined in the literature (Hooft et al. 1996, Martí-Renom et al. 2000, Fischer 2006, Cozzetto et al. 2007, Cozzetto et al. 2009, Kryshtafovych&Fidelis 2009, Bordoli et al. 2009, Schwede et al. 2009). There are basically two sources of information supporting the estimation of the accuracy of homology models.

The first source is the availability of structural knowledge which is primarily determined by the evolutionary distance between the query protein and template proteins of known structure. This is based on the observation that there is a direct correlation between sequence identity of a pair of proteins and the structural similarity of their common core (Chothia&Lesk 1986, Rost 1999). Section 2 describes in detail several 'determinants of model accuracy' given the structural information of known templates.

The second source of information comes from the analysis of the geometry of the model. Especially when the sequence identity is low, individual models may vary considerably from the expected average quality due to various sources of errors in modeling (see scheme above) and inaccuracies introduced by the modeling programs. It is therefore necessary to independently check the geometric plausibility and the 'energy' of the model. For this purpose, scoring (or energy) functions have been developed. They are described in the last section, together with a short description of the quality estimation servers integrated in PMP.


2) Determinants of model accuracy
a) Sequence identity between target and template of known structure

The sequence identity between the target protein and template of known structure is commonly seen as a first indicator for the expected accuracy of a model, as confirmed by various studies (Chothia&Lesk 1986, Rost 1999, Baker&Sali 2001, Koh et al. 2003).

Based on the sequence identity to the template we assign a model to one of three categories of modeling complexity (see traffic light symbol). The classification roughly agrees with the one introduced by Rost (Rost 1999) who defines three zones of sequence similarity: midnight zone (zone A, red), twilight zone (zone B, yellow), safe zone (zone C, green). A rough description of the three zones is given below followed by a more detailed explanation of possible next steps.

The coloring also roughly corresponds to the one chosen in the schematic representation above, which makes it possible to relate modeling difficulty to different areas of application in life science research.



Schematic representation of the 3 zones of sequence/structure similarity. Within PMP, the location of the query model is represented by a red vertical line assigning it to one of the three categories described below.


A: In models based on a target-template alignment with less than 30% sequence identity, substantial alignment errors and suboptimal template selection are frequently observed (Rost 1999, Martí-Renom et al. 2000). Careful validation of the quality of these models is strongly advised.

B: In models based on a target-template alignment with between 30% and 50% sequence identity, alignment errors in non-conserved segments of the target protein, structural variation in templates, and incorrect reconstruction of loops (insertions and deletions) are frequent sources of model inaccuracies (Martí-Renom et al. 2000, Fiser et al. 2000). Careful validation of the model quality and of the variability among template structures is advised.

C: Models based on a target-template alignment with more than 50% sequence identity typically have the correct fold, and the alignments tend to be mainly correct. Structural variation in templates and incorrect reconstruction of loops (insertions and deletions) are the main sources of model inaccuracies (Fiser et al. 2000, Zhang 2009). Validation of the model quality and analysis of the variability among template structures is advised.
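
The three-zone assignment described above can be sketched in a few lines of Python. This is only an illustration, not the PMP implementation: the function name `identity_zone` is hypothetical, and the handling of the exact boundary values (30% and 50% are both placed in zone B here) is an assumption, since the text does not say which zone the boundaries belong to.

```python
def identity_zone(seq_identity_percent: float) -> str:
    """Map a target-template sequence identity (in percent) to zone A, B or C.

    Thresholds (30% and 50%) follow the text; assigning the boundary
    values themselves to zone B is an assumption made for this sketch.
    """
    if not 0.0 <= seq_identity_percent <= 100.0:
        raise ValueError("sequence identity must be between 0 and 100 percent")
    if seq_identity_percent < 30.0:
        return "A"  # midnight zone (red): strongly advised to validate carefully
    if seq_identity_percent <= 50.0:
        return "B"  # twilight zone (yellow)
    return "C"      # safe zone (green)


if __name__ == "__main__":
    for identity in (18.0, 42.5, 76.0):
        print(identity, identity_zone(identity))
```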

b) Actuality of template selection
The Protein Model Portal provides access to several modeling repositories. These repositories contain models based on the best template available at the time of model building. It should therefore always be checked whether a newer template with a considerably higher sequence identity to the query protein has since become available in the PDB. The model creation date and the date of the latest verification of the template selection ('Template verification') are provided in the model detail section. Models older than 3 months are clearly highlighted as outdated, so that models potentially based on lower-quality templates can be identified easily. The sequence of the query protein can be sent to several modeling servers via 'modeling services' in the PMP navigation menu.

Example: Before the experimental structures of the β1 and β2 adrenergic receptors and of the A2A adenosine receptor were released (2007-2008), GPCR models were built on the Rhodopsin structure (and earlier on Bacteriorhodopsin), which differs significantly. Old models based on Rhodopsin should therefore no longer be used. The use of the best available template structure has a direct effect on the outcome of subsequent experiments based on the model, such as structure-based drug design (i.e. ligand docking and virtual screening).
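
As a rough illustration of the "older than 3 months" rule, the check could look like the following Python sketch. The function name `is_outdated` is hypothetical, "3 months" is approximated as 90 days, and whether PMP measures the age from the model creation date or from the last template verification date is not stated in the text, so the caller passes whichever date applies.

```python
from datetime import date, timedelta

# Assumption: "3 months" is approximated as 90 days for this sketch.
MAX_AGE = timedelta(days=90)


def is_outdated(reference_date: date, today: date) -> bool:
    """Return True if the reference date lies more than MAX_AGE before today.

    reference_date is the model creation date or the date of the last
    template verification (which one PMP uses is not specified here).
    """
    return today - reference_date > MAX_AGE


if __name__ == "__main__":
    print(is_outdated(date(2010, 1, 15), date(2010, 6, 1)))  # old model
    print(is_outdated(date(2010, 5, 1), date(2010, 6, 1)))   # recent model
```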

c) Variability among available templates
In homology modeling, several evolutionarily related proteins with known experimental structure are often detected for a given query protein of interest. Depending on the protein family, these templates may be structurally quite similar or vary considerably. Usually, some regions in the core of the templates agree more closely (the 'structural core') and some parts, mainly protein surface loops, are less similar (the 'structurally variable regions'). The structural core, which also tends to be more conserved in sequence, serves as a template for structural extrapolation. The parts of the model which are directly inherited from the template(s) are generally more accurate than the remaining regions, which need to be predicted from scratch.

Structural variations among templates can have several reasons, such as differences in experimental conditions or the presence or absence of ligands/co-factors, but also evolutionary reasons. The variations may be characteristic of the family and a sign of flexibility or disorder. There are many examples of proteins which are largely disordered and whose function can only be explained by taking into account the absence of a well-defined three-dimensional structure (see e.g. Dunker et al. 2002, Pentony et al. 2010).


Example: Adenylate kinases catalyze the interconversion of adenine nucleotides. They undergo large conformational changes from the open form (PDB id 4ake, depicted in grey) to the enzymatically active closed conformation in the presence of the ligand (1ake, structure colored according to the local deviation from the open form). In homology modeling, template selection in this case would have a strong effect on the explanatory value of the resulting model and its applicability for subsequent experiments.

3) Validation of protein models
This section provides a guide to a stepwise analysis of a protein model, in order to obtain a first estimate of its quality and, as a consequence, of its suitability for specific experiments.
How can we predict the quality of a model without knowing the correct answer?
a) By analysing the template situation:
Is the model based on the best available template?

  • check up-to-dateness of template selection -> 'verification date'
  • sequence identity correlates with modeling difficulty
  • check the resolution of the experimental structure
  • check the experimental conditions and the environment (e.g. solved with or without ligand)

The analysis of variability among templates:

  • regions not differing between various templates (i.e. the structural core) can be inherited directly and are therefore modelled potentially more accurately than structurally variable regions (e.g. surface loops)
  • Where is the structural variability located?
  • Are flexible loops part of the active site?
  • Are there shifts/distortions in the core of the protein (e.g. among secondary structure elements)? This would indicate a difficult modeling case with lower expected model accuracy.
  • Variation may be a sign of flexibility in the protein family, or there may even be disordered regions (i.e. regions not resolved in many templates). This flexibility may be needed for protein function (use of disorder prediction tools may help in this situation, see Pentony et al. 2010).

b) By analysing the model ensemble:
The variability among the models of a given protein predicted by different programs/servers may to a large extent be explained by the variation in the templates, but the model ensemble also contains additional information:

  • A strong consensus among models from various servers is a good sign for the correctness of a model, since the probability that many modeling resources all predict the same feature incorrectly is much lower than the probability that they all predict it correctly.
  • On the query result page of PMP, the structure comparison tool can be used to compare any subset of models and analyse the variability among them. See also the section 'Structure Comparison' below.



c) By employing scoring functions to analyse the geometry of protein structures

Scientific background:
Errors in models tend to increase with decreasing sequence identity to available templates (see schematic representation above). At the same time, inaccuracies introduced by the modeling programs increase as well, which makes it necessary to independently check the geometry (or 'energy') of the models. Several methods and scoring functions have been described in the literature which analyse different aspects of proteins and investigate both the global quality of the entire model and local aspects.

In the 1990s, tools analysing the stereo-chemical plausibility of protein structures emerged (Laskowski et al. 1993, Hooft et al. 1996). Deviations from ideal stereo-chemical values are reported by programs such as ProCheck and WhatCheck, which are still widely used, especially in the field of experimental structure determination. They can also help identify 'suspicious geometries' in models.

Another category of methods investigates the compatibility of individual amino acids or the entire sequence (i.e. threading) with the structural environment described by the model (Luthy et al. 1992, Jones et al. 1992).

The most extensively used methods for assessing protein models are scoring functions based on statistical potentials or potentials of mean force (PMFs) (Sippl 1990). Statistical potentials are usually formalised as distance-dependent non-bonded interaction potentials (Melo&Feytmans 1998, Samudrala&Moult 1998, Zhou&Zhou 2002, Shen&Sali 2006), but other structural features are also used, such as torsion angles, contacts, residue burial, hydrogen bonds, etc. Combining different geometrical features in a composite scoring function has been shown to further improve the performance of these methods in identifying good models (Wallner&Elofsson 2003, Pettitt et al. 2005, Tosatto 2005, Benkert et al. 2008, McGuffin 2008, Randall&Baldi 2008).
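
To make the idea of a composite scoring function concrete, the following Python sketch combines several per-model terms as a weighted sum. This is only one simple way of combining terms and is not taken from any of the cited methods (some of them combine scores differently, for example with a neural network). The function name `composite_score`, the term names and the weights are all hypothetical placeholders.

```python
from typing import Mapping


def composite_score(terms: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine individual scoring terms into one value as a weighted sum.

    terms   -- hypothetical per-model scores, e.g. {"torsion": ..., "solvation": ...}
    weights -- hypothetical weights; every weighted term must be present in `terms`
    """
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing scoring terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


if __name__ == "__main__":
    # Illustrative numbers only; they do not come from any real scoring function.
    example_terms = {"torsion": 1.0, "solvation": 2.0}
    example_weights = {"torsion": 0.5, "solvation": 0.25}
    print(composite_score(example_terms, example_weights))
```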

Stepwise analysis:

  • The analysis of the geometric plausibility of a model helps identify unusual geometries resulting from modeling errors or inaccuracies of the modeling program.
  • If the target-template sequence identity is very low, models may even have the wrong fold:
    • There are a few methods available which make it possible to estimate whether a structure has the correct fold by delivering a Z-score or expectation value relating the model energy to a random (or other) background distribution
  • If multiple models with alternative conformations are available, ranking the model ensemble with a scoring function that analyses different geometrical aspects of protein structures may help identify more reliable candidates.
    • In PMP, each single model can be sent to several model quality estimation servers covering several scoring functions (see below)
    • The possibility to send multiple models from the PMP overview page will be provided soon
  • For the analysis of the quality of a single model, local scoring functions can be used to try to locate regions that potentially deviate more strongly. The local estimation of model quality is still an active field of research.
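
The Z-score mentioned in the list above relates a model's energy to a background distribution. A minimal Python sketch of that calculation is given below; it is not the procedure of any particular server. The function name `z_score` is hypothetical, and the use of the population standard deviation of the background sample is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(model_energy: float, background_energies: Sequence[float]) -> float:
    """Number of standard deviations the model energy lies from the background mean.

    With energy-like scores (lower is better), a strongly negative value
    means the model scores much better than the background structures.
    """
    mu = mean(background_energies)
    sigma = pstdev(background_energies)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("background distribution has zero variance")
    return (model_energy - mu) / sigma


if __name__ == "__main__":
    # Illustrative numbers only.
    print(z_score(-1.0, [0.0, 2.0]))
```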

Scoring functions available in PMP:
From the model detail page in PMP, a model can be sent to several scoring functions for model quality estimation. Currently, the following four state-of-the-art model quality estimation servers are accessible. The model coordinates are sent to the server and the user receives the answer by e-mail.

ModFOLD
ModFOLD (McGuffin 2008) is a Model Quality Assessment Program (MQAP) used for the global and local assessment of models. The original ModFOLD method is a combination of the ModSSEA method (McGuffin, 2007), MODCHECK (Pettitt et al., 2005) and two scores provided by ProQ (Wallner and Elofsson, 2003). The scores are combined using a neural network.

QMEAN server
QMEAN (Benkert et al. 2008) is a composite scoring function for the quality estimation of protein structure models. QMEAN consists of six structural descriptors. Four of them are statistical potentials analyzing torsion angles, solvation and non-bonded interactions. The other two terms reflect the agreement between predicted and calculated secondary structure and solvent accessibility.


References:

  • Schwede, T., Sali, A. et al. (2009). 'Outcome of a workshop on applications of protein models in biomedical research.' Structure 17(2):151-9.
  • Chothia, C. and Lesk, A.M. (1986). 'The relation between the divergence of sequence and structure in proteins.' EMBO J. 5(4):823-6.
  • Rost, B. (1999). 'Twilight zone of protein sequence alignments.' Prot Eng 12: 85-94.
  • Baker, D. and Sali, A. (2001). 'Protein structure prediction and structural genomics.' Science 294(5540):93-6.
  • Hooft, R. W., G. Vriend, et al. (1996). 'Errors in protein structures.' Nature 381(6580): 272.
  • Martí-Renom, M.A., Stuart, A.C. et al. (2000). 'Comparative protein structure modeling of genes and genomes.' Annu Rev Biophys Biomol Struct. 29:291-325. Review.
  • Fischer D. (2006). 'Servers for protein structure prediction.' Curr Opin Struct Biol. 16(2):178-82. Review.
  • Cozzetto, D., Kryshtafovych, A., Ceriani, M. and Tramontano, A. (2007). 'Assessment of predictions in the model quality assessment category.' Proteins 69 Suppl 8:175-83.
  • Cozzetto, D., Kryshtafovych, A. and Tramontano, A. (2009). 'Evaluation of CASP8 model quality predictions.' Proteins 77 Suppl 9:157-66.
  • Kryshtafovych, A. and Fidelis, K. (2009). 'Protein structure prediction and model quality assessment.' Drug Discov Today 14(7-8):386-93.
  • Bordoli, L., Kiefer, F., Arnold, K., Benkert, P., Battey, J. and Schwede, T. (2009). 'Protein structure homology modeling using SWISS-MODEL workspace.' Nat Protoc. 4(1):1-13.
  • Sippl, M.J. (1990). 'Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins.' J Mol Biol 213: 859-883.
  • Luthy, R., Bowie, J.U. and Eisenberg, D. (1992). 'Assessment of protein models with three-dimensional profiles.' Nature 356 (6364): 83-85.
  • Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992). 'A new approach to protein fold recognition.' Nature 358: 86-89.
  • Laskowski, R.A., MacArthur, M.W., Moss, D.S. and Thornton, J.M. (1993). 'PROCHECK: A program to check the stereochemical quality of protein structures.' J. Appl. Cryst. 26: 283-291.
  • Melo, F. and Feytmans, E. (1998). 'Assessing protein structures with a non-local atomic interaction energy.' J Mol Biol. 277(5):1141-52.
  • Samudrala, R. and Moult, J. (1998). 'An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction.' J Mol Biol. 275(5):895-916.
  • Zhou, H. and Zhou, Y. (2002). 'Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction.' Protein Sci. 11:2714-2726.
  • Shen, M.-Y. and Sali, A. (2006). 'Statistical potential for assessment and prediction of protein structures.' Protein Sci 15: 2507-2524.
  • Wallner, B. and Elofsson, A. (2003). 'Can correct protein models be identified?' Protein Sci 12: 1073-1086.
  • Pettitt, C.S., McGuffin, L.J. and Jones, D.T. (2005). 'Improving sequence-based fold recognition by using 3D model quality assessment.' Bioinformatics 21: 3509-3515.
  • Tosatto, S.C.E. (2005). 'The victor/FRST function for model quality estimation.' J Comput Biol 12: 1316-1327.
  • Benkert, P., Tosatto, S.C. and Schomburg, D. (2008). 'QMEAN: A comprehensive scoring function for model quality assessment.' Proteins 71:261-77.
  • McGuffin, L.J. (2008). 'The ModFOLD server for the quality assessment of protein structural models.' Bioinformatics 24:586-7.
  • Randall, A. and Baldi, P. (2008). 'SELECTpro: effective protein model selection using a structure-based energy function resistant to BLUNDERs.' BMC structural biology 8:52.
  • Koh, I. Y., V. A. Eyrich, et al. (2003). 'EVA: Evaluation of protein structure prediction servers.' Nucleic Acids Res 31(13): 3311-3315.
  • Fiser, A., Do, R.K. and Sali, A. (2000). 'Modeling of loops in protein structures.' Protein Science 9(9):1753-73.
  • Zhang, Y. (2009). 'Protein structure prediction: when is it useful?' Curr Opin Struct Biol. 19: 145-155.
  • Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M. and Obradović, Z. (2002). 'Intrinsic disorder and protein function.' Biochemistry 41(21):6573-82.
  • Pentony, M.M., Ward, J. and Jones, D.T. (2010). 'Computational resources for the prediction and analysis of native disorder in proteins.' Methods Mol Biol. 604:369-93.

Structure Comparison

The variability among the models (the term 'model' applies to both homology models and experimental structures) of a given protein predicted by different programs/servers may to a large extent be explained by the variation in the templates, but the model ensemble also contains additional information. For example, a strong consensus among models from various servers is a good sign for the correctness of a model, since the probability that many modeling resources all predict the same feature incorrectly is much lower than the probability that they all predict it correctly. On the model overview page of PMP, the structure comparison tool can be used to compare any subset of models and analyse the variability among them. (Please note that the range information about experimental structures is based on SEQRES, which may differ from the actual sequence obtained from the crystal structure.)

For each of the N models, an internal Cα-based distance matrix D (all against all) is calculated. Each column n of such a matrix holds the distances from the Cα atom i of the respective residue to all other Cα atoms j; for a protein of length L, there are L such columns. Subsequently, across all N models, the standard deviation of each of the corresponding columns of D (1,...,N) is determined and stored in a single matrix M1. The standard deviation is calculated using the following formula:

[Formula image: standard deviation]

The Euclidean distance from the mean is weighted with an exponential term analogous to the Holm/Sander approach, which scales down the influence of long-range distances*. The results (stored in the matrix M1) are shown in Figure 3 on the "Structure Comparison Results" page.
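
The calculation of M1 described above can be sketched in Python as follows. This is a hedged illustration, not the PMP code: the exact formula is not reproduced in this text, so the weighting used here, exp(-(mean distance / 20 Å)^2) in the style of the Holm/Sander envelope, and the use of the population standard deviation are assumptions. The names `distance_matrix`, `weighted_std_matrix` and `alpha` are hypothetical, and all models are assumed to have the same number of residues in the same order.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_matrix(ca_coords: Sequence[Coord]) -> List[List[float]]:
    """All-against-all distances between the C-alpha atoms of one model."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def weighted_std_matrix(models: Sequence[Sequence[Coord]], alpha: float = 20.0) -> List[List[float]]:
    """Per residue pair: standard deviation of the distance over all models,
    scaled down for long-range pairs (assumed weight: exp(-(mean/alpha)^2))."""
    if not models or len({len(m) for m in models}) != 1:
        raise ValueError("need at least one model, all with the same length")
    mats = [distance_matrix(m) for m in models]
    n_models, length = len(mats), len(mats[0])
    m1 = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            values = [d[i][j] for d in mats]
            mean_d = sum(values) / n_models
            variance = sum((v - mean_d) ** 2 for v in values) / n_models
            weight = math.exp(-((mean_d / alpha) ** 2))  # assumption, see lead-in
            m1[i][j] = weight * math.sqrt(variance)
    return m1


if __name__ == "__main__":
    # Two toy "models" with two residues each (coordinates in Angstrom).
    model_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    model_b = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
    print(weighted_std_matrix([model_a, model_b]))
```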



To identify regions of individual models deviating from the ensemble of models a second (per residue) plot is generated: As before, for each of the corresponding columns n of each of the D matrices, an average column containing the mean is calculated and stored in a single matrix M2. Subsequently for each model N the differences between the values stored in a column n of D (representing the Cα atom distances) and the mean values stored in the corresponding column in M2 are calculated and averaged over the number of residues (res) in the model:

[Formula image: distance from mean]

This value is used to generate the per-residue plot for each of the N models, shown in Figure 1 on the "Structure Comparison Results" page.
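
The per-residue profile described above can be sketched in Python as follows. Again this is an illustration under stated assumptions rather than the PMP implementation: the original formula is not reproduced in this text, so the use of the plain absolute difference (without any distance weighting) is an assumption, and the names `distance_matrix` and `deviation_from_mean` are hypothetical. All models are assumed to have the same number of residues.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_matrix(ca_coords: Sequence[Coord]) -> List[List[float]]:
    """All-against-all distances between the C-alpha atoms of one model."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def deviation_from_mean(models: Sequence[Sequence[Coord]]) -> List[List[float]]:
    """For each model, one value per residue: the deviation of that residue's
    distances from the ensemble mean (matrix M2), averaged over all residues."""
    if not models or len({len(m) for m in models}) != 1:
        raise ValueError("need at least one model, all with the same length")
    mats = [distance_matrix(m) for m in models]
    n_models, length = len(mats), len(mats[0])
    # M2: mean distance over all models for every residue pair
    m2 = [[sum(d[i][j] for d in mats) / n_models for j in range(length)]
          for i in range(length)]
    # Assumption: plain absolute differences, averaged over the residues
    return [[sum(abs(d[i][j] - m2[i][j]) for j in range(length)) / length
             for i in range(length)]
            for d in mats]


if __name__ == "__main__":
    model_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    model_b = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
    for profile in deviation_from_mean([model_a, model_b]):
        print(profile)
```

Residues with high values in one model but low values in the others point to regions where that model departs from the rest of the ensemble.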


*L. Holm & C. Sander, J. Mol. Biol. (1993) 233, 123-138. Protein Structure Comparison by Alignment of Distance Matrices.

Sequence Annotation

Annotation of the target model sequences is retrieved from UniProt using the REST interface (Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S. Nucleic Acids Res. (2005) 33, D154-159). The Pfam domain structure of the model target sequence is annotated using the InterPro Distributed Annotation System (R.D. Finn, J. Mistry, B. Schuster-Böckler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S.R. Eddy, E.L.L. Sonnhammer and A. Bateman, Nucleic Acids Research (2006) 34:D247-D251).

Model Preview Images & Visualization

The model preview images on the model info pages are generated dynamically using Molscript (Per J. Kraulis, Journal of Applied Crystallography (1991) 24, 946-950) and Raster3D (E.A. Merritt & M.E.P. Murphy, Acta Cryst. (1994) D50, 869-873).

Interactive in-line visualization is provided using Jmol (Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/).

PSI Partner Sites

Models and interactive tools made accessible by the Protein Model Portal are provided by the following partners:

  • CSMP - Center for Structures of Membrane Proteins
  • JCSG - Joint Center for Structural Genomics
  • MCSG - Midwest Center for Structural Genomics
  • NESG - Northeast Structural Genomics Consortium
  • NMHRCM - New Methods for High-Resolution Comparative Modeling
  • NYSGXRC - New York SGX Research Center for Structural Genomics
  • JCMM - Joint Center for Molecular Modeling
  • ModBase and ModPipe - UCSF University of California, San Francisco
  • SWISS-MODEL - SIB Swiss Institute of Bioinformatics & Biozentrum University of Basel

Contact

PMP is developed by the Computational Structural Biology Group at the Swiss Institute of Bioinformatics (SIB) and the Biozentrum of the University of Basel.