I.PolyPhen Overview
Most of human genetic variation is represented by SNPs (Single Nucleotide Polymorphisms) and many of them are believed to cause phenotypic differences between human individuals.

We specifically focus on non-synonymous SNPs (nsSNPs), i.e. SNPs located in coding regions and resulting in amino acid variation in the protein products of genes. Several recent studies have shown that the impact of amino acid allelic variants on protein structure and function can be reliably predicted via analysis of multiple sequence alignments and protein 3D structures. As we demonstrated in earlier work, these predictions correlate with the effect of natural selection seen as an excess of rare alleles. Therefore, predictions at the molecular level reveal SNPs affecting actual phenotypes.

PolyPhen (=Polymorphism Phenotyping) is an automatic tool for predicting the possible impact of an amino acid substitution on the structure and function of a human protein. The prediction is based on straightforward empirical rules applied to the sequence, phylogenetic and structural information characterizing the substitution.

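For a given amino acid substitution in a human protein, PolyPhen performs several steps:
  • Sequence-based characterisation of the substitution site
  • Calculation of PSIC profile scores for two amino acid variants
  • Calculation of structural parameters and contacts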

See also:

References

  I.1.Sequence-based characterisation of the substitution site
A substitution may occur at a specific site, e.g., active or binding, or in a non-globular, e.g., transmembrane, region. PolyPhen tries to identify the query protein as an entry in hs_swall, the human protein subset of the UniProt database, and uses the FT (feature table) section of the corresponding entry. Information on FT records can be found in the Swiss-Prot Help. PolyPhen checks whether the amino acid replacement occurs at a site which is annotated in hs_swall as
  • DISULFID, THIOLEST, THIOETH bond or
  • BINDING, ACT_SITE, LIPID, METAL, SITE, MOD_RES, CARBOHYD, SE_CYS site
At this step PolyPhen memorises all positions which are annotated in the query protein as BINDING, ACT_SITE, LIPID, and METAL. At a later stage, if the search for a homologous protein with known 3D structure is successful, PolyPhen checks whether the substitution site is in spatial contact with these residues, which are critical for protein function.

PolyPhen also checks if the substitution site is located in the region annotated as

  • TRANSMEM, SIGNAL, PROPEP
For hs_swall, PolyPhen augments the annotation data with predictions: the TMHMM algorithm for transmembrane regions, the Coils2 program for coiled-coil regions, and the SignalP program for signal peptide regions of the protein sequences.

For a substitution in an annotated or predicted transmembrane region, PolyPhen uses the PHAT transmembrane-specific matrix score to evaluate the possible functional effect of an nsSNP in the transmembrane region.

See also:

UniProt

Swiss-Prot Help

TMHMM

Coils

SignalP

PHAT

  I.2.Calculation of PSIC profile scores for two amino acid variants
The amino acid replacement may be incompatible with the spectrum of substitutions observed at the position in the family of homologous proteins. PolyPhen identifies homologues of the input sequences via BLAST search in the nrdb database. The search is performed with `-F T' (enabled filtration), `-e 1e-4' (E-value set to 1e-4), and `-b 500 -v 500' (500 hits shown) options.

After the BLAST search, the set of hits is filtered to retain hits that have
(a) sequence identity to the input sequence in the range 30-94%, inclusive, and
(b) length of alignment with the query sequence not smaller than 50.
Sequence identity is defined as the number of matches divided by the complete alignment length.
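
The two filtering criteria can be sketched as follows. This is a minimal illustration rather than PolyPhen's own code: the `Hit` class and its field names are hypothetical, and the alignment length is assumed to include gap columns.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit; the field names are illustrative, not PolyPhen's own."""
    name: str
    matches: int            # number of matching aligned positions
    alignment_length: int   # complete alignment length (assumed to include gaps)


def identity(hit: Hit) -> float:
    """Sequence identity: matches divided by the complete alignment length."""
    return hit.matches / hit.alignment_length


def keep_hit(hit: Hit, min_identity: float = 0.30,
             max_identity: float = 0.94, min_length: int = 50) -> bool:
    """Apply criteria (a) and (b); both identity bounds are inclusive."""
    if hit.alignment_length < min_length:
        return False
    return min_identity <= identity(hit) <= max_identity


hits = [
    Hit("too_similar", 99, 100),   # 99% identity, above the upper bound
    Hit("too_short", 30, 40),      # alignment shorter than 50
    Hit("kept", 60, 120),          # 50% identity over 120 positions
]
retained = [h.name for h in hits if keep_hit(h)]
```

Here only the third hit survives the filter, so `retained` contains the single name "kept".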

The resulting multiple alignment is used by the new version of the PSIC software (Position-Specific Independent Counts) to calculate the so-called profile matrix. Elements of the matrix (profile scores) are logarithmic ratios of the likelihood of given amino acid occurring at a particular position to the likelihood of this amino acid occurring at any position (background frequency).

PolyPhen computes the absolute value of the difference between the profile scores of both allelic variants at the polymorphic position. Large values of this difference may indicate that the studied substitution is rarely or never observed in the protein family. PolyPhen also shows the number of aligned sequences at the query position. This number may be used to assess the reliability of profile score calculations.
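
As a rough illustration of the two quantities just described, the sketch below computes a profile score as a logarithmic likelihood ratio and then the absolute score difference. It assumes the two likelihoods are already known and uses the natural logarithm; the actual PSIC software estimates position-specific likelihoods from the alignment in a more involved way, which is not reproduced here. The function names and numbers are hypothetical.

```python
import math


def profile_score(p_at_position: float, p_background: float) -> float:
    """Log ratio of the likelihood at this position to the background frequency.

    The natural logarithm is an assumption made for this sketch.
    """
    return math.log(p_at_position / p_background)


def score_difference(score1: float, score2: float) -> float:
    """Absolute difference between the profile scores of the two variants."""
    return abs(score1 - score2)


# Hypothetical numbers: variant 1 is common at this position, variant 2 is rare.
score1 = profile_score(0.40, 0.05)
score2 = profile_score(0.01, 0.05)
delta = score_difference(score1, score2)
```

With these made-up likelihoods the difference equals ln(40), i.e. roughly 3.7, which would count as a large value.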

See also:

BLAST

nrdb

PSIC paper

PSIC server

  I.3.Calculation of structural parameters and contacts
Mapping of amino acid replacement to the known 3D structure reveals whether the replacement is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands or other important features of a protein. If the spatial structure of a query protein is unknown, one can use the homologous proteins with known structure.  
  I.3.1.Mapping of the substitution site to known protein 3D structures
PolyPhen BLASTs the query sequence (with options `-F F -b 500 -v 500') against a protein structure database (PDB or PQS) and by default retains all hits that meet the following criteria:

(a) sequence identity threshold is set to 50%, since this value guarantees the conservation of basic structural characteristics
(b) minimal hit length is set to 100
(c) maximal number of gaps is set to 20

By default, a hit is rejected if its amino acid at the corresponding position differs from the amino acid in the input sequence. The position of the substitution is then mapped onto the corresponding positions in all retained hits. Hits are sorted according to the sequence identity or E-value of the sequence alignment with the input protein.
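
A minimal sketch of this filtering and sorting step is given below. The `StructureHit` class, its fields, and the function name are hypothetical; the thresholds are the defaults quoted above, with identity expressed as a fraction.

```python
from dataclasses import dataclass


@dataclass
class StructureHit:
    """A hit against the structure database; field names are illustrative."""
    structure_id: str
    identity: float   # fraction between 0 and 1
    length: int       # alignment length
    gaps: int         # total gap length in the alignment
    e_value: float
    residue: str      # amino acid of the hit at the mapped position


def filter_structure_hits(hits, query_residue, min_identity=0.5, min_length=100,
                          max_gaps=20, map_to_mismatch=False, sort_by="evalue"):
    """Keep hits passing criteria (a)-(c) and the residue match, then sort them."""
    kept = [
        h for h in hits
        if h.identity >= min_identity
        and h.length >= min_length
        and h.gaps <= max_gaps
        and (map_to_mismatch or h.residue == query_residue)
    ]
    if sort_by == "identity":
        kept.sort(key=lambda h: h.identity, reverse=True)   # most similar first
    else:
        kept.sort(key=lambda h: h.e_value)                  # most significant first
    return kept


hits = [
    StructureHit("hit1", 0.90, 150, 2, 1e-50, "R"),
    StructureHit("hit2", 0.40, 150, 2, 1e-20, "R"),   # identity too low
    StructureHit("hit3", 0.60, 120, 5, 1e-60, "K"),   # mismatching residue
    StructureHit("hit4", 0.70, 200, 0, 1e-80, "R"),
]
by_evalue = [h.structure_id for h in filter_structure_hits(hits, "R")]
```

In this made-up example `by_evalue` is `["hit4", "hit1"]`; passing `map_to_mismatch=True` would additionally retain `hit3`.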

See also:

PDB

PQS

  I.3.2.Structural parameters
Further analysis performed by PolyPhen is based on the use of several structural parameters. Importantly, although all parameters are reported in the output, only some of them are used in the final decision rules.  
  I.3.2.1.Parameters taken from DSSP
PolyPhen uses the DSSP (Dictionary of Secondary Structure in Proteins) database to get the following structural parameters for the mapped amino acid residues:
  • Secondary structure (according to the DSSP nomenclature)
  • Solvent accessible surface area (absolute value in Ų)
  • Phi-psi dihedral angles
See also:

DSSP

  I.3.2.2.Calculated parameters
The following values are calculated by PolyPhen:
  • Normed accessible surface area: the absolute value divided by the maximal area defined as a 99%-quantile of surface area distribution for this particular amino acid type in PDB
  • Change in accessible surface propensity resulting from the substitution. Accessible surface propensities (knowledge-based hydrophobic "potentials") are logarithmic ratios of the likelihood of a given amino acid occurring at a site with a particular accessibility to the likelihood of this amino acid occurring at any site (background frequency)
  • Change in residue side chain volume measured in Å³. Side chain volumes are here
  • Region of the phi-psi map (Ramachandran map) derived from the residue dihedral angles. Ramachandran map is here
  • Normalised B-factor (temperature factor) for the residue. B-factor, or temperature factor, is used in crystallographic studies of macromolecules to characterise the "mobility" of an atom. It is believed [Chasman D, Adams RM (2001)] that the values of B-factor of a residue may be correlated with its tolerance to amino acid substitutions.
By default, all parameters above are calculated for the first hit only.
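
The normed accessible surface area can be sketched as below. The sample of surface areas is invented for illustration (PolyPhen derives the distribution from PDB for each amino acid type), and the choice of the "inclusive" quantile method is an assumption of this sketch.

```python
import statistics


def max_area_99(areas):
    """99%-quantile of a surface area distribution for one amino acid type."""
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(areas, n=100, method="inclusive")[98]


def normed_accessibility(absolute_area, max_area):
    """Absolute accessible surface area divided by the maximal area."""
    return absolute_area / max_area


# Invented distribution: areas 0, 1, ..., 100 (in square angstroms).
sample_areas = list(range(101))
max_area = max_area_99(sample_areas)
normed = normed_accessibility(14.85, max_area)
```

For this invented sample the maximal area is 99.0, so a residue with an absolute area of 14.85 gets a normed accessibility of about 0.15. Values slightly above 1 are possible for residues in the top 1% of the distribution.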
 
  I.3.3.Contacts
The presence of specific spatial contacts of a residue may reveal its role in protein function. The suggested default threshold for all contacts to be displayed in the output is 6Å. However, the value of 3Å is used in the decision rule. To evaluate a contact between two sets of atoms, PolyPhen finds the minimal distance over all pairs of atoms taken one from each set.
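
The contact evaluation can be illustrated with a short sketch. Atoms are represented here simply as (x, y, z) tuples in Å, and the function names are hypothetical; a contact is assumed to exist when the minimal distance is strictly below the threshold.

```python
import math
from itertools import product


def min_distance(atoms_a, atoms_b):
    """Minimal distance over all pairs of atoms, one taken from each set."""
    return min(math.dist(a, b) for a, b in product(atoms_a, atoms_b))


def in_contact(atoms_a, atoms_b, threshold):
    """True if the two atom sets are closer than the threshold (in angstroms)."""
    return min_distance(atoms_a, atoms_b) < threshold


# Made-up coordinates for a residue and a ligand.
residue_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ligand_atoms = [(4.0, 0.0, 0.0), (9.0, 0.0, 0.0)]

closest = min_distance(residue_atoms, ligand_atoms)
shown_in_output = in_contact(residue_atoms, ligand_atoms, 6.0)
used_in_rule = in_contact(residue_atoms, ligand_atoms, 3.0)
```

In this example the closest pair is exactly 3.0 Å apart, so the contact would be displayed under the 6Å threshold but would not satisfy the stricter 3Å condition of the decision rule.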

By default, contacts are calculated for all found hits with known structure. This is essential for the cases when several PDB(PQS) entries correspond to one protein, but carry different information about complexes with other macromolecules and ligands (for example, see Fig.2 in [Sunyaev et al 2001]).

PolyPhen checks three types of contacts for a variable amino acid residue:

 
  I.3.3.1.Contacts with heteroatoms
Contacts with ligands, defined as all heteroatoms excluding water and "non-biological" crystallographic ligands that are believed to be related to the structure determination procedure rather than to the biological function of a protein (I.Koch, 2001, personal communication).

See also:

PDB HETATM

  I.3.3.2.Interchain contacts
Interactions between subunits of the protein molecule. Technically, they are defined as contacts of a polymorphic residue with residues from other polypeptide chains present in the PDB(PQS) file. For this particular type of interaction, it is more advantageous to use the PQS (Protein Quaternary Structure) database rather than PDB, since PQS entries are supposed to provide a more adequate picture of protein quaternary structure. PolyPhen discards interchain contacts from PQS entries with multimers that, according to the PQS annotation, may not represent a true biological molecule and are denoted as XPACK in the PQS ASALIST file.

See also:

PQS

  I.3.3.3.Contacts with functional sites
The third type of contact analysed by PolyPhen is contact with residues critical for protein function (BINDING, ACT_SITE, LIPID, and METAL), where the latter are derived from sequence annotation.
  II.Prediction
PolyPhen uses empirically derived rules to predict that an nsSNP is
  • probably damaging, i.e., it is with high confidence supposed to affect protein function or structure
  • possibly damaging, i.e., it is supposed to affect protein function or structure
  • benign, most likely lacking any phenotypic effect
  • unknown, when in some rare cases the lack of data does not allow PolyPhen to make a prediction
 
  II.1.Prediction rules
The table below contains rules used by PolyPhen to predict effect of nsSNPs on protein function and structure. One row corresponds to one rule which may consist of several parts connected by logical "and". For a given substitution, all rules are tried one by one, resulting in prediction of functional effect. If no evidence for damaging effect is seen, substitution is considered benign.

Prediction basis and Substitution effect are described below.

The first three columns after the rule number are the parts of a rule, connected with logical AND.

Rule | PSIC score difference | Substitution site properties | Substitution type properties | Prediction | Basis | Effect
1 | arbitrary | annotated as a functional site+ | arbitrary | probably damaging | sequence annotation | functional, functional site (2.2)
2 | arbitrary | annotated as a bond formation site++ | arbitrary | probably damaging | sequence annotation | structural, bond formation (1.2)
3 | arbitrary | in a region annotated as transmembrane | PHAT matrix difference resulting from substitution is negative | possibly damaging | sequence annotation | functional, functional site, transmembrane (2.2.2)
4 | arbitrary | in a region predicted as transmembrane | | possibly damaging | sequence prediction |
5 | <=0.5 | arbitrary | arbitrary | benign | multiple alignment |
6 | >1.0 | atoms are closer than 3Å to atoms of a ligand | arbitrary | probably damaging | structure | functional, functional site, ligand binding (2.2.3)
7 | | atoms are closer than 3Å to atoms of a residue annotated as BINDING, ACT_SITE, or SITE | arbitrary | probably damaging | structure | functional, functional site, indirect (2.1)
8 | in the interval (0.5..1.5] | with normed accessibility <=15% | change of accessible surface propensity is >=0.75 | possibly damaging | structure | structural, buried site, hydrophobicity disruption (1.1.1)
9 | | | change of side chain volume is >=60 | possibly damaging | structure | structural, buried site, overpacking (1.1.2)
10 | | | change of side chain volume is <=-60 | possibly damaging | structure | structural, buried site, cavity creation (1.1.3)
11 | | with normed accessibility <=5% | change of accessible surface propensity is >=1.0 | probably damaging | structure | structural, buried site, hydrophobicity disruption (1.1.1)
12 | | | change of side chain volume is >=80 | probably damaging | structure | structural, buried site, overpacking (1.1.2)
13 | | | change of side chain volume is <=-80 | probably damaging | structure | structural, buried site, cavity creation (1.1.3)
14 | in the interval (1.5..2.0] | | change of accessible surface propensity is >=1.0 | probably damaging | structure | structural, buried site, hydrophobicity disruption (1.1.1)
15 | | | change of side chain volume is >=80 | probably damaging | structure | structural, buried site, overpacking (1.1.2)
16 | | | change of side chain volume is <=-80 | probably damaging | structure | structural, buried site, cavity creation (1.1.3)
17 | | arbitrary | arbitrary | possibly damaging | structure | structural, buried site, cavity creation (1.1.3)
18 | >2.0 | arbitrary | arbitrary | probably damaging | multiple alignment |
+BINDING, ACT_SITE, SITE, MOD_RES, LIPID, METAL, SE_CYS
++DISULFID, THIOLEST, THIOETH
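
The two footnotes translate directly into a lookup for rules 1 and 2, the only rules that depend solely on the feature table key of the substitution site. The sketch below covers just these two rules; the function name and the returned tuple layout are hypothetical, and the remaining rules are deliberately left out.

```python
# Feature keys taken from the two footnotes above.
FUNCTIONAL_SITE_KEYS = {"BINDING", "ACT_SITE", "SITE", "MOD_RES",
                        "LIPID", "METAL", "SE_CYS"}
BOND_FORMATION_KEYS = {"DISULFID", "THIOLEST", "THIOETH"}


def annotation_rule(ft_key):
    """Return (rule number, prediction, effect) for rules 1 and 2, else None."""
    key = ft_key.upper()
    if key in FUNCTIONAL_SITE_KEYS:
        return (1, "probably damaging", "functional, functional site (2.2)")
    if key in BOND_FORMATION_KEYS:
        return (2, "probably damaging", "structural, bond formation (1.2)")
    return None   # not decided by rules 1 or 2; later rules would be tried


rule_for_active_site = annotation_rule("ACT_SITE")
rule_for_disulfide = annotation_rule("DISULFID")
rule_for_other = annotation_rule("TRANSMEM")
```

A site annotated as ACT_SITE is matched by rule 1 and one annotated as DISULFID by rule 2, while TRANSMEM falls through to the later rules.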
 
  II.2.Available data
PolyPhen makes its predictions using three main sources of data:

(1) FT, sequence annotation (or prediction) being a fragment of UniProt feature table (FT) describing the substitution position,
(2) alignment, PSIC profile scores derived from multiple alignment,
(3) structure, structural information, obtained if a search against structural database was successful.

The presence of all three data sources indicates the highest reliability of a prediction. However, as a rough estimate, only about 10% of all sequences can be expected to have homologous proteins with known structure.

 
  II.2.Prediction basis
As can be seen from the table above, a prediction is based on one of the following:
  • sequence annotation
  • sequence prediction
  • multiple alignment
  • structure
depending on the rule used to make it.
 
  II.3.Substitution effect
For some rules predicting a damaging effect, a brief description of the expected effect is given. The current hierarchical classification of possible effects is as follows:
   1. structural
      1.1. buried site
         1.1.1. hydrophobicity disruption
         1.1.2. overpacking
         1.1.3. cavity creation
      1.2. bond formation
         1.2.1. covalent bond
             1.2.1.1. disulphide
             1.2.1.2. thioester
             1.2.1.3. thioether
         1.2.2. non-covalent
             1.2.2.1. hydrogen
             1.2.2.2. salt bridge (electrostatic)

   2. functional
      2.1. indirect
      2.2. functional site
         2.2.1. signal peptide
         2.2.2. transmembrane
         2.2.3. ligand binding
         2.2.4. protein interaction

The purpose of this classification is, on the one hand, to suggest an explanation of a damaging effect without strict reference to the prediction rule, which relies on technical details. On the other hand, this classification leaves room for further improvement of the method.

 
  III.PolyPhen input
PolyPhen works with human proteins and identifies them either by ID or accession number from the hs_swall database or by the amino acid sequence itself. In the latter case, PolyPhen tries to find an exact match of the sequence in hs_swall. If a sequence is identified as a database entry, all entry information (complete sequence, FT, etc.) is used. The amino acid replacement is characterised by a position number and a substitution consisting of two amino acid variants, AA1 and AA2.
  III.1.Query data
The input form contains the following fields:

Protein identifier (ACC or ID) from the UniProt database which is case-insensitive, e.g., pexa_human, XYZ_HUMAN, P12345, p12345, aah01234. PolyPhen maps this value to primary accession number and works with it.

If identifier of the sequence is known then it is sufficient to enter it into the Protein identifier (accession or name) from the UniProt database input field. The Amino acid sequence in FASTA format should be left blank in this case.

Normally, the identifier is a UniProt protein accession or entry name, e.g., Q5I7T1, AG10B_HUMAN. Note that the extended accession syntax which includes the entry version number (e.g., Q8IVL6-2) is currently not supported. The identifier should be a single word. Spaces in the identifier string will cause errors and should be avoided. Examples of illegal input are:

    ACCESSION   P01013
    Q8IVL6-2
    gi|129295

For the first example word ACCESSION should be removed, while in the second example the version number of the accession (i.e., -2) should be removed. In the third example, the identifier used is not a UniProt one.
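
The three kinds of illegal input can be caught with a simple syntactic pre-check, sketched below. This is not PolyPhen's own validation: the function name is hypothetical, upper-casing is just one way to reflect case-insensitivity, and treating any "|" as a sign of a non-UniProt identifier is a heuristic assumed for this example. The check does not verify that the identifier actually exists in UniProt.

```python
import re


def check_identifier(raw):
    """Return a normalised identifier or raise ValueError for illegal input."""
    ident = raw.strip()
    if not ident or re.search(r"\s", ident):
        raise ValueError("identifier must be a single word without spaces")
    if re.search(r"-\d+$", ident):
        raise ValueError("accession with a version number is not supported")
    if "|" in ident:
        raise ValueError("identifier does not look like a UniProt one")
    return ident.upper()   # identifiers are case-insensitive


normalised = check_identifier("pexa_human")

rejected = []
for text in ("ACCESSION   P01013", "Q8IVL6-2", "gi|129295"):
    try:
        check_identifier(text)
    except ValueError:
        rejected.append(text)
```

All three illegal examples listed above end up in `rejected`, while "pexa_human" is accepted and normalised to upper case.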

Amino acid sequence in FASTA format, which should obey the "classical" FASTA format, e.g., include a definition line which provides the sequence identifier.

The user should complete only one of the fields above.

Position is checked not to exceed the protein length.

Substitution is given by two amino acid variants; the first one is checked to correspond to the actual protein sequence, whereas the second is checked to differ from the first one.
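
The position and substitution checks described above can be sketched as a small function. Positions are assumed to be 1-based, and the function name and the example sequence are hypothetical.

```python
def check_substitution(sequence, position, aa1, aa2):
    """Raise ValueError unless the substitution is consistent with the sequence."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position exceeds the protein length")
    if sequence[position - 1] != aa1:
        raise ValueError("AA1 does not correspond to the protein sequence")
    if aa1 == aa2:
        raise ValueError("AA2 must differ from AA1")


def is_valid(sequence, position, aa1, aa2):
    """Convenience wrapper returning True or False instead of raising."""
    try:
        check_substitution(sequence, position, aa1, aa2)
    except ValueError:
        return False
    return True


toy_sequence = "MKTAYIAK"          # made-up sequence of length 8
ok = is_valid(toy_sequence, 3, "T", "A")
bad_position = is_valid(toy_sequence, 9, "K", "R")
bad_aa1 = is_valid(toy_sequence, 3, "K", "R")
same_variant = is_valid(toy_sequence, 3, "T", "T")
```

Only the first query passes: the second points past the end of the sequence, the third gives the wrong first variant, and the fourth uses identical variants.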

Description is an optional short string (up to 60 characters) providing descriptive name and/or comment for your query. It will be displayed in the query management page to facilitate identifying particular query instances which may be useful when you submit a large number of them.

See also:

FASTA format

UniProt

  III.2.Options
Structural database (PDB/PQS)
PolyPhen can use two protein structure databases, PDB and PQS. In general, queries against PDB can be faster than those against PQS. However, use of PQS (default) is strongly recommended if a user is concerned with residue contacts, especially inter-subunit.

Sort hits by (Identity/E-value)
Hits are sorted according to the sequence identity or E-value (default) of the sequence alignment with the input protein.

Map to mismatch (No/Yes)
By default, a hit is rejected if its amino acid at the corresponding position differs from the amino acid in the input sequence. Mapping to a mismatching amino acid residue should be used with caution, and only when a protein with known structure and a matching amino acid cannot be found.

Calculate structural parameters (For first hit only/For all hits)
In some cases a user may want to check the conservation of structural parameters of a residue in all hits. By default, parameters are calculated for the first hit only, since they are expected to be very close in all homologous structures.

Calculate contacts (For first hit only/For all hits)
Contrary to the structural parameters, contacts are by default calculated for all found hits with known structure. This is essential for the cases when several PDB(PQS) entries correspond to one protein, but carry different information about complexes with other macromolecules and ligands (for example, see Fig.2 in [Sunyaev et al 2001]).

Minimal alignment length (integer number, default: 100)
PolyPhen will filter out hits with structure whose alignment length with the query sequence is smaller than the given value.

Minimal identity in alignment (floating point value, not exceeding 1, default: 0.5)
Hits with structure whose sequence identity to the query sequence is smaller than the given threshold are filtered out.

Maximal gap length in alignment (integer number, default: 20)
PolyPhen will filter out hits with structure whose alignment with the query sequence contains gaps with total length greater than this value.

Threshold for contacts (floating point value, default: 6.0Å)
PolyPhen will report residue contacts below this threshold.

See also:

PDB

PQS

  IV.PolyPhen output
PolyPhen output is divided into three main sections and consists mainly of the tables whose contents are discussed below.  
  IV.1.Query
This section contains query data, mostly resembling the input:

Acc number For entries from hs_swall this column contains link to the SRS system.
Position Substitution position.
AA1 First amino acid variant.
AA2 Second amino acid variant.
Description For entries from hs_swall this column contains protein description from the corresponding database field.
See also:

SRS

  IV.2.Prediction
This section contains prediction itself, e.g., "This variant is predicted to be probably damaging", and the supporting information:

Available data FT, alignment, structure
Data available for prediction as described above
Prediction benign, possibly damaging, probably damaging, unknown: one of four predictions, also see above
Prediction basis sequence annotation, sequence prediction, multiple alignment, structure: also see above
Substitution effect For some rules predicting a damaging effect, a brief description of the expected effect is given. The hierarchy of possible damaging effects is given above. In this column PolyPhen also shows a more "friendly" description of the effect, e.g., Hydrophobicity change at buried site that corresponds to
1.1.1. structural, buried site, hydrophobicity disruption
Prediction data In case of a damaging substitution, this column summarises (mostly quantitative) data used to make a prediction, e.g.,
Normed accessibility: 0.07, Hydrophobicity change: 1.3
Remarks Amino acid replacement features that were not used when making prediction, but may nevertheless be interesting, e.g., interchain contacts of a residue.
 
  IV.3.Details
This section contains all data processed by PolyPhen:  
  IV.3.1.Sequence features of the substitution site
Please see the Sequence-based characterisation of the substitution site section for more general information.

Region Annotated or predicted specific sequence region embracing the substitution site, as it is given in UniProt FT, e.g., TRANSMEM
Site If a substitution occurs at a specific site, its name from UniProt FT is given, e.g., BINDING, ACT_SITE
Feature table Link to the FT part of the hs_swall entry available via SRS
Critical sites Memorised positions which are annotated in hs_swall as BINDING, ACT_SITE, LIPID, and METAL. These sites will be checked for a spatial contact with the substitution position
 
  IV.3.2.PSIC profile scores for two amino acid variants
Please see Calculation of PSIC profile scores for two amino acid variants above for more details.

Score1 Profile score for the first amino acid variant
Score2 Profile score for the second amino acid variant
|Score1-Score2| Absolute difference between two profile scores
Observations Number of amino acids observed at the substitution position of the multiple alignment
Diagnostics Brief description of profile matrix calculation result:
  • calculated, successfully calculated de novo
  • precomputed, used precomputed alignment (available only for some hs_swall entries)
  • empty blast, blast found no hits for the sequences
  • all sequences filtered out, blast found some hits, but they were all filtered out
  • huge blast, blast search produced too many hits (output file >300Mb)
  • cached, alignment was cached in the temporary directory. This is the case when several queries for the same hs_swall protein are run
Multiple alignment around substitution position By clicking on the Show alignment button, one can view a fragment or the complete multiple alignment with marked substitution position column. User can choose a number of sequences and width of flanks around substitution position to show.
 
  IV.3.3.Structural parameters and contacts
Please see Calculation of structural parameters and contacts above for more details.  
  IV.3.3.1.Mapping of the substitution site to known protein 3D structures   Please see Mapping of the substitution site to known protein 3D structures above for more details.

Database Database (PDB or PQS) used to search for homologous proteins with known structure
Initial number of structures Initial number of hits (i.e., before filtration) found in structural database
Number of structures Number of relevant hits, i.e., hits left after filtration
Num Hit number
ID Structure identifier from PDB/PQS
  Polypeptide chain identifier from PDB/PQS; empty chain is denoted with "_"
Res Residue number from PDB/PQS
AA Amino acid one-letter code
E-value Alignment expectation value calculated by BLAST
Len Alignment length
Ide Sequence identity between query sequence and aligned hit with structure
Gaps Total length of gaps in alignment
Params Link to corresponding row in the Structural parameters table
Cont Link(s) to corresponding row in the Contacts table. A cell may contain up to three links, depending on different contact types of a residue
PDB TITLE Protein description taken from PDB/PQS TITLE field
 
  IV.3.3.2.Structural parameters   Please see Structural parameters above for more details.

Num Hit number
ID Structure identifier from PDB/PQS
  Polypeptide chain identifier from PDB/PQS; empty chain is denoted with "_"
Res Residue number from PDB/PQS
SecStr Secondary structure (according to the DSSP nomenclature)
Acc Absolute value of accessible surface area in Å²
Acc Normed Normed accessible surface area
dPropens Change in accessible surface propensity resulting from the substitution
(Phi, Psi) Phi-psi dihedral angles
Map Region Region of the phi-psi map (Ramachandran map) derived from the residue dihedral angles
dVol Change in residue side chain volume measured in Å³
Normed B-factor Normalised B-factor (temperature factor) for the residue
 
  IV.3.3.3.Contacts   Please see Contacts above for more details.

Num Hit number
ID Structure identifier from PDB/PQS
  Polypeptide chain identifier from PDB/PQS; empty chain is denoted with "_"
Res Residue number from PDB/PQS
Heteroatoms List of contacts with heteroatoms in the "residue+chain/name/distance" format, e.g., 901B K21 5.865A
Interchain List of contacts with other chains in the "residue+chain/name/distance" format, e.g., 258A VAL 4.256A
Critical sites List of contacts with critical sites in the "residue+chain/name/distance" format, e.g., 123B SER 2.256A
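
A single contact entry in the "residue+chain/name/distance" format can be parsed as sketched below. The regular expression is inferred from the three examples above only; it assumes a plain integer residue number, a one-character chain identifier, and a distance followed by "A". The `Contact` type and function name are hypothetical.

```python
import re
from typing import NamedTuple


class Contact(NamedTuple):
    residue: int      # residue number of the contacting group
    chain: str        # one-character chain identifier
    name: str         # residue or heteroatom group name
    distance: float   # distance in angstroms


# Pattern inferred from the examples in the text, not from a formal specification.
CONTACT_RE = re.compile(r"(-?\d+)(\S)\s+(\S+)\s+(\d+(?:\.\d+)?)A")


def parse_contact(text):
    """Parse one contact entry such as '901B K21 5.865A'."""
    match = CONTACT_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised contact entry: {text!r}")
    residue, chain, name, distance = match.groups()
    return Contact(int(residue), chain, name, float(distance))


heteroatom = parse_contact("901B K21 5.865A")
interchain = parse_contact("258A VAL 4.256A")
critical = parse_contact("123B SER 2.256A")
```

After parsing, the distances can be compared numerically, for example to pick out contacts shorter than the 3Å value used in the decision rule.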
 
  V.PolyPhen SNP data collection
We present a comprehensive analysis of all human nsSNPs available from HGVBase, version 12, with regard to their possible effect on protein structure and function.

SNP data collection page

See also:

HGVBase

  V.1.Mapping known human SNPs to genes
The first necessary step in the analysis of nsSNPs is to identify whether an SNP represented by a variation in nucleotide sequence is (a) coding, that is, resides in a coding part of a human gene, and (b) non-synonymous, that is, results in an amino acid variation at the protein sequence level.

Flanking genomic sequences of SNPs from HGVbase with length 25 bp each have been translated in all six possible frames and searched for exact match in the hs_swall protein set which contains all human proteins from the UniProt database.

Protein sequences and genomic fragments have been pre-processed by the SEG, XNU, RepeatMasker, and DUST programs, which are used to filter out areas of low compositional complexity, regions containing internal repeats of short periodicity, and known human genomic repeat sequences. ALU subfamily proteins were also excluded from the set. We did not consider SNP entries which have N's in their flanking sequences, since this makes translation ambiguous.

We required that at least one translated flanking sequence should have an exact match with a database protein sequence. If such a match was detected, we further required that the second flanking sequence either has an exact match with the protein sequence or matches the protein sequence in all positions until the end of the protein or a conventional exon/intron border is observed. The mapping quality is given for each SNP:

LR, both flanks completely match the protein sequence;
L?, exact match of left flank and partial match of the right one;
?R, vice versa

See also:

Filtering sequences

  V.2.Data collection statistics
After processing of HGVbase, version 12 (983,589 SNP entries), we obtained a set of 20,462 coding SNPs. Of them, 11,152 happened to be non-synonymous, whereas 9,310 are synonymous SNPs and do not produce any change of the amino acid sequence. The nsSNPs formed our data collection.

Detailed prediction statistics can be found in the SNP data collection page.

 
  V.3.Data collection search
Data collection is a searchable table in which one row corresponds to one nsSNP. To perform a search, one has to fill in a text field and adjust search parameters (field to search in, SNP types to search, etc) if needed. Text search is case-insensitive. Basic pattern matching symbols can be used:

* asterisk matching everything,
? question mark matching exactly one symbol,
[] brackets to set a character class

For example, search for "transcription factor [12] " in the description field gives 12 hits in transcription factors of type 1 and 2. Note that unless set otherwise, search engine tries to match the search text to any part of the searched field, not to the whole field. Other search parameters are self-explanatory.
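
The described matching behaviour (case-insensitive, with `*`, `?` and `[]`, matching any part of the field) resembles shell-style wildcards, so it can be approximated with Python's standard `fnmatch` module. This is only an approximation of the search engine, whose implementation is not described here; the function name and the sample descriptions are made up.

```python
from fnmatch import fnmatchcase


def field_matches(pattern, field):
    """Case-insensitive wildcard match against any part of the field."""
    # Surrounding the pattern with '*' lets it match anywhere in the field.
    return fnmatchcase(field.lower(), "*" + pattern.lower() + "*")


descriptions = [
    "General transcription factor 2, polypeptide",
    "Transcription factor 1",
    "Transcription factor 3",
    "Zinc finger protein",
]
hits = [d for d in descriptions if field_matches("transcription factor [12]", d)]
```

With these sample descriptions the pattern selects the first two entries only, since "[12]" accepts a "1" or a "2" but not a "3".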

 
  V.4.Data collection format
A detailed description of the data collection format can be found in the SNP data collection page.