Most of human genetic variation is represented by SNPs (Single
Nucleotide Polymorphisms) and many of them are believed
to cause phenotypic differences between human individuals.
We specifically focus on non-synonymous SNPs (nsSNPs), i.e., SNPs located in coding regions and resulting in amino acid variation in the protein products of genes. Several recent studies have shown that the impact of amino acid allelic variants on protein structure and function can be reliably predicted via analysis of multiple sequence alignments and protein 3D structures. As we demonstrated in earlier work, these predictions correlate with the effect of natural selection seen as an excess of rare alleles. Therefore, predictions at the molecular level reveal SNPs affecting actual phenotypes.
PolyPhen (=Polymorphism Phenotyping) is an automatic tool for predicting the possible impact of an amino acid substitution on the structure and function of a human protein. The prediction is based on straightforward empirical rules which are applied to the sequence, phylogenetic and structural information characterizing the substitution.
For a given amino acid substitution in a human protein, PolyPhen performs several steps:
I.1.Sequence-based characterisation of the substitution site
A substitution may occur at a specific site, e.g., an active or binding site,
or in a non-globular, e.g., transmembrane, region. PolyPhen tries to
identify the query protein as an entry in hs_swall,
the human protein subset of the UniProt database, and uses the FT
(feature table) section of the corresponding entry.
Information on FT records can be found in the Swiss-Prot Help.
PolyPhen checks if the amino acid replacement occurs at a site which
is annotated in hs_swall as
PolyPhen also checks if the substitution site is located in the region annotated as
For a substitution in an annotated or predicted transmembrane region, PolyPhen uses the PHAT transmembrane-specific matrix score to evaluate the possible functional effect of the nsSNP.
I.2.Calculation of PSIC profile scores for two amino acid variants
The amino acid replacement may be incompatible with the spectrum of
substitutions observed at the position in the family of homologous
proteins. PolyPhen identifies homologues of the input sequences via a
BLAST search in the nrdb database. The search is performed with the options
`-F T' (filtering enabled), `-e 1e-4' (E-value threshold set to 1e-4),
and `-b 500 -v 500' (500 hits shown).
After the BLAST search, the set of hits is filtered to retain hits
The resulting multiple alignment is used by the new version of the PSIC software (Position-Specific Independent Counts) to calculate the so-called profile matrix. Elements of the matrix (profile scores) are logarithmic ratios of the likelihood of a given amino acid occurring at a particular position to the likelihood of this amino acid occurring at any position (background frequency).
PolyPhen computes the absolute value of the difference between the profile scores of the two allelic variants at the polymorphic position. Large values of this difference may indicate that the studied substitution is rarely or never observed in the protein family. PolyPhen also shows the number of aligned sequences at the query position. This number may be used to assess the reliability of the profile score calculations.
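The score difference described above can be illustrated with a minimal Python sketch. It is not the PSIC algorithm itself: PSIC derives its position-specific counts with its own weighting of homologous sequences, whereas this sketch assumes plain residue counts with a pseudocount and a uniform background frequency. The function names and the example alignment column are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; real background frequencies differ per residue.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def profile_scores(column, background=UNIFORM_BACKGROUND, pseudocount=1.0):
    """Return ln(frequency at this position / background frequency) per residue."""
    counts = Counter(aa for aa in column if aa in background)
    total = sum(counts.values()) + pseudocount * len(background)
    return {
        aa: math.log(((counts[aa] + pseudocount) / total) / background[aa])
        for aa in background
    }


def score_difference(column, aa1, aa2, background=UNIFORM_BACKGROUND):
    """Absolute difference between the profile scores of the two variants."""
    scores = profile_scores(column, background)
    return abs(scores[aa1] - scores[aa2])


def aligned_count(column):
    """Number of sequences with a residue (not a gap) at this position."""
    return sum(1 for aa in column if aa in AMINO_ACIDS)


# Hypothetical alignment column: 10 aligned residues and 2 gaps.
column = "LLLLLLLLIV--"
difference = score_difference(column, "L", "D")
depth = aligned_count(column)
```

A substitution to a residue never seen in the column (here D) yields a large difference, and `depth` gives a rough sense of how much evidence supports it.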
I.3.Calculation of structural parameters and contacts
Mapping of an amino acid replacement to the known 3D structure reveals whether the replacement is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands or other important features of a protein. If the spatial structure of a query protein is unknown, one can use homologous proteins with known structure.
I.3.1.Mapping of the substitution site to known protein 3D structures
PolyPhen BLASTs (with options `-F F -b 500 -v 500') the query
sequence against a protein structure database (PDB or PQS) and by
default retains all hits that meet the following criteria:
(a) sequence identity threshold is set to 50%, since this value
guarantees the conservation of basic structural characteristics
By default, a hit is rejected if its amino acid at
the corresponding position differs from the amino acid in the input
sequence. The position of the substitution is then mapped onto the
corresponding positions in all retained hits. Hits are sorted
according to the sequence identity or
Further analysis performed by PolyPhen is based on the use of several structural parameters. Importantly, although all parameters are reported in the output, only some of them are used in the final decision rules.
I.3.2.1.Parameters taken from DSSP
PolyPhen uses DSSP (Dictionary of Secondary Structure in
Proteins) database to get the following structural parameters for
the mapped amino acid residues:
The following values are calculated by PolyPhen:
The presence of specific spatial contacts of a residue may reveal its
role in protein function. The suggested default threshold for
all contacts to be displayed in the output is 6Å. However, a
value of 3Å is used in the decision rule. To evaluate a
contact between two sets of atoms, PolyPhen finds the minimal distance
among all pairs of atoms drawn from the two sets.
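The minimal-distance definition of a contact can be sketched as follows. This is only an illustration under stated assumptions: atoms are given as (x, y, z) coordinates in Å, the thresholds are treated as inclusive, and the function names and example coordinates are hypothetical.

```python
import math

DISPLAY_THRESHOLD = 6.0   # contacts up to this distance are shown in the output
DECISION_THRESHOLD = 3.0  # contacts up to this distance enter the decision rule


def min_distance(atoms_a, atoms_b):
    """Smallest distance over all atom pairs taken from the two sets."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def evaluate_contact(atoms_a, atoms_b):
    """Report the minimal distance and how each threshold treats it."""
    distance = min_distance(atoms_a, atoms_b)
    return {
        "distance": distance,
        "displayed": distance <= DISPLAY_THRESHOLD,      # assumption: inclusive
        "used_in_rule": distance <= DECISION_THRESHOLD,  # assumption: inclusive
    }


# Hypothetical coordinates for a residue and a ligand.
residue_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ligand_atoms = [(5.0, 0.0, 0.0), (3.5, 0.0, 0.0)]
close_contact = evaluate_contact(residue_atoms, ligand_atoms)
distant_contact = evaluate_contact([(0.0, 0.0, 0.0)], [(0.0, 4.0, 0.0)])
```

The second example shows a contact that would be listed in the output (4Å) but would not satisfy the stricter 3Å cut-off.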
By default, contacts are calculated for all found hits with known structure. This is essential for the cases when several PDB(PQS) entries correspond to one protein, but carry different information about complexes with other macromolecules and ligands (for example, see Fig.2 in [Sunyaev et al 2001]).
PolyPhen checks three types of contacts for a variable amino acid residue:
I.3.3.1.Contacts with heteroatoms
Contacts with ligands, defined as all heteroatoms excluding water and "non-biological" crystallographic ligands that are believed to be related to the structure determination procedure rather than to the biological function of a protein (I.Koch, 2001, personal communication).
I.3.3.2.Interchain contacts
Interactions between subunits of the protein molecule. Technically, they are defined as contacts of a polymorphic residue with residues from other polypeptide chains present in the PDB(PQS) file. For this particular type of interaction, it is more advantageous to use the PQS (Protein Quaternary Structure) database rather than PDB, since PQS entries are supposed to provide a more adequate picture of protein quaternary structure. PolyPhen discards interchain contacts from PQS entries with multimers that, according to PQS annotation, may not represent a true biological molecule and are denoted as XPACK in the PQS ASALIST file.
I.3.3.3.Contacts with functional sites
The third type of contact analysed by PolyPhen is a contact with residues critical for protein function (BINDING, ACT_SITE, LIPID, and METAL), where the latter are derived from sequence annotation.
PolyPhen uses empirically derived rules to predict that an
The table below contains the rules used by PolyPhen to predict the effect of
nsSNPs on protein function and structure. One row corresponds to one
rule, which may consist of several parts connected by logical
"and". For a given substitution, all rules are tried one by
one, resulting in a prediction of the functional effect. If no evidence of a
damaging effect is found, the substitution is considered benign.
++DISULFID, THIOLEST, THIOETH
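The way such a rule table is evaluated can be sketched in Python. The two rules below are placeholders invented purely for illustration: their conditions, thresholds and verdicts are hypothetical and do not reproduce PolyPhen's actual table. The sketch also assumes that the first rule whose parts all hold determines the prediction.

```python
import math

BOND_ANNOTATIONS = {"DISULFID", "THIOLEST", "THIOETH"}

# Each rule is (parts, verdict); all parts must hold (logical "and").
# Hypothetical placeholder rules, NOT PolyPhen's real thresholds.
RULES = [
    (
        (lambda f: f.get("site_annotation") in BOND_ANNOTATIONS,),
        "probably damaging",
    ),
    (
        (
            lambda f: f.get("psic_difference", 0.0) > 2.0,
            lambda f: f.get("min_ligand_distance", math.inf) <= 3.0,
        ),
        "probably damaging",
    ),
]


def predict(features, rules=RULES):
    """Try the rules one by one; fall back to benign if none applies."""
    for parts, verdict in rules:
        if all(part(features) for part in parts):
            return verdict
    return "benign"


annotated_site = predict({"site_annotation": "DISULFID"})
partial_evidence = predict({"psic_difference": 2.5})
full_evidence = predict({"psic_difference": 2.5, "min_ligand_distance": 2.9})
```

In the second call only one part of the second rule holds, so the conjunction fails and the default verdict is returned.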
PolyPhen makes its predictions using three main sources of data:
(1) FT, sequence annotation (or prediction) being a fragment of
UniProt feature table (FT) describing the substitution position,
The presence of all three data sources indicates the highest reliability of a prediction. However, as a rough estimate, only about 10% of all sequences can be expected to have homologous proteins with known structure.
As can be seen from the table above, a prediction is based on one of the
For some rules predicting a damaging effect, a brief description of the
expected effect is given. The current hierarchical classification of possible
effects is as follows:
1. structural
  1.1. buried site
    1.1.1. hydrophobicity disruption
    1.1.2. overpacking
    1.1.3. cavity creation
  1.2. bond formation
    1.2.1. covalent bond
      1.2.1.1. disulphide
      1.2.1.2. thioester
      1.2.1.3. thioether
    1.2.2. non-covalent
      1.2.2.1. hydrogen
      1.2.2.2. salt bridge (electrostatic)
2. functional
  2.1. indirect
  2.2. functional site
    2.2.1. signal peptide
    2.2.2. transmembrane
    2.2.3. ligand binding
    2.2.4. protein interaction
The purpose of this classification is, on the one hand, to suggest an explanation of a damaging effect without strict reference to the prediction rule, which relies on technical details. On the other hand, the classification leaves room for further improvement of the method.
PolyPhen works with human proteins and identifies them either by ID or accession number from the hs_swall database or by the amino acid sequence itself. In the latter case, PolyPhen tries to find an exact match of the sequence in hs_swall. If a sequence is identified as a database entry, all entry information (complete sequence, FT, etc.) is used. The amino acid replacement is characterised by its position number and the substitution, consisting of two amino acid variants, AA1 and AA2.
The input form contains the following fields:
Protein identifier (ACC or ID) from the UniProt database, which is case-insensitive, e.g., pexa_human, XYZ_HUMAN, P12345, p12345, aah01234. PolyPhen maps this value to the primary accession number and works with it.
If the identifier of the sequence is known, it is sufficient to enter it into the Protein identifier (accession or name) from the UniProt database input field. The Amino acid sequence in FASTA format field should be left blank in this case.
Normally, the identifier is a UniProt protein accession or entry name, e.g., Q5I7T1, AG10B_HUMAN. Note that the extended accession syntax, which includes an entry version number (e.g., Q8IVL6-2), is currently not supported. The identifier should be a single word. Spaces in the identifier string cause errors and should be avoided. Examples of illegal input are:
ACCESSION P01013
Q8IVL6-2
gi|129295
In the first example, the word ACCESSION should be removed, while in the second example the version number of the accession (i.e., -2) should be removed. In the third example, the identifier used is not a UniProt one.
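A pre-submission check for the identifier rules above might look like the following sketch. It only tests the form of the string (a single word, no version suffix, no database prefix); it cannot tell whether an identifier actually exists in UniProt. The function name and the regular expression are assumptions made for illustration and are not part of PolyPhen.

```python
import re

# Assumption: letters and digits, optionally followed by "_" and more
# letters/digits (entry names such as AG10B_HUMAN). Nothing else is allowed.
_IDENTIFIER = re.compile(r"[A-Za-z0-9]+(?:_[A-Za-z0-9]+)?")


def normalise_identifier(raw):
    """Return the identifier in upper case, or raise ValueError if malformed."""
    text = raw.strip()
    if not _IDENTIFIER.fullmatch(text):
        raise ValueError(f"not a single-word UniProt-style identifier: {raw!r}")
    return text.upper()


def is_acceptable(raw):
    """Convenience wrapper returning True/False instead of raising."""
    try:
        normalise_identifier(raw)
    except ValueError:
        return False
    return True


examples = ["pexa_human", "P12345", "ACCESSION P01013", "Q8IVL6-2", "gi|129295"]
verdicts = {text: is_acceptable(text) for text in examples}
```

Because identifiers are case-insensitive, the sketch normalises accepted input to upper case before any lookup.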
Amino acid sequence in FASTA format, which should obey the "classical" FASTA format, e.g., include a definition line which provides a sequence identifier.
Only one of the two fields above should be completed.
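The requirement that exactly one of the two fields is filled in, and that a sequence comes with a definition line, can be expressed as a short sketch. The function names and the returned tuple layout are hypothetical; the only FASTA convention relied upon is that the definition line starts with ">" and its first word is the sequence identifier.

```python
def read_fasta(text):
    """Split a single FASTA record into (identifier, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' definition line")
    header = lines[0][1:].strip()
    if not header:
        raise ValueError("the definition line must provide a sequence identifier")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the definition line")
    return header.split()[0], sequence


def choose_input(identifier="", fasta=""):
    """Accept exactly one of the two input fields."""
    has_identifier = bool(identifier.strip())
    has_fasta = bool(fasta.strip())
    if has_identifier == has_fasta:
        raise ValueError("complete exactly one of the two fields")
    if has_identifier:
        return ("identifier", identifier.strip())
    return ("sequence", read_fasta(fasta))


by_sequence = choose_input(fasta=">query1 test protein\nMKT\nAYI\n")
by_identifier = choose_input(identifier="P12345")
```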
Description is an optional short string (up to 60 characters) providing descriptive name and/or comment for your query. It will be displayed in the query management page to facilitate identifying particular query instances which may be useful when you submit a large number of them.
PolyPhen can use two protein structure databases, PDB and PQS. In general, queries against PDB can be faster than those against PQS. However, use of PQS (default) is strongly recommended if a user is concerned with residue contacts, especially inter-subunit.
Map to mismatch
Calculate structural parameters
(For first hit only/For all hits)
(For first hit only/For all hits)
Minimal identity in alignment
(floating point value, not exceeding 1, default: 0.5)
Maximal gap length in alignment
(integer number, default: 20)
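How the alignment options above might act on a list of structure hits is sketched below. The `Hit` record and `select_hits` function are hypothetical names introduced for this example; the defaults (minimal identity 0.5, maximal gap length 20, hits with a mismatching residue rejected) follow the values stated in this document, and sorting uses sequence identity only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    entry: str            # structure entry identifier
    identity: float       # sequence identity as a fraction, 0..1
    longest_gap: int      # longest gap in the alignment
    residue_at_site: str  # residue aligned to the substitution position


def select_hits(hits, query_residue, min_identity=0.5, max_gap_length=20,
                map_to_mismatch=False):
    """Keep hits passing the thresholds, best sequence identity first."""
    kept = [
        hit for hit in hits
        if hit.identity >= min_identity
        and hit.longest_gap <= max_gap_length
        and (map_to_mismatch or hit.residue_at_site == query_residue)
    ]
    return sorted(kept, key=lambda hit: hit.identity, reverse=True)


# Hypothetical hits for a query with arginine (R) at the substitution site.
hits = [
    Hit("hitA", 0.62, 3, "R"),
    Hit("hitB", 0.95, 0, "R"),
    Hit("hitC", 0.48, 1, "R"),   # below the identity threshold
    Hit("hitD", 0.80, 25, "R"),  # gap too long
    Hit("hitE", 0.70, 2, "K"),   # mismatch at the site
]
strict = [hit.entry for hit in select_hits(hits, "R")]
relaxed = [hit.entry for hit in select_hits(hits, "R", map_to_mismatch=True)]
```

Enabling `map_to_mismatch` keeps the hit whose residue differs from the query, which the default setting rejects.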
PolyPhen output is divided into three main sections and consists mainly of tables whose contents are discussed below.
This section contains query data, mostly resembling the input:
This section contains prediction itself, e.g., "This variant is
predicted to be probably damaging", and the supporting information:
This section contains all data processed by PolyPhen:
IV.3.1.Sequence features of the substitution site
Please see the Sequence-based characterisation of the substitution site
section for more general
IV.3.2.PSIC profile scores for two amino acid variants
Please see Calculation of PSIC profile scores for two amino acid variants
above for more details.
IV.3.3.Structural parameters and contacts
Please see Calculation of structural parameters and contacts above for more details.
IV.3.3.1.Mapping of the substitution site to known protein 3D structures
Please see Mapping of the substitution site to known protein 3D structures
above for more details.
Please see Structural parameters
above for more details.
Please see Contacts
above for more details.
V.PolyPhen SNP data collection
We present a comprehensive analysis of all human nsSNPs available from HGVbase, version 12, with regard to their possible effect on protein structure and function.
V.1.Mapping known human SNPs to genes
The first necessary step in the analysis of nsSNPs is to identify whether
an SNP represented by a variation in the nucleotide sequence is (a) coding,
that is, resides in a coding part of a human gene, and (b) non-synonymous,
that is, results in an amino acid variation at the protein sequence level.
The flanking genomic sequences of SNPs from HGVbase, 25 bp each, were translated in all six possible frames and searched for an exact match in the hs_swall protein set, which contains all human proteins from the UniProt database.
Protein sequences and genomic fragments were pre-processed by the SEG, XNU, RepeatMasker, and DUST programs, which are used to filter out areas of low compositional complexity, regions containing internal repeats of short periodicity, and known human genomic repeat sequences. ALU subfamily proteins were also excluded from the set. We did not consider SNP entries which have N's in their flanking sequences, since this makes translation ambiguous.
We required that at least one translated flanking sequence have an exact match with a database protein sequence. If such a match was detected, we further required that the second flanking sequence either match the protein sequence exactly or match it in all positions up to the end of the protein or an observed conventional exon/intron border. The mapping quality is given for each SNP:
LR, both flanks completely match the protein
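The six-frame translation and exact-match test for a single flank can be sketched as follows. This is a simplified illustration, not the pipeline used to build the collection: it assumes the standard genetic code, treats the protein as a plain string, omits the low-complexity and repeat filtering, and does not handle the exon/intron border case. The function names and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    if set(dna) - set("ACGT"):
        # Flanks containing N (or other symbols) make translation ambiguous.
        raise ValueError("flank contains ambiguous bases")
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


def flank_matches(flank, protein):
    """True if any frame of the flank occurs exactly in the protein."""
    return any(peptide and peptide in protein
               for peptide in six_frame_translations(flank))


# Hypothetical 25 bp flank encoding KTAYIAKQ in its first forward frame.
protein = "MKTAYIAKQR"
flank = "AAAACCGCTTATATTGCCAAGCAGC"
frames = six_frame_translations(flank)
matched = flank_matches(flank, protein)
```

The same test applied to the reverse complement of the flank also succeeds, because the reverse-strand frames are translated as well.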
V.2.Data collection statistics
After processing of HGVbase, version 12 (983,589 SNP entries), we
obtained a set of 20,462 coding SNPs. Of these, 11,152 are
non-synonymous, whereas 9,310 are synonymous SNPs and do not produce
any change of the amino acid sequence. The nsSNPs formed our data
Detailed prediction statistics can be found in the SNP data collection page.
V.3.Data collection search
Data collection is a searchable table in which one row corresponds to
one nsSNP. To perform a search, one has to fill in a text field and
adjust search parameters (field to search in, SNP types to search,
etc) if needed. Text search is case-insensitive. Basic pattern
matching symbols can be used:
* asterisk matching everything,
For example, a search for "transcription factor  " in the description field gives 12 hits in transcription factors of type 1 and 2. Note that, unless set otherwise, the search engine tries to match the search text to any part of the searched field, not to the whole field. Other search parameters are self-explanatory.
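The matching behaviour described above (case-insensitive, asterisk as a wildcard, match against any part of the field unless whole-field matching is requested) can be approximated with a short sketch. It handles only the asterisk, since that is the one symbol described here; the function names and the `whole_field` parameter are hypothetical and not the search engine's actual interface.

```python
import re


def pattern_to_regex(pattern):
    """Escape literal text and turn each asterisk into 'match anything'."""
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def field_matches(pattern, field, whole_field=False):
    """Case-insensitive match against part of the field, or all of it."""
    regex = re.compile(pattern_to_regex(pattern), re.IGNORECASE | re.DOTALL)
    found = regex.fullmatch(field) if whole_field else regex.search(field)
    return found is not None


description = "General Transcription Initiation Factor 2"
partial = field_matches("transcription*factor", description)
whole = field_matches("transcription*factor", description, whole_field=True)
```

Here `partial` succeeds because the pattern occurs inside the description, while `whole` fails because the description has text before and after the matched part.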
V.4.Data collection format
A detailed description of the data collection format can be found in the SNP data collection page.