PolyPhen-2 relies heavily on sequence conservation estimates derived from multiple sequence alignments (MSA), and both alignment coverage and quality depend on the set of homologous sequences sourced from the UniRef100 database. The UniProt Consortium normally updates its databases monthly, while the PolyPhen-2 Web service is updated quarterly. If you installed your standalone version of PolyPhen-2 following the instructions included with the software, chances are that your local copy of the UniRef100 database (as well as of other databases) is more recent than the one used by the Web server.
PolyPhen-2 is not alone in suffering from this issue. In fact, a recent publication claims that PolyPhen-2 was the least affected of the four SNP prediction tools tested.
Other factors can also affect the prediction outcome, among them changes in structural databases (PDB and DSSP), in the classifier models, and in the analysis pipeline itself, but they are less likely to introduce noticeable inconsistencies.
Versions of all databases used for analysis are displayed on the PolyPhen-2 Web report page and included in all Batch query report files.
The issue was largely addressed in PolyPhen-2 v2.2.2, thanks to the integration of MultiZ genomic multiple alignments. This significantly expanded prediction coverage, especially in non-globular domains. Overall, close to 95% of all sequence positions in known UniProtKB proteins can now be classified successfully. However, you may still encounter UNKNOWN predictions in rare cases.
The most likely reason for such reports is that much of the sequence of your protein of interest is unalignable due to large stretches of repeats and/or strong compositional bias. For known proteins, you can easily check this by browsing the UniProt sequence annotations for your protein.
Such sequence features often make it impossible to search for homologous sequences, build reliable multiple alignments, and ultimately infer conservation scores. The issue affects many non-globular proteins, e.g. collagens, matrix proteins, DNA/RNA-binding proteins, muscle proteins and more.
You can see it for yourself: in the PolyPhen-2 Web interface, click on the “+” icon next to the “Multiple sequence alignment” label to inspect the MSA for your protein. You can also start the interactive MSA browser via the link at the bottom of the in-line alignment viewer section (requires the Java browser plug-in).
If this is the case, you will probably notice that the vast majority of the (top) query sequence has only gaps next to it in the aligned homologous sequences listed below.
To circumvent this problem, we have been working on specific predictors for various non-globular domains, see for example:
“Development and validation of a computational method for assessment of missense variants in hypertrophic cardiomyopathy.” Jordan DM, Kiezun A, Baxter SM, Agarwala V, Green RC, Murray MF, Pugh T, Lebo MS, Rehm HL, Funke BH, Sunyaev SR. Am J Hum Genet. 2011 Feb 11;88(2):183-92.
This project was aimed at coiled coils, which are relatively simple to model. Unfortunately, repeats and compositionally biased regions are much more difficult to deal with.
Another solution involves using UCSC MultiZ46Way exome-based genomic multiple alignments for regions with poor-quality protein sequence alignments. This approach was first introduced in PolyPhen-2 v2.2.2 and works surprisingly well for a wide range of non-globular protein domains.
You can use the curl command-line utility (http://curl.haxx.se/) to script your batch submission. Here is sample code that does this:
#!/bin/sh
curl \
  -F _ggi_project=PPHWeb2 \
  -F _ggi_origin=query \
  -F _ggi_target_pipeline=1 \
  -F MODELNAME=HumDiv \
  -F UCSCDB=hg19 \
  -F SNPFUNC=m \
  -F NOTIFYME=email@example.com \
  -F _ggi_batch_file=@example_batch.txt \
  -D - http://genetics.bwh.harvard.edu/ggi/cgi-bin/ggi2.cgi
@example_batch.txt - Name of a local text file containing your batch query, preceded by an '@' symbol.
Note that the format of the Web batch query changed with the PolyPhen-2 v2.2.2 update and now uses exactly the same protein substitution specification format as the standalone software: one substitution per line, with the first column always listing the protein accession. See examples below:
# example of genomic SNPs query
chr1:1158631 A/C/G/T
chr12:51493554 C/T
chr14:64935148 C/G
All allele nucleotides should be entered on the plus strand.
# example of protein substitutions query
Q5UAW9 212 Q E
Q5UAW9 217 L P
Q5UAW9 222 M T
O95479 453 R Q
NP_005792 59 L P
NP_005792 90 R G
NP_005792 110 V I
Only UniProtKB protein accessions or entry names (e.g. GP157_HUMAN) are fully supported; RefSeq protein identifiers, while supported, may result in sequence errors due to the high level of ambiguity in protein mappings. See Sample Query on the Batch query web page for more examples of the supported query formats.
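Before uploading a batch file, it can help to check that every line resembles one of the two query formats shown above. The following is only a rough sketch: the function name check_batch_query is hypothetical, and the patterns are inferred from the examples above rather than taken from the server's actual parser, so they may be stricter or looser than what the server accepts.

```shell
#!/bin/sh
# Classify each non-comment line of a batch query file as "snp",
# "protein", or "invalid". The patterns are assumptions inferred from
# the example queries, not the server's real validation rules.
check_batch_query() {
    awk '
        # Skip comment lines and blank lines.
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        # Genomic SNP: chrN:position, then slash-separated alleles.
        NF == 2 && $1 ~ "^chr[0-9A-Za-z]+:[0-9]+$" && $2 ~ "^[ACGT](/[ACGT])+$" {
            print NR ": snp"; next
        }
        # Protein substitution: accession, position, two residues.
        NF == 4 && $2 ~ "^[0-9]+$" && $3 ~ "^[A-Z]$" && $4 ~ "^[A-Z]$" {
            print NR ": protein"; next
        }
        # Anything else is reported and makes the function fail.
        { print NR ": invalid"; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

Calling check_batch_query example_batch.txt prints one line per query and returns a non-zero status if any line matched neither pattern.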
The NOTIFYME parameter is optional. If present, a notification will be sent to the e-mail address provided as its value upon batch completion.
Other optional parameters are listed below with their default values (matching Batch query settings) highlighted in bold:
MODELNAME=HumDiv|HumVar - Classifier model used for predictions.
UCSCDB=hg18|hg19 - Genome assembly version used for chromosome coordinates of the SNPs in user input.
SNPFILTER=0|1|3 - Set of transcripts on which genomic SNPs will be mapped (0 - All, 1 - Canonical, 3 - CCDS).
SNPFUNC=c|m - Functional SNP categories to include in genomic SNPs annotation report (pph2-snps.txt file); leaving this parameter out or specifying an empty value (e.g.: SNPFUNC=) will result in annotations reported for all SNP categories, same as selecting the Annotations→All option on the Batch query web page.
For a detailed description of these parameters, please go to the Batch query page and hover the mouse pointer over the corresponding option labels under the Advanced Options section.
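The optional parameters can be wired into a small wrapper so that they are only sent when you set them. This is a minimal sketch under stated assumptions: the function name pph2_submit is hypothetical, the fallback values for MODELNAME and UCSCDB are simply copied from the sample script above and are not confirmed server defaults, and the form fields and URL are the ones from that same script.

```shell
#!/bin/sh
# Submit a batch query file, reading optional settings from shell
# variables. Fallback values mirror the sample script above; they are
# assumptions, not necessarily the server's defaults.
pph2_submit() {
    batch_file=$1
    # Start with the fixed form fields used in the sample script.
    set -- \
        -F _ggi_project=PPHWeb2 \
        -F _ggi_origin=query \
        -F _ggi_target_pipeline=1 \
        -F "MODELNAME=${MODELNAME:-HumDiv}" \
        -F "UCSCDB=${UCSCDB:-hg19}"
    # Append optional fields only when they are set and non-empty.
    if [ -n "${SNPFILTER:-}" ]; then set -- "$@" -F "SNPFILTER=$SNPFILTER"; fi
    if [ -n "${SNPFUNC:-}" ]; then set -- "$@" -F "SNPFUNC=$SNPFUNC"; fi
    if [ -n "${NOTIFYME:-}" ]; then set -- "$@" -F "NOTIFYME=$NOTIFYME"; fi
    # The '@' prefix tells curl to upload the contents of the file.
    set -- "$@" -F "_ggi_batch_file=@$batch_file"
    curl "$@" -D - http://genetics.bwh.harvard.edu/ggi/cgi-bin/ggi2.cgi
}
```

For example, MODELNAME=HumVar SNPFILTER=3 pph2_submit example_batch.txt would submit the file with the HumVar model and CCDS transcripts, leaving the remaining options to the server.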
If you want to submit your own set of protein sequences together with your query, you will need to add the following extra -F option to your curl command line:
This will upload a local file myproteins.fa with your sequences in FASTA format.
The -D option in the curl script example above will dump a complete server response including all HTTP headers. Look for the Set-Cookie: header in the output; it will contain your unique session ID returned in the following string:
This 40-character hash can be used to track your batch query progress and access its results later.
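Pulling the session ID out of the dumped headers can be scripted. The sketch below is an assumption-laden illustration: the function name extract_session_id is hypothetical, it assumes the 40-character hash is hexadecimal, and it deliberately does not depend on the cookie name. It also relies on grep -o, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# Read an HTTP response (headers included) on standard input and print
# the first 40-character hexadecimal string found on a Set-Cookie line.
# Assumes the session ID is a hexadecimal hash.
extract_session_id() {
    tr -d '\r' |                      # strip CR from HTTP line endings
    grep -i '^Set-Cookie:' |          # keep only cookie headers
    grep -Eo '[0-9a-fA-F]{40}' |      # isolate 40 hex characters
    head -n 1                         # first match only
}
```

If the output of the submission script was saved to a file, for instance response.txt, then extract_session_id < response.txt would print the session ID.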
You can also parse the HTML code of the returned page, looking for other useful bits of information, e.g.:
This string will hold grid job ID (number) of the last job in your batch, while a text line like this:
Batch 1: (1/7) Validating input
will contain your batch number, which is always 1 for newly created sessions unless you have reused an existing session during submission. The latter can be achieved by adding the following parameter to the curl command line, e.g.:
To track your query progress, you can poll the server for two semaphore files located inside your session/batch directory:
started.txt - created when batch is dispatched for execution on the compute grid
completed.txt - created when the batch has been fully processed
Both files contain server-generated timestamps in human-readable format.
You can use the following URL to access semaphore files and fetch results of analysis via HTTP protocol:
98ba..782b is your session ID,
1 is your batch number, and
<filename> is one of the following files:
started.txt
completed.txt
pph2-short.txt
pph2-full.txt
pph2-snps.txt
pph2-log.txt
I will leave automation of server polling and results fetching as an exercise for the user. Be warned, however, that polling the server too frequently from the same client may result in its IP address being blocked. Automatic polling should not exceed a rate of once every 60 seconds.
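As a possible starting point for that exercise, here is a rough sketch of a polling loop that respects the 60-second limit. The function names wait_for_batch and fetch_result are hypothetical, and the URL of your session/batch directory has to be supplied by you, following the URL pattern described above. The sketch also assumes that the server responds with an HTTP error status for completed.txt until the batch has finished, which is not confirmed here.

```shell
#!/bin/sh
# Block until completed.txt can be fetched from the batch directory.
#   $1 - URL of your session/batch directory (supplied by the caller)
#   $2 - polling interval in seconds; values below 60 are raised to 60
# Assumes the server returns an HTTP error while the file is missing.
wait_for_batch() {
    batch_url=${1%/}
    interval=${2:-60}
    # Never poll more often than once every 60 seconds.
    if [ "$interval" -lt 60 ]; then interval=60; fi
    until curl -fsS -o /dev/null "$batch_url/completed.txt" 2>/dev/null; do
        sleep "$interval"
    done
}

# Download one result file ($2) from the batch directory ($1) into the
# current directory under the same name.
fetch_result() {
    curl -fsS -o "$2" "${1%/}/$2"
}
```

With the batch directory URL stored in a variable such as BATCH_URL, running wait_for_batch "$BATCH_URL" && fetch_result "$BATCH_URL" pph2-short.txt would wait for completion and then download the short report.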
If you need more flexibility, a bit of Perl/libwww programming should also do the trick, given the details of the web server interface described above.