Frequently Asked Questions

Inconsistent predictions

  • Q: I noticed substantial variance between the predictions and scores produced by the PolyPhen-2 Web server and those from my local PolyPhen-2 installation. How would you explain this?
  • A: The most common source of discrepancies in PolyPhen-2 output is a difference in the version of the non-redundant protein sequence database (UniRef100) used to construct multiple sequence alignments (MSA).

PolyPhen-2 relies heavily on sequence conservation estimates derived from the MSA, and both alignment coverage and quality depend on the set of homologous sequences sourced from the UniRef100 database. The UniProt Consortium normally updates its databases monthly, while the PolyPhen-2 Web service is updated quarterly. If you installed your standalone version of PolyPhen-2 following the instructions included with the software, chances are your local copy of the UniRef100 database (as well as of the other databases) is more recent than the one used by the Web server.

PolyPhen-2 is not alone in suffering from this issue. In fact, a recent publication claims that PolyPhen-2 was the least affected of the four SNP prediction tools tested.

Other factors can also affect the prediction outcome, among them changes in the structural databases (PDB and DSSP), in the classifier models, and in the analysis pipeline itself, but they are less likely to introduce noticeable inconsistencies.

The versions of all databases used for the analysis are displayed on the PolyPhen-2 Web report page and included in all Batch query report files.

UNKNOWN prediction

  • Q: Why am I getting “This mutation is predicted to be UNKNOWN (score is not available)” reports for my protein?
  • A: The message means that the effect of the substitution could not be predicted in this particular case.

Note: The issue was largely addressed in PolyPhen-2 v2.2.2, thanks to the integration of MultiZ genomic multiple alignments. This significantly expanded prediction coverage, especially in non-globular domains. Overall, close to 95% of all sequence positions in known UniProtKB proteins can now be successfully classified. However, you may still encounter UNKNOWN predictions in rare cases.

The most likely reason for such reports is that much of the sequence of your protein of interest is non-alignable due to large stretches of repeats and/or high compositional bias. For known proteins, you can easily check this by browsing the UniProt sequence annotations for your protein:

http://www.uniprot.org/uniprot/<UniProtKB-accession>

Such sequence features often make it impossible to search for homologous sequences, build reliable multiple alignments and, ultimately, infer conservation scores. The issue affects many non-globular proteins, e.g. collagens, matrix proteins, DNA/RNA-binding proteins, muscle proteins and more.

You can see it for yourself: in the PolyPhen-2 Web interface, click on the “+” icon next to the “Multiple sequence alignment” label to inspect the MSA for your protein. You can also start the interactive MSA browser via the link at the bottom of the in-line alignment viewer section (requires the Java browser plug-in).

If this is the case, you will probably notice that the vast majority of the (top) query sequence has only gaps next to it in the aligned homologous sequences listed below.

To circumvent this problem, we have been working on specific predictors for various non-globular domains, see for example:

“Development and validation of a computational method for assessment of missense variants in hypertrophic cardiomyopathy.” Jordan DM, Kiezun A, Baxter SM, Agarwala V, Green RC, Murray MF, Pugh T, Lebo MS, Rehm HL, Funke BH, Sunyaev SR. Am J Hum Genet. 2011 Feb 11;88(2):183-92.

This project was aimed at coiled coils, which are relatively simple to model. Unfortunately, repeats and compositionally biased regions are much more difficult to deal with.

Another solution involves using UCSC MultiZ46Way exome-based genomic multiple alignments for regions where the protein sequence alignments are of poor quality. This approach was first introduced in PolyPhen-2 v2.2.2 and works surprisingly well for a wide range of non-globular protein domains.

Automated batch submission

  • Q: I hate clicking through the browser's input page. How do I automate batch submission to the PolyPhen-2 Batch query web interface?
  • A: One simple solution is to use the curl command-line utility (http://curl.haxx.se/) to script your batch submission. Here is a sample script that does this:
#!/bin/sh
curl \
  -F _ggi_project=PPHWeb2 \
  -F _ggi_origin=query \
  -F _ggi_target_pipeline=1 \
  -F MODELNAME=HumDiv \
  -F UCSCDB=hg19 \
  -F SNPFUNC=m \
  -F NOTIFYME=myemail@myisp.com \
  -F _ggi_batch_file=@example_batch.txt \
  -D - http://genetics.bwh.harvard.edu/cgi-bin/ggi/ggi2.cgi

Parameters explained:

_ggi_batch_file=@example_batch.txt - Name of a local text file containing your batch query, preceded by a '@' symbol.

Note that the format of the web batch query changed with the PolyPhen-2 v2.2.2 update and is now exactly the same as the protein substitution specification format used by the standalone software: one substitution per line, with the first column always listing the protein accession. See examples below:

# example of genomic SNPs query
chr1:1158631 A/C/G/T
chr12:51493554 C/T
chr14:64935148 C/G

Note: All allele nucleotides should be entered on the plus strand.

# example of protein substitutions query
Q5UAW9 212 Q E
Q5UAW9 217 L P
Q5UAW9 222 M T
O95479 453 R Q
NP_005792 59 L P
NP_005792 90 R G
NP_005792 110 V I

Note: Only UniProtKB protein accessions or entry names (e.g. Q5UAW9 or GP157_HUMAN) are fully supported; RefSeq protein identifiers, while supported, may result in sequence errors due to the high level of ambiguity in protein mappings. See Sample Query on the Batch query web page for more examples of the supported query formats.

The NOTIFYME parameter is optional. If present, a notification will be sent to the e-mail address provided as its value upon batch completion.
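
Before uploading, it may help to sanity-check the batch file locally. The following is a minimal sketch, not part of PolyPhen-2: the function name check_batch is hypothetical, and the patterns are inferred only from the two sample queries shown above (a chrN:position token followed by slash-separated alleles, or accession, position and two one-letter residue codes). Other supported formats exist, so a flagged line is only a hint that something may be wrong.

```shell
# Hypothetical helper: report lines that match neither sample format above.
# Returns 0 if every line is recognized, 1 otherwise.
check_batch() {
  awk '
    NF == 0 || /^#/ { next }    # skip blank lines and comments
    # genomic SNP, e.g. "chr1:1158631 A/C/G/T" (pattern is an assumption)
    NF == 2 && $1 ~ /^chr[0-9A-Za-z_]+:[0-9]+$/ && $2 ~ "^[ACGT](/[ACGT])+$" { next }
    # protein substitution, e.g. "Q5UAW9 212 Q E" (one-letter codes assumed)
    NF == 4 && $2 ~ /^[0-9]+$/ && $3 ~ /^[A-Z]$/ && $4 ~ /^[A-Z]$/ { next }
    { printf "line %d: unrecognized: %s\n", NR, $0; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

For example, check_batch example_batch.txt prints nothing and returns success when all lines look like one of the two sample formats.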

Other optional parameters are listed below with their default values (matching Batch query settings) highlighted in bold:

MODELNAME=HumDiv|HumVar - Classifier model used for predictions.

UCSCDB=hg18|hg19 - Genome assembly version used for chromosome coordinates of the SNPs in user input.

SNPFILTER=0|1|3 - Set of transcripts on which genomic SNPs will be mapped (0 - All, 1 - Canonical, 3 - CCDS).

SNPFUNC=c|m - Functional SNP categories to include in the genomic SNPs annotation report (pph2-snps.txt file); leaving this parameter out or specifying an empty value (e.g. SNPFUNC=) will result in annotations being reported for all SNP categories, same as selecting the Annotations→All option on the Batch query web page.

For a detailed description of these parameters, go to the Batch query page and hover the mouse pointer over the corresponding option labels under the Advanced Options section.

If you want to submit your own set of protein sequences together with your query you will need to add the following extra -F option to your curl command line:

-F uploaded_sequences_1=@myproteins.fa

This will upload a local file myproteins.fa with your sequences in FASTA format.

Note that the -D option in the curl script example above will dump the complete server response, including all HTTP headers. Look for the Set-Cookie: header in the output; it will contain your unique session ID in a string like the following:

Set-Cookie: polyphenweb2=98ba900751d509ce6dc262c078f37c023395782b;

This 40-character hash can be used to track your batch query progress and access its results later.
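
As a rough illustration, the session ID could be pulled out of the dumped response with standard text tools. This is only a sketch: the function name extract_sid is hypothetical, and it assumes the cookie appears exactly in the polyphenweb2=<40 hex characters> form shown above.

```shell
# Hypothetical helper: read a server response (headers included) on stdin
# and print the 40-character session ID from the polyphenweb2 cookie.
extract_sid() {
  tr -d '\r' |    # HTTP header lines end in CR LF; drop the CR
    sed -n 's/^[Ss]et-[Cc]ookie: *polyphenweb2=\([0-9a-f]\{40\}\).*/\1/p' |
    head -n 1     # keep only the first match
}
```

If the curl script above were saved as submit.sh (a name chosen here for illustration), sid=$(sh submit.sh | extract_sid) would capture the ID in a shell variable.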

You can also parse the HTML code of the returned page for other useful bits of information, e.g.:

name="lastJobSubmitted" value="42145"

This string holds the grid job ID (number) of the last job in your batch, while a text line like this:

Batch 1: (1/7) Validating input

will contain your batch number, which is always 1 for newly-created sessions unless you have reused an existing session during submission. The latter can be achieved by adding the following parameter to the curl command line, e.g.:

-F sid=98ba900751d509ce6dc262c078f37c023395782b
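
The job ID and batch number described above can likewise be extracted from a saved copy of the returned page. The sketch below assumes the page contains the two strings exactly as quoted in this answer; the function names are hypothetical and the real markup around those strings may differ.

```shell
# Hypothetical helpers: both read the returned HTML page on stdin.

# Print the grid job ID from: name="lastJobSubmitted" value="42145"
extract_job_id() {
  sed -n 's/.*name="lastJobSubmitted" value="\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Print the batch number from a line like: Batch 1: (1/7) Validating input
extract_batch_no() {
  sed -n 's/.*Batch \([0-9][0-9]*\): .*/\1/p' | head -n 1
}
```

For instance, with the server response saved to a file named response.html (an illustrative name), batch=$(extract_batch_no < response.html) would store the batch number.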

To track your query progress, you can poll the server for two semaphore files located inside your session/batch directory:

started.txt - created when batch is dispatched for execution on the compute grid

completed.txt - created when the batch has been fully processed

Both files contain server-generated timestamps in human-readable format.

You can use the following URL to access semaphore files and fetch results of analysis via HTTP protocol:

http://genetics.bwh.harvard.edu/ggi/pph2/98ba900751d509ce6dc262c078f37c023395782b/1/<filename>

Where 98ba..782b is your session ID, 1 is your batch number, and <filename> is one of the following files:

started.txt
completed.txt
pph2-short.txt
pph2-full.txt
pph2-snps.txt
pph2-log.txt
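
A small helper can assemble these URLs from the three components. This is a sketch under the URL layout given above; the names PPH2_BASE, pph2_url and pph2_fetch are hypothetical.

```shell
# Base URL taken from the example above.
PPH2_BASE=http://genetics.bwh.harvard.edu/ggi/pph2

# Print the URL of <filename> for a given session ID and batch number.
pph2_url() {
  printf '%s/%s/%s/%s\n' "$PPH2_BASE" "$1" "$2" "$3"
}

# Download <filename> into the current directory.
# curl options: -f fail on HTTP errors, -s silent, -o output file name.
pph2_fetch() {
  curl -f -s -o "$3" "$(pph2_url "$1" "$2" "$3")"
}
```

For example, pph2_fetch 98ba900751d509ce6dc262c078f37c023395782b 1 pph2-short.txt would save the short report locally once the batch has completed.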

Automating server polling and result fetching is left as an exercise for the user. Be warned, however, that polling the server too frequently from the same client may result in its IP address being blocked. Automatic polling should not exceed a rate of once every 60 seconds.

If you need more flexibility, a bit of Perl/libwww programming should also do the trick, given the details of the web server interface described above.
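
As a starting point for that exercise, here is a minimal polling sketch. It is not an official client: the function name wait_for_batch is hypothetical, and it assumes the server answers with an HTTP error status while completed.txt does not yet exist, so that curl -f fails until the batch has finished. The interval is clamped so that the server is never polled more often than once per 60 seconds.

```shell
# Hypothetical helper: block until completed.txt exists for the given
# session ID ($1) and batch number ($2). Optional $3 is the polling
# interval in seconds; values below 60 are raised to 60.
wait_for_batch() {
  _sid=$1
  _batch=$2
  _interval=${3:-60}
  if [ "$_interval" -lt 60 ]; then
    _interval=60    # respect the once-per-60-seconds limit
  fi
  _url="http://genetics.bwh.harvard.edu/ggi/pph2/$_sid/$_batch/completed.txt"
  # Assumption: a missing file yields an HTTP error, which -f turns into
  # a non-zero exit status.
  until curl -f -s -o /dev/null "$_url"; do
    sleep "$_interval"
  done
}
```

A call such as wait_for_batch "$sid" 1 returns once the semaphore file appears, after which the report files can be fetched from the same directory. The loop has no upper limit on waiting time; adding one would be a sensible refinement.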

Last modified: 2012/05/30 19:09
   
 