Dataset:
PolyPhen-2 annotations for whole human exome sequence space (WHESS)
Source databases:
UCSC GRCh37/hg19 knownGene annotations (08-Oct-2009)
MultiZ46Way multiple alignments of 45 vertebrate genomes with hg19/GRCh37 human genome (08-Oct-2009)
UniProtKB/Swiss-Prot/UniRef100 Release 2011_12 (14-Dec-2011)
Software:
PolyPhen-2 v2.2.2r395
Description:
The dataset consists of PolyPhen-2 annotations for 149,948,690
putative single-nucleotide non-synonymous (missense) codon changes
enumerated for each CDS codon position in the exons of 43,043 UCSC
knownGene transcripts (hg19). PolyPhen-2 predictions were calculated
for all resulting amino acid residue substitutions in the matching
human UniProtKB protein sequences, using both HumDiv and HumVar
models.
Nucleotide sequence coverage:
----------------------------------------------------------------------
* Transcripts             :  72.7 %  :  43,043 of all 59,182
                          :          :  protein-coding transcripts
                          :          :
* Sequence space coverage :  98.4 %  :  149,948,690 of all 152,409,179
  within the selected     :          :  possible single-nucleotide
  transcripts             :          :  missense codon changes
Protein sequence coverage:
----------------------------------------------------------------------
* Proteins                :  92.5 %  :  18,737 of all 20,248
                          :          :  Swiss-Prot proteins
                          :          :
* Sequence sites coverage :  93.2 %  :  10,531,022 of all 11,294,211
  within the selected     :          :  sequence positions in
  proteins                :          :  Swiss-Prot proteins
Files:
========================================
polyphen-2.2.2-whess-2011_12.tab.tar.bz2
File size: 3.4GB
Uncompressed: 83GB
========================================
This archive contains tab-delimited text files with the PolyPhen-2
annotations and predictions, one pair per UCSC gene transcript.
"uc*.features.tab" are master files containing the annotations used by
PolyPhen-2 for mutation classification. The format combines the
report files described on the PolyPhen-2 Wiki: columns 1-15 loosely
follow the MapSNPs format, see:
http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_b
The order of columns is slightly different but column names are the
same and should be self-explanatory.
Columns 16-55 are similar to the full PolyPhen-2 report described here
(from column #5 onwards):
http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a
Again, column names are the same as those used in the PolyPhen-2 report.
IMPORTANT! Ignore the contents of the "prediction" column in
"features.tab" files; these are obsolete decision-tree-based outcomes
provided for compatibility only. See the "scores.tab" files below for
PolyPhen-2 probabilistic scores and prediction outcomes.
"uc*.scores.tab" files contain HumDiv- and HumVar-based predictions
and probabilistic scores for each missense allele. Column names are
self-explanatory and similar to those described in the PolyPhen-2 Wiki
Appendix A article (columns #16-20):
http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a
The only difference is the "hdiv_" and "hvar_" prefixes added to the
names of columns holding values derived from the HumDiv and HumVar
models, respectively.
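Because the column order differs slightly from the Wiki description, it can be safer to locate a column by its name than by its number. The following is a minimal sketch, not part of the PolyPhen-2 distribution. It assumes the first line of each tab file is a header row holding the column names, possibly prefixed with "#" or padded with spaces. The function name filter_by_column is hypothetical; "hdiv_prob" is one of the prefixed score columns.

```shell
# filter_by_column FILE COLUMN MIN
# Print the header line plus every row whose COLUMN value is non-empty
# and numerically >= MIN. The column is found by name in the header row
# (assumption: line 1 of the file is a header).
filter_by_column() {
    awk -F'\t' -v col="$2" -v min="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                name = $i
                sub(/^#/, "", name)         # tolerate a leading "#" on the header
                gsub(/^ +| +$/, "", name)   # tolerate space-padded column names
                if (name == col) idx = i
            }
            if (!idx) { print "no such column: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Empty fields mean "no value", so skip them instead of treating them as 0.
        $idx != "" && $idx + 0 >= min + 0
    ' "$1"
}
```

For example, "filter_by_column <ucname>.scores.tab hdiv_prob 0.9" would keep only the alleles with a HumDiv probability of at least 0.9.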
Note that, contrary to what the Wiki documentation says, the tab files
do not use the "?" symbol as a missing-value indicator. Fields without
values are left empty, so you should always specify the tab character
as the field separator to load the tables properly in your software
(values in some columns may contain embedded spaces).
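The effect of the field separator can be seen on a small made-up sample. The sketch below is illustrative only: the three-column layout and the helper names third_by_whitespace and third_by_tab are hypothetical and do not match the real 55-column files.

```shell
# Build a tiny hypothetical sample (not real WHESS data): the second data row
# has an empty score, and the last column contains an embedded space.
sample=$(mktemp)
printf 'pos\tscore\tprediction\n'     >  "$sample"
printf '6\t1.0\tprobably damaging\n'  >> "$sample"
printf '17\t\tpossibly damaging\n'    >> "$sample"

# Default whitespace splitting: empty fields vanish and values break at spaces,
# so field 3 is no longer the whole prediction.
third_by_whitespace() { awk 'NR > 1 { print $3 }' "$1"; }

# Explicit tab separator: every field stays in its own column.
third_by_tab() { awk -F'\t' 'NR > 1 { print $3 }' "$1"; }

third_by_whitespace "$sample"   # prints "probably", then "damaging"
third_by_tab "$sample"          # prints "probably damaging", then "possibly damaging"
```

The same applies to other tools: set the delimiter to a tab explicitly instead of relying on whitespace splitting.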
The two tab files in each matching pair have exactly the same number
of rows, sorted identically, so you can merge each pair line by line,
e.g. (Linux command-line example):
$ paste <ucname>.features.tab <ucname>.scores.tab
If you are only interested in mutation specifications, predictions
and scores, the columns you might want to merge are: #1-10,16-20
from "features" + #1-15 from "scores", e.g.:
$ paste <ucname>.features.tab <ucname>.scores.tab | cut -d$'\t' -f1-10,16-20,56-
To include only prediction outcomes and probabilistic scores for
both models:
$ paste <ucname>.features.tab <ucname>.scores.tab | cut -d$'\t' -f1-10,16-20,56,58,62,64
See below for the description of a more flexible SQL interface to
the data.
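Since paste merges files purely by line position, a truncated or partially extracted file would silently misalign the rows. A cautious variant of the merge commands above could check the line counts first. This is a sketch only; the function name merge_pair is hypothetical.

```shell
# merge_pair BASENAME
# Paste BASENAME.features.tab and BASENAME.scores.tab side by side,
# refusing to proceed if the two files differ in line count.
merge_pair() {
    features="$1.features.tab"
    scores="$1.scores.tab"
    # tr strips the padding some wc implementations add around the count.
    n_features=$(wc -l < "$features" | tr -d ' ')
    n_scores=$(wc -l < "$scores" | tr -d ' ')
    if [ "$n_features" -ne "$n_scores" ]; then
        echo "row count mismatch: $features has $n_features lines, $scores has $n_scores" >&2
        return 1
    fi
    paste "$features" "$scores"
}
```

It can be combined with cut as before, e.g. "merge_pair <ucname> | cut -f1-10,16-20,56-". The -d option can be omitted here because cut uses the tab character as its default delimiter.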
========================================
polyphen-2.2.2-whess-2011_12.sqlite.bz2
File size: 6.8GB
Uncompressed: 66GB
========================================
This is the complete set of annotations loaded into a database in
SQLite v3 format. The database includes two tables, "features" and
"scores", which correspond to the two tab-delimited annotation files
described above. The table schemas follow those of the tab files,
with an extra "id" column added: a unique row number that can be used
for joining table rows in SQL SELECT statements.
Manipulating columns and selecting values in SQL is easy, for example:
$ sqlite3 -header -column polyphen2.sqlite
SQLite version 3.7.6.2
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> SELECT chrom||':'||chrpos AS chrpos,refa,txname||strand AS txname,gene,nt1,nt2,acc,pos,aa1,aa2,hdiv_prediction,hdiv_prob,hvar_prediction,hvar_prob
...> FROM features JOIN scores USING(id)
...> WHERE gene='MAP2K1' AND hdiv_prediction LIKE '%damaging%' ORDER BY hdiv_prob DESC LIMIT 10;
chrpos          refa        txname       gene        nt1         nt2         acc         pos         aa1         aa2         hdiv_prediction    hdiv_prob   hvar_prediction    hvar_prob
--------------  ----------  -----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  -----------------  ----------  -----------------  ----------
chr15:66679702  CA          uc010bhq.2+  MAP2K1      C           A           Q02750      6           P           Q           probably damaging  1.0         probably damaging  0.981
chr15:66679702  CG          uc010bhq.2+  MAP2K1      C           G           Q02750      6           P           R           probably damaging  1.0         probably damaging  0.981
chr15:66679734  GT          uc010bhq.2+  MAP2K1      G           T           Q02750      17          G           C           probably damaging  1.0         probably damaging  0.937
chr15:66727411  GC          uc010bhq.2+  MAP2K1      G           C           Q02750      43          D           H           probably damaging  1.0         probably damaging  0.922
chr15:66727429  CT          uc010bhq.2+  MAP2K1      C           T           Q02750      49          R           C           probably damaging  1.0         probably damaging  0.997
chr15:66727430  GA          uc010bhq.2+  MAP2K1      G           A           Q02750      49          R           H           probably damaging  1.0         probably damaging  0.996
chr15:66727430  GC          uc010bhq.2+  MAP2K1      G           C           Q02750      49          R           P           probably damaging  1.0         probably damaging  0.998
chr15:66727442  TG          uc010bhq.2+  MAP2K1      T           G           Q02750      53          F           C           probably damaging  1.0         probably damaging  0.988
chr15:66727445  TA          uc010bhq.2+  MAP2K1      T           A           Q02750      54          L           H           probably damaging  1.0         probably damaging  0.976
chr15:66727453  AC          uc010bhq.2+  MAP2K1      A           C           Q02750      57          K           Q           probably damaging  1.0         probably damaging  0.994
sqlite> .q
Released:
08-Mar-2012
Contacts:
Ivan Adzhubey <ivan_adzhubey@hms.harvard.edu>
Shamil Sunyaev <ssunyaev@hms.harvard.edu>