Dataset:
PolyPhen-2 annotations for whole human RefSeq sequence space
(WHRESS)
Source databases:
UCSC GRCh37/hg19 knownGene annotations (08-Oct-2009)
MultiZ46Way multiple alignments of 45 vertebrate genomes with hg19/GRCh37 human genome (08-Oct-2009)
UniProtKB/Swiss-Prot/UniRef100 Release 2011_12 (14-Dec-2011)
NCBI Homo sapiens Annotation Release 104 (02-Nov-2012):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz
Software:
PolyPhen-2 v2.2.2r398
Description:
The dataset consists of PolyPhen-2 annotations for 157,907,066
amino acid residue substitutions encoded by a putative single-nucleotide
codon change, enumerated for each sequence position in the 35,586
NCBI RefSeq proteins. The database format and contents are similar to
those of the previously released Whole Human Exome Sequence Space (WHESS)
database.
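As an illustration of what a "single-nucleotide codon change" can produce, the
following minimal Python sketch enumerates amino acid substitutions using the
standard genetic code. It is not the PolyPhen-2 enumeration code. The function
names are hypothetical, and skipping synonymous and stop-codon changes is an
assumption made for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_substitutions(codon):
    """Residues reachable from one specific codon by a single nucleotide change.

    Synonymous changes and stop codons are skipped (assumption for this sketch).
    """
    ref_aa = CODON_TABLE[codon]
    found = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            alt_aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if alt_aa not in (ref_aa, "*"):
                found.add(alt_aa)
    return found


def residue_substitutions(aa):
    """Union of codon_substitutions() over every codon that encodes `aa`.

    This is the protein-only view: the actual codon is not known.
    """
    found = set()
    for codon, encoded in CODON_TABLE.items():
        if encoded == aa:
            found |= codon_substitutions(codon)
    return found


print(sorted(residue_substitutions("P")))  # any proline codon
print(sorted(codon_substitutions("CCA")))  # one particular proline codon
```

The second set is a subset of the first: a substitution that is possible for
some codon of a residue may not be reachable from the codon actually present
in a given transcript.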
File:
========================================
polyphen-2.2.2-whress-2012_11.sqlite.bz2
File size: 8.2 GB
Uncompressed: 73.0 GB
========================================
This is a complete set of annotations loaded into a database in SQLite v3 format.
The database schema is identical to the one used for the WHESS database.
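The compressed file can be unpacked with any bzip2 tool. The sketch below
shows one way to do it from Python using only the standard library, streaming
the data so that the file is never held in memory. The function name and the
chunk size are illustrative assumptions. Going by the sizes listed above, the
target volume needs room for roughly 73 GB.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress src_path (.bz2) into dst_path in fixed-size chunks."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call (paths are relative to the current directory):
# decompress_bz2("polyphen-2.2.2-whress-2012_11.sqlite.bz2",
#                "polyphen-2.2.2-whress-2012_11.sqlite")
```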
SQLite usage example:
$ sqlite3 -header -column polyphen-2.2.2-whress-2012_11.sqlite
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> SELECT chrom||':'||chrpos AS chrpos,refa,txname||strand AS txname,gene,nt1,nt2,refs_acc,cdnpos,aa1,aa2,hdiv_prediction,hdiv_prob,hvar_prediction,hvar_prob
...> FROM features JOIN scores USING(id)
...> WHERE gene='MAP2K1' AND hdiv_prediction LIKE '%damaging%' ORDER BY hdiv_prob DESC LIMIT 10;
chrpos      refa        txname       gene        nt1         nt2         refs_acc     cdnpos      aa1         aa2         hdiv_prediction    hdiv_prob   hvar_prediction    hvar_prob
----------  ----------  -----------  ----------  ----------  ----------  -----------  ----------  ----------  ----------  -----------------  ----------  -----------------  ----------
                        uc010bhq.2+  MAP2K1                              NP_002746.1  6           P           H           probably damaging  1.0         probably damaging  0.972
chr15:6667  CA          uc010bhq.2+  MAP2K1      C           A           NP_002746.1  6           P           Q           probably damaging  1.0         probably damaging  0.979
chr15:6667  CG          uc010bhq.2+  MAP2K1      C           G           NP_002746.1  6           P           R           probably damaging  1.0         probably damaging  0.979
                        uc010bhq.2+  MAP2K1                              NP_002746.1  17          G           W           probably damaging  1.0         probably damaging  0.957
chr15:6672  CT          uc010bhq.2+  MAP2K1      C           T           NP_002746.1  49          R           C           probably damaging  1.0         probably damaging  0.997
chr15:6672  GA          uc010bhq.2+  MAP2K1      G           A           NP_002746.1  49          R           H           probably damaging  1.0         probably damaging  0.996
                        uc010bhq.2+  MAP2K1                              NP_002746.1  49          R           M           probably damaging  1.0         probably damaging  0.997
chr15:6672  GC          uc010bhq.2+  MAP2K1      G           C           NP_002746.1  49          R           P           probably damaging  1.0         probably damaging  0.998
                        uc010bhq.2+  MAP2K1                              NP_002746.1  49          R           W           probably damaging  1.0         probably damaging  0.997
chr15:6672  TG          uc010bhq.2+  MAP2K1      T           G           NP_002746.1  53          F           C           probably damaging  1.0         probably damaging  0.982
sqlite> .q
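The same query can also be run from a script. The following is a minimal
sketch using Python's standard sqlite3 module. The function name top_damaging
is hypothetical, and the table and column names are taken from the query
above rather than from a published schema.

```python
import sqlite3

# Same statement as in the shell session, with the gene and row limit bound
# as parameters instead of being written into the SQL text.
QUERY = """
SELECT chrom || ':' || chrpos AS chrpos, refa, txname || strand AS txname,
       gene, nt1, nt2, refs_acc, cdnpos, aa1, aa2,
       hdiv_prediction, hdiv_prob, hvar_prediction, hvar_prob
FROM features JOIN scores USING(id)
WHERE gene = ? AND hdiv_prediction LIKE '%damaging%'
ORDER BY hdiv_prob DESC
LIMIT ?
"""


def top_damaging(conn, gene, limit=10):
    """Return up to `limit` damaging substitutions for `gene` as dicts."""
    cur = conn.execute(QUERY, (gene, limit))
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur]


# Example call against the uncompressed database, opened read-only:
# conn = sqlite3.connect(
#     "file:polyphen-2.2.2-whress-2012_11.sqlite?mode=ro", uri=True)
# for row in top_damaging(conn, "MAP2K1"):
#     print(row["refs_acc"], row["cdnpos"], row["aa1"], row["aa2"],
#           row["hdiv_prob"])
```

Rows whose "chrom" value is NULL come back with "chrpos" set to None, because
concatenating with NULL yields NULL in SQLite.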
Notes:
1) Missing (NULL) values in the "chrom", "chrpos", "refa", "nt1" and "nt2" columns indicate
substitutions that cannot result from a single-nucleotide change in the context of
the corresponding transcript's nucleotide sequence. They are included because
substitutions are initially enumerated from the protein sequences alone, without
taking the transcript nucleotide sequences into consideration.
2) Original RefSeq identifiers are stored in the "refs_acc" column; original RefSeq
sequence positions can be found in the "cdnpos" column. All RefSeq identifiers
include unique version numbers. Use the SQL LIKE operator to ignore version
numbers in a search, e.g.: SELECT ... WHERE refs_acc LIKE 'NP_002746.%'
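Both notes can be combined in a single lookup. The sketch below, again using
Python's sqlite3 module, matches a RefSeq accession regardless of its version
and can optionally skip rows with no genomic coordinates. The function name
and its genomic_only parameter are hypothetical. The "_" character is itself
a single-character wildcard in LIKE, so it is escaped here to keep the match
exact; the unescaped pattern from note 2 is slightly looser.

```python
import sqlite3


def find_by_accession(conn, accession, genomic_only=True):
    """Look up substitutions for a RefSeq accession given without version.

    accession    -- e.g. 'NP_002746' (no '.N' suffix)
    genomic_only -- if True, drop rows whose chrom is NULL (see note 1)
    """
    # Escape '_' so LIKE treats it literally, then allow any version suffix.
    pattern = accession.replace("_", "\\_") + ".%"
    sql = (
        "SELECT refs_acc, cdnpos, aa1, aa2, chrom, chrpos, nt1, nt2 "
        "FROM features JOIN scores USING(id) "
        "WHERE refs_acc LIKE ? ESCAPE '\\' "
    )
    if genomic_only:
        sql += "AND chrom IS NOT NULL "
    sql += "ORDER BY refs_acc, cdnpos"
    return conn.execute(sql, (pattern,)).fetchall()


# Example call:
# rows = find_by_accession(conn, "NP_002746")
```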
Released:
16-Apr-2013
Contacts:
Ivan Adzhubey <ivan_adzhubey@hms.harvard.edu>
Shamil Sunyaev <ssunyaev@hms.harvard.edu>