Dataset:
  PolyPhen-2 annotations for whole human exome sequence space
  (WHESS)
Source databases:
  UCSC GRCh37/hg19 knownGene annotations (08-Oct-2009)
  MultiZ46Way multiple alignments of 45 vertebrate genomes with hg19/GRCh37 human genome (08-Oct-2009)
  UniProtKB/Swiss-Prot/UniRef100 Release 2011_12 (14-Dec-2011)
Software:
  PolyPhen-2 v2.2.2r395
Description:
  The dataset consists of PolyPhen-2 annotations for 149,948,690
  putative single-nucleotide non-synonymous (missense) codon changes
  enumerated for each CDS codon position in the exons of 43,043 UCSC
  knownGene transcripts (hg19). PolyPhen-2 predictions were calculated
  for all resulting amino acid residue substitutions in the matching
  human UniProtKB protein sequences, using both HumDiv and HumVar
  models.
Nucleotide sequence coverage:
----------------------------------------------------------------------
* Transcripts               : 72.7 %  : 43,043 of all 59,182
                            :         : protein-coding transcripts
                            :         :
* Sequence space coverage   : 98.4 %  : 149,948,690 of all 152,409,179
  within the selected       :         : possible single-nucleotide
  transcripts               :         : missense codon changes
Protein sequence coverage:
----------------------------------------------------------------------
* Proteins                  : 92.5 %  : 18,737 of all 20,248
                            :         : Swiss-Prot proteins
                            :         :
* Sequence sites coverage   : 93.2 %  : 10,531,022 of all 11,294,211
  within the selected       :         : sequence positions in
  proteins                  :         : Swiss-Prot proteins
Files:
  ========================================
  polyphen-2.2.2-whess-2011_12.tab.tar.bz2
  File size:    3.4GB
  Uncompressed:  83GB
  ========================================
  This archive contains tab-delimited text files with the PolyPhen-2
  annotations and predictions, one pair of files per UCSC gene
  transcript.
  "uc*.features.tab" are master files containing annotations used by
  PolyPhen-2 for mutation classification. The format combines the
  report files described on the PolyPhen-2 Wiki: columns 1-15 loosely
  follow the MapSNPs format, see:
    http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_b
  The order of columns is slightly different, but the column names are
  the same and should be self-explanatory.
  Columns 16-55 are similar to the full PolyPhen-2 report described
  here (from column #5 onwards):
    http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a
  Again, the column names are the same as those used in the PolyPhen-2
  report.
  IMPORTANT! - Ignore the contents of the "prediction" column in
  "features.tab" files; these are obsolete decision-tree-based outcomes
  provided for compatibility only. See the "scores.tab" files below for
  PolyPhen-2 probabilistic scores and prediction outcomes.
  "uc*.scores.tab" files contain HumDiv and HumVar-based predictions
  and probabilistic scores for each missense allele. Column names are
  self-explanatory and similar to those described in the PolyPhen-2
  Wiki Appendix A article (columns #16-20):
    http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a
  The only difference is the "hdiv_" and "hvar_" prefixes added to
  column names for values derived from the HumDiv and HumVar models,
  respectively.
  Note that, contrary to what the Wiki documentation says, the tab
  files do not use the "?" symbol as a missing-value indicator. Fields
  without values are left empty, so you should always specify the tab
  character as the field separator in order to load the tables
  properly in your software: splitting on generic whitespace would
  collapse the empty fields, and values in some of the columns may
  contain embedded spaces.
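  As a minimal sketch of why the separator matters, the shell function
  below reports the distinct per-line field counts of a file when it is
  split strictly on tabs. The name "count_fields" is our own and is not
  part of the dataset. A well-formed table should print a single
  number.

```shell
# Print the distinct field counts found in a tab-delimited file.
# Hypothetical helper: awk -F'\t' splits on single tabs only, so empty
# fields and values with embedded spaces each count as one field.
count_fields() {
  awk -F'\t' '{ print NF }' "$1" | sort -n | uniq
}
```

  Usage: count_fields <ucname>.scores.tab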
  Each matching pair of tab files has exactly the same number of rows,
  sorted identically, so you can merge each pair line by line, e.g.
  (Linux command-line example):
    $ paste <ucname>.features.tab <ucname>.scores.tab
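  Because paste does not complain when its inputs differ in length, it
  may be worth checking the line counts first, for instance after an
  interrupted download or extraction. The following is only a sketch;
  "merge_pair" is a hypothetical helper name and its argument is the
  path prefix of one transcript's file pair.

```shell
# Merge a features/scores pair only if both files have the same line count.
# Hypothetical helper: $1 is the path prefix shared by the two files.
merge_pair() {
  features="$1.features.tab"
  scores="$1.scores.tab"
  # $(( ... )) normalizes the padded output some wc implementations produce.
  n_features=$(( $(wc -l < "$features") ))
  n_scores=$(( $(wc -l < "$scores") ))
  if [ "$n_features" -ne "$n_scores" ]; then
    echo "row count mismatch: $features ($n_features) vs $scores ($n_scores)" >&2
    return 1
  fi
  paste "$features" "$scores"
}
```

  Usage: merge_pair <ucname> > <ucname>.merged.tab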
  If you are only interested in mutation specifications, predictions
  and scores, the columns you might want to merge are: #1-10,16-20
  from "features" + #1-15 from "scores", e.g.:
    $ paste <ucname>.features.tab <ucname>.scores.tab | cut -d$'\t' -f1-10,16-20,56-
  To include only prediction outcomes and probabilistic scores for
  both models:
    $ paste <ucname>.features.tab <ucname>.scores.tab | cut -d$'\t' -f1-10,16-20,56,58,62,64
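  Positional column lists are easy to get wrong, so selecting columns
  by name can be a safer alternative. The sketch below assumes that
  the first line of each file is a header row holding the column
  names, possibly prefixed with "#" and padded with spaces; verify
  this against your copy of the files before relying on it. The name
  "select_cols" is hypothetical.

```shell
# Print selected columns of a tab-delimited stream, chosen by header name.
# Assumption: the first input line is a header row; a leading "#" and
# padding spaces around header names are stripped before matching.
# Usage: select_cols "name1,name2,..." < merged.tab
select_cols() {
  awk -F'\t' -v OFS='\t' -v want="$1" '
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) {
        h = $i
        sub(/^#/, "", h)
        gsub(/^ +| +$/, "", h)
        if (!(h in idx)) idx[h] = i   # first occurrence wins on duplicates
      }
      for (j = 1; j <= n; j++) {
        if (!(names[j] in idx)) {
          print "unknown column: " names[j] > "/dev/stderr"
          exit 1
        }
      }
    }
    {
      # Emit the requested columns in the requested order (header included).
      line = $(idx[names[1]])
      for (j = 2; j <= n; j++) line = line OFS $(idx[names[j]])
      print line
    }
  '
}
```

  Usage, with column names taken from the SQL example in this file:
    $ paste <ucname>.features.tab <ucname>.scores.tab | select_cols "acc,pos,aa1,aa2,hdiv_prediction,hdiv_prob"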
  See below for the description of a more flexible SQL interface to
  the data.
  ========================================
  polyphen-2.2.2-whess-2011_12.sqlite.bz2
  File size:    6.8GB
  Uncompressed: 66GB
  ========================================
  This is a complete set of annotations loaded into a database in
  SQLite v3 format. The database includes two tables, "features" and
  "scores", which correspond to the two types of tab-delimited
  annotation files described above. The table schemas follow those of
  the tab files, with an extra "id" column added: a unique row number
  that can be used for joining table rows in SQL SELECT statements.
  Manipulating columns and selecting values in SQL is easy, for example:
  $ sqlite3 -header -column polyphen2.sqlite
  SQLite version 3.7.6.2
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite> SELECT chrom||':'||chrpos AS chrpos,refa,txname||strand AS txname,gene,nt1,nt2,acc,pos,aa1,aa2,hdiv_prediction,hdiv_prob,hvar_prediction,hvar_prob
    ...> FROM features JOIN scores USING(id)
    ...> WHERE gene='MAP2K1' AND hdiv_prediction LIKE '%damaging%' ORDER BY hdiv_prob DESC LIMIT 10;
  chrpos          refa        txname       gene        nt1         nt2         acc         pos         aa1         aa2         hdiv_prediction    hdiv_prob   hvar_prediction    hvar_prob
  --------------  ----------  -----------  ----------  ----------  ----------  ----------  ----------  ----------  ----------  -----------------  ----------  -----------------  ----------
  chr15:66679702  CA          uc010bhq.2+  MAP2K1      C           A           Q02750      6           P           Q           probably damaging  1.0         probably damaging  0.981
  chr15:66679702  CG          uc010bhq.2+  MAP2K1      C           G           Q02750      6           P           R           probably damaging  1.0         probably damaging  0.981
  chr15:66679734  GT          uc010bhq.2+  MAP2K1      G           T           Q02750      17          G           C           probably damaging  1.0         probably damaging  0.937
  chr15:66727411  GC          uc010bhq.2+  MAP2K1      G           C           Q02750      43          D           H           probably damaging  1.0         probably damaging  0.922
  chr15:66727429  CT          uc010bhq.2+  MAP2K1      C           T           Q02750      49          R           C           probably damaging  1.0         probably damaging  0.997
  chr15:66727430  GA          uc010bhq.2+  MAP2K1      G           A           Q02750      49          R           H           probably damaging  1.0         probably damaging  0.996
  chr15:66727430  GC          uc010bhq.2+  MAP2K1      G           C           Q02750      49          R           P           probably damaging  1.0         probably damaging  0.998
  chr15:66727442  TG          uc010bhq.2+  MAP2K1      T           G           Q02750      53          F           C           probably damaging  1.0         probably damaging  0.988
  chr15:66727445  TA          uc010bhq.2+  MAP2K1      T           A           Q02750      54          L           H           probably damaging  1.0         probably damaging  0.976
  chr15:66727453  AC          uc010bhq.2+  MAP2K1      A           C           Q02750      57          K           Q           probably damaging  1.0         probably damaging  0.994
  sqlite> .q
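  The same kind of query can also be run non-interactively, for
  example to export results as tab-separated text for other tools.
  This is a sketch only: "export_gene" is a hypothetical helper, the
  table and column names are the ones used in the session above, and
  the gene symbol is pasted directly into the SQL text, so pass only
  trusted values.

```shell
# Export predictions for one gene as tab-separated text with a header row.
# Hypothetical helper: $1 = path to the SQLite database, $2 = gene symbol.
# The gene symbol is interpolated into the SQL; do not pass untrusted input.
export_gene() {
  sqlite3 -header -separator "$(printf '\t')" "$1" "
    SELECT chrom, chrpos, nt1, nt2, acc, pos, aa1, aa2,
           hdiv_prediction, hdiv_prob, hvar_prediction, hvar_prob
    FROM features JOIN scores USING(id)
    WHERE gene = '$2'
    ORDER BY hdiv_prob DESC;"
}
```

  Usage: export_gene polyphen2.sqlite MAP2K1 > MAP2K1.tab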
Released:
  08-Mar-2012
Contacts:
  Ivan Adzhubey   <ivan_adzhubey@hms.harvard.edu>
  Shamil Sunyaev  <ssunyaev@hms.harvard.edu>
  