PolyPhen-2 annotation summary report explained
The following is a description of the PolyPhen-2 annotation summary report. Reports in this format are produced both by the PolyPhen-2 Batch query web service and by the standalone PolyPhen-2 software. The report is a plain-text, tab-separated file in which each line annotates a single protein variant (amino acid residue substitution).
Fourteen of the columns listed below (1-2, 6-11, 41, 45-48, 91) are the ones included in the Short version of the report, which is available via the Batch query web page. These are sufficient if you are only interested in the PolyPhen-2 prediction outcome and prediction confidence scores. The remaining columns of the Full report version are mostly useful if you want to investigate in detail all the features supporting the prediction.
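As an illustration of how the Short report relates to the Full one, here is a minimal Python sketch that expands the column list quoted above and picks those fields out of each tab-separated line. The names `expand_spec` and `short_rows` are hypothetical helpers, not part of PolyPhen-2, and the sketch assumes that every data line carries all 91 fields and that fields may be padded with whitespace.

```python
import io

# Column numbers of the Short report, as quoted in the text (1-based).
SHORT_SPEC = "1-2, 6-11, 41, 45-48, 91"


def expand_spec(spec):
    """Expand a spec such as '1-2, 6-11' into a list of 1-based column numbers."""
    cols = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            cols.extend(range(int(first), int(last) + 1))
        else:
            cols.append(int(part))
    return cols


SHORT_COLUMNS = expand_spec(SHORT_SPEC)


def short_rows(handle):
    """Yield the Short-report fields for each non-empty line of a Full report."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        # Assumption: missing trailing fields are treated as empty strings.
        yield [fields[c - 1].strip() if c <= len(fields) else ""
               for c in SHORT_COLUMNS]


# Synthetic demo line: each field simply holds its own column number.
demo = io.StringIO("\t".join(str(i) for i in range(1, 92)) + "\n")
for row in short_rows(demo):
    print(len(row), row)
```

With a real report, `short_rows(open(path))` could be used the same way; whether the file starts with a header line that needs skipping is something to check against your own output.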
Column No. | Column Name | Description |
---|---|---|
Original query (as copied from user input): | ||
1 | query_no | input query ordinal |
2 | o_acc | original protein identifier |
3 | o_pos | original substitution position in the protein sequence |
4 | o_aa1 | original wild type (reference) amino acid residue |
5 | o_aa2 | original mutant (substitution) amino acid residue |
Annotated query: | ||
6 | rsid | dbSNP reference SNP identifier (rsID) if available |
7 | acc | UniProtKB accession if known protein, otherwise same as o_acc |
8 | length | length of the protein sequence |
9 | pos | substitution position in UniProtKB protein sequence, otherwise same as o_pos |
10 | aa1 | wild type amino acid residue in relation to UniProtKB sequence |
11 | aa2 | mutant amino acid residue in relation to UniProtKB sequence |
Nucleotide sequence context annotations: | ||
12 | chr_pos | SNP chromosome:position (chromosome coordinates are 1-based) |
13 | str | transcript strand (“+” or “-”) |
14 | gene | gene symbol |
15 | transcript | UCSC transcript name (unique identifier) |
16 | canon | UCSC knownCanonical representative transcript flag: 1 - canonical, 0 - alternative |
17 | cid | UCSC knownCanonical cluster identifier (number) |
18 | txcov | transcript coverage, the number of transcripts in UCSC cluster overlapping the mutation position / total number of transcripts in the cluster |
19 | ntlen | full transcript length (number of nucleotides) |
20 | ntpos | mutation position in the full transcript nucleotide sequence (in the direction of transcription) |
21 | nt1 | reference nucleotide (transcript strand) |
22 | nt2 | variant nucleotide (transcript strand) |
23 | PtGgPaNl | orthologous alleles in chimp (Pt), gorilla (Gg), orangutan (Pa) and gibbon (Nl) if different from human reference allele, ? - otherwise; . - data not available |
24 | dref | putative derived allele found in human reference, score: 0 - no evidence, 1 - variant allele matches orthologous ancestral allele, 2 - dbSNP minor allele matches reference allele, 3 - both dbSNP and orthologous evidence present, ? - not enough evidence to score |
25 | gerprs | Genomic Evolutionary Rate Profiling (GERP++) position-specific conservation score, RS; 0 - when alignment coverage is insufficient |
26 | phylop | conservation scoring by phyloP (phylogenetic p-values) from the PHAST package for multiple alignments of 99 vertebrate genomes to the human genome |
27 | trv | transversion mutation flag: 0 - transition, 1 - transversion |
28 | CpG | CpG context: 0 - non-CpG context retained, 1 - mutation removes CpG site, 2 - mutation creates new CpG site, 3 - CpG context retained: C(C/G)G substitution |
29 | JXmin | distance from mutation position to the nearest exon / intron junction (“-” for upstream, “+” for downstream) |
30 | JXc | mutation in a codon that is split across two exons: ? - no, 1 - yes |
31 | exon | mutation in exon # / of total exons (exons are enumerated in the direction of transcription) |
32 | cexon | same as above but only coding (CDS) exons are being enumerated |
33 | cdnpos | number of the mutated codon within transcript's CDS (1-based) |
34 | frame | mutation position offset within the codon (0..2) |
35 | dgn | degeneracy index for mutated codon position, by Nei & Kumar (2000) “Molecular Evolution and Phylogenetics”, page 64: 0 - non-degenerate, 2 - simple 2-fold degenerate, 3 - complex 2-fold degenerate, 4 - 4-fold degenerate |
36 | cdn1 | reference codon |
37 | cdn2 | mutated codon |
dbSNP annotations: | ||
38 | dbrsid | dbSNP SNP rsID |
39 | dbminor | dbSNP minor allele nucleotide (transcript strand) |
40 | dbmaf | dbSNP minor allele frequency |
PolyPhen-2 prediction outcome: | ||
41 | prediction | qualitative ternary classification appraised at 5%/10% (HumDiv) or 10%/20% (HumVar) False Positive Rate (FPR) thresholds: “benign”, “possibly damaging”, “probably damaging” |
PolyPhen-1 prediction description (obsolete, please ignore): | ||
42 | based_on | prediction basis |
43 | effect | predicted substitution effect on the protein structure or function |
PolyPhen-2 classifier outcome and scores: | ||
44 | pph2_class | probabilistic binary classifier outcome: “damaging” or “neutral” |
45 | pph2_prob | classifier probability of the variation being damaging |
46 | pph2_FPR | classifier model False Positive Rate (1 - specificity) at the above probability |
47 | pph2_TPR | classifier model True Positive Rate (sensitivity) at the above probability |
48 | pph2_FDR | classifier model False Discovery Rate at the above probability |
UniProtKB/Swiss-Prot/Pfam protein annotations: | ||
49 | PfamHit | Pfam identifier of the protein family or domain to which substitution maps |
50 | site | substitution SITE annotation |
51 | region | substitution REGION annotation |
52 | PHAT | PHAT matrix element for substitutions in the TRANSMEM region |
Multiple sequence alignment scores: | ||
53 | dScore | difference of PSIC scores for two amino acid residue variants (Score1-Score2) |
54 | Score1 | PSIC score for wild type amino acid residue (aa1) |
55 | Score2 | PSIC score for mutant amino acid residue (aa2) |
56 | MSAv | version of the multiple sequence alignment used in conservation scores calculations: 1 - pairwise BLAST HSP (obsolete), 2 - MAFFT-Leon-Cluspack (default), 3 - MultiZ CDS |
57 | Nobs | number of residues observed at the substitution position in multiple alignment (without gaps) |
58 | Nseqs | number of sequences observed at the substitution position in multiple alignment (including gaps) |
59 | Nsubs | number of residues different from reference residue (aa1) observed at the substitution position in multiple alignment |
60 | Nvars | number of residues same as substitution residue (aa2) observed at the substitution position in multiple alignment (without gaps) |
61 | Nres | number of unique residues observed at the substitution position in multiple alignment (without gaps) |
Substitution scores: | ||
62 | IdPmax | maximum congruency of the substitution amino acid residue across all sequences with a substitution at the substitution position in multiple alignment |
63 | IdPSNP | maximum congruency of the substitution amino acid residue to the sequences in multiple alignment with the substitution residue at the substitution position in multiple alignment |
64 | IdQmax | query sequence identity with the closest homologue deviating from the wild type amino acid residue (aa1) |
Phylogenetic tree based scores: | ||
65 | DistPmin | minimum normalized distance along the phylogenetic tree across all substitution types encountered at the substitution position |
66 | DistPSNP | minimum normalized distance along the phylogenetic tree for substitution residues (aa2) encountered at the substitution position |
67 | DistQmin | minimum distance (sum of branch lengths) along the phylogenetic tree across all substitution types encountered at the substitution position |
68 | BaRE | Bayesian Rate Estimator for scoring evolutionary conservation, D.M. Jordan (2015) |
Gene-based scores: | ||
69 | RVISraw | Residual Variation Intolerance Score (raw), Petrovski et al. (2013) |
70 | RVISranked | Residual Variation Intolerance Score (normalized by rank), Petrovski et al. (2013) |
RCSB PDB annotations: | ||
71 | Nstruct | initial number of BLAST hits to similar proteins with 3D structures in PDB |
72 | Nfilt | number of 3D BLAST hits after identity threshold filtering |
73 | PDB_id | PDB protein structure identifier |
74 | PDB_ch | PDB polypeptide chain identifier |
75 | PDB_len | PDB sequence alignment length |
76 | PDB_pos | position of substitution in the PDB protein sequence |
77 | PDB_idn | sequence identity between query sequence and the aligned PDB sequence |
Amino acid residues structural features: | ||
78 | dVol | change in residue side chain volume |
79 | dProp | change in solvent accessible surface propensity resulting from the substitution |
Protein 3D structure features: | ||
80 | SecStr | DSSP secondary structure assignment |
81 | MapReg | region of the phi-psi map (Ramachandran map) derived from the residue dihedral angles |
82 | NormASA | normalized accessible surface |
83 | B-fact | normalized B-factor (temperature factor) for the residue |
84 | H-bonds | number of hydrogen sidechain-sidechain and sidechain-mainchain bonds formed by the residue |
85 | AveNHet | number of residue contacts with heteroatoms, average per homologous PDB chain |
86 | MinDHet | closest residue contact with a heteroatom, Å |
87 | AveNInt | number of residue contacts with other chains, average per homologous PDB chain |
88 | MinDInt | closest residue contact with other chain, Å |
89 | AveNSit | number of residue contacts with critical sites, average per homologous PDB chain |
90 | MinDSit | closest residue contact with a critical site, Å |
Comments: | ||
91 | Comments | optional user comments, copied from input |
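The column numbers in the table can be used to address individual fields of a report line. The following Python sketch looks up a few fields by name and checks the relation stated for column 53, dScore = Score1 - Score2. The helper names, the tolerance value and the demo row are all assumptions made for illustration; in particular, how real reports round their scores or mark missing values is not specified here, so any non-numeric field is simply treated as missing.

```python
import math

# 1-based column numbers taken from the table above.
COL = {"prediction": 41, "pph2_prob": 45, "dScore": 53, "Score1": 54, "Score2": 55}


def field(fields, name):
    """Return the named field from a split report line, stripped of padding."""
    return fields[COL[name] - 1].strip()


def to_float(text):
    """Convert a field to float; empty or non-numeric fields become None."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def dscore_consistent(fields, tol=1e-3):
    """Check dScore == Score1 - Score2 within tol (an assumed rounding margin).

    Returns None when any of the three scores is missing.
    """
    d, s1, s2 = (to_float(field(fields, n)) for n in ("dScore", "Score1", "Score2"))
    if d is None or s1 is None or s2 is None:
        return None
    return math.isclose(d, s1 - s2, abs_tol=tol)


# Synthetic 91-field row; the values are made up for the demo.
row = [""] * 91
row[COL["prediction"] - 1] = "probably damaging"
row[COL["pph2_prob"] - 1] = " 0.998 "
row[COL["dScore"] - 1] = "2.5"
row[COL["Score1"] - 1] = "1.0"
row[COL["Score2"] - 1] = "-1.5"

print(field(row, "prediction"), to_float(field(row, "pph2_prob")))
print(dscore_consistent(row))
```

Keeping the name-to-number mapping in one dictionary means the rest of the code does not depend on positional indices, which is convenient if only a subset of columns is of interest.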
appendix_a.txt · Last modified: 2021/12/04 04:49 by 127.0.0.1