Estimates of selection

On this site, we share gene-based estimates of selection from:

Estimating the Selective Effects of Heterozygous Protein Truncating Variants from Human Exome Data. Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, Samocha KE, O'Donnell-Luria A, MacArthur DG, Daly MJ, Beier DR, Sunyaev SR. Nature Genetics 2017. PubMed, Nature Genetics PDF (via SharedIt), Nature Genetics HTML, bioRxiv pre-print

Original Scores: Using data from 60,706 exomes in the Exome Aggregation Consortium (ExAC), we estimated the genome-wide distribution of selection coefficients for heterozygous protein truncating variants, along with the corresponding Bayesian estimates for individual genes.

Updated Scores: We updated the mean of the posterior distribution using 125,748 exomes from gnomAD v2.1.1 and an updated mutational model, Roulette. We also applied a stringent allele frequency cutoff of 0.01% to improve sensitivity to autosomal dominant disease genes under strong selection.
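As a rough illustration of what a 0.01% allele frequency cutoff means in practice, the sketch below drops variants whose frequency exceeds 1e-4 before summing allele counts per gene. The input layout (gene, allele count, allele number), the function name, and the choice to keep variants exactly at the threshold are all assumptions made for this example; it is not the pipeline used to produce the scores.

```python
# Hypothetical sketch: apply a 0.01% allele frequency cutoff to PTV records.
# The tuple layout and function name are assumptions for illustration only.

AF_CUTOFF = 1e-4  # 0.01% expressed as a fraction


def count_rare_ptv_alleles(variants, af_cutoff=AF_CUTOFF):
    """Sum allele counts per gene, skipping variants above the cutoff.

    `variants` is an iterable of (gene, allele_count, allele_number) tuples.
    """
    totals = {}
    for gene, allele_count, allele_number in variants:
        if allele_number == 0:
            continue  # no called genotypes at this site, frequency undefined
        allele_frequency = allele_count / allele_number
        if allele_frequency > af_cutoff:
            continue  # too common: excluded by the cutoff
        totals[gene] = totals.get(gene, 0) + allele_count
    return totals


# Made-up example records, not real gnomAD data.
example_variants = [
    ("GENE_A", 3, 250000),    # AF = 1.2e-5, kept
    ("GENE_A", 100, 250000),  # AF = 4e-4, dropped
    ("GENE_B", 1, 200000),    # AF = 5e-6, kept
]
rare_counts = count_rare_ptv_alleles(example_variants)
```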

Downloadable scores

  • Latest estimates of heterozygous selection: The mean of the posterior distribution for each gene, updated using 125,748 exomes from gnomAD v2.1.1 and an updated mutational model, Roulette. A stringent allele frequency cutoff of 0.01% is applied to improve sensitivity to autosomal dominant disease genes under strong selection.
  • Original estimates of heterozygous selection: This file includes the mean of the posterior distribution (Eq. 7) for each gene, as well as the lower and upper bounds of the 95% credibility interval for each gene estimate. Credibility intervals have a precision of 10^-3 where s_het > 0.005 and 10^-5 otherwise.
  • Predicted mode of inheritance in severe Mendelian clinical exome cases: For each gene, we generate a probability for each mode of inheritance (autosomal dominant or autosomal recessive). Estimates are generated using a logistic regression trained on the full set of labeled case examples from two clinical exome sequencing programs (Baylor and UCLA). These estimates are applicable to the interpretation of genes in cases ascertained similarly to those in these two programs.
  • Prioritized lists of least studied and most studied genes under strong selection: Full annotations for the 250 genes with the highest PubMed scores and the 250 genes with the lowest PubMed scores in the top s_het decile. From the set of genes under the strongest selection (top 10% of s_het values), we create two sets of 250 genes: the most studied and the least studied. We then annotate these lists with results from neutrally ascertained screens of gene importance and gene essentiality, and summarize these screens using a heuristic score.
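
The construction of the two prioritized lists can be sketched roughly as follows: rank genes by s_het, keep the top 10%, then take the genes with the lowest and highest PubMed scores within that set. The record fields, function names, and tie handling here are assumptions made for illustration, and may not match the column names in the downloadable files or the exact procedure used to build them.

```python
# Hypothetical sketch: build "least studied" and "most studied" gene lists
# from the top s_het decile. Field and function names are illustrative only.
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene: str
    s_het: float
    pubmed_score: float


def top_s_het_decile(records):
    """Return the 10% of genes with the highest s_het values."""
    ranked = sorted(records, key=lambda r: r.s_het, reverse=True)
    decile_count = max(1, len(ranked) // 10)
    return ranked[:decile_count]


def least_and_most_studied(records, n=250):
    """Return (least_studied, most_studied) lists of up to n genes each."""
    decile = top_s_het_decile(records)
    by_pubmed = sorted(decile, key=lambda r: r.pubmed_score)
    n = min(n, len(by_pubmed) // 2)  # keep the two lists disjoint
    least_studied = by_pubmed[:n]
    # Reverse the tail so the most studied gene comes first.
    most_studied = by_pubmed[len(by_pubmed) - n:][::-1]
    return least_studied, most_studied


# Made-up example: 100 genes where s_het rises and PubMed score falls with i.
example = [GeneRecord(f"GENE_{i}", i / 100, 100 - i) for i in range(100)]
least, most = least_and_most_studied(example, n=2)
```

With the real files, n would stay at its default of 250 and the records would be read from the downloaded tables rather than generated.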