Research

Research Overview

We are developing software, algorithms, and methodology for the effective analysis of DNA, RNA, and protein data to discover the underlying genetic variants that cause Mendelian diseases. Based on the suggested inheritance mode of a case, we choose the most informative family members for genome sequencing and analyze the genomic data using our computational pipeline, which is designed to identify rare Mendelian variants. We are also responsible for the analysis of genome sequencing data generated by the Harvard Undiagnosed Diseases Network clinical site and by the Harvard branch of the FaceBase Consortium. Once solved, cases offer fresh opportunities for gene discovery, disease pathway elucidation, and identification of previously unrecognized therapeutic options. Several examples are described on the Publications page.

Mendelian Gene Discovery

The bioinformatic analysis consists of two parts: the upstream analysis, where the sequencing data are converted to genetic variant calls following GATK best practices, and the downstream analysis, where we evaluate candidate genes based on allele frequency, inheritance pattern, gene function, and variant effects. We prioritize the final candidate genes using in-house bioinformatic tools, segregation analysis, literature surveys, and crowdsourcing.
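As an illustration of the downstream step, the sketch below filters variant calls by allele frequency and predicted effect and then collects the genes that still carry a candidate. It is a minimal example under stated assumptions, not our in-house tooling: the `Variant` record, the frequency cutoff, and the list of effect labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    allele_frequency: float  # population allele frequency; 0.0 if never observed
    effect: str              # predicted effect label, e.g. "missense"


# Hypothetical settings for illustration only; real cutoffs depend on the
# disease prevalence and the suspected inheritance mode.
MAX_ALLELE_FREQUENCY = 0.001
DAMAGING_EFFECTS = {"missense", "stop_gained", "frameshift", "splice_site"}


def filter_candidates(variants):
    """Keep rare variants whose predicted effect is potentially damaging."""
    return [
        v for v in variants
        if v.allele_frequency <= MAX_ALLELE_FREQUENCY
        and v.effect in DAMAGING_EFFECTS
    ]


def candidate_genes(variants):
    """Return the sorted list of genes with at least one retained variant."""
    return sorted({v.gene for v in filter_candidates(variants)})


if __name__ == "__main__":
    calls = [
        Variant("GENE_A", 0.0001, "missense"),
        Variant("GENE_B", 0.05, "stop_gained"),   # too common
        Variant("GENE_C", 0.0, "synonymous"),     # unlikely to be damaging
    ]
    print(candidate_genes(calls))
```

Inheritance pattern and gene function would be applied as further filters on the retained variants; they are left out here to keep the sketch short.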

Variant Callers and Prioritization

We are developing variant callers and prioritization strategies using statistical and computational tools and error models specific to the problem at hand. To increase both specificity and sensitivity for rare variants co-segregating with the phenotype, a single variant caller is not enough. We are therefore developing de novo, recessive homozygous, compound heterozygous, and shared dominant callers for SNPs and CNVs, each with an error model specific to its inheritance mode. We use identity-by-descent (IBD) information, unrelated samples, and training datasets to optimize our methods.
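To make the inheritance modes concrete, the sketch below applies the simplest genotype rules to a parent-child trio. It is a deliberately naive illustration and not the callers described above: it assumes an affected child with two unaffected parents, encodes each genotype as an alternate-allele count (0, 1, or 2), and ignores the sequencing error models that the real callers depend on. The function names are hypothetical.

```python
def classify_trio_site(child, mother, father):
    """Label one variant site from alt-allele counts (0, 1, 2) in a trio.

    Returns "de_novo", "recessive_homozygous", or None when neither simple
    pattern applies. Assumes unaffected parents and error-free genotypes.
    """
    if child == 1 and mother == 0 and father == 0:
        # The child carries an allele that neither parent has.
        return "de_novo"
    if child == 2 and mother == 1 and father == 1:
        # The child inherited one copy from each carrier parent.
        return "recessive_homozygous"
    return None


def has_compound_het(gene_sites):
    """Check one gene for a compound heterozygous pattern.

    gene_sites is a list of (child, mother, father) alt-allele counts for
    the variants in that gene. The pattern requires one heterozygous variant
    inherited only from the mother and another inherited only from the father.
    """
    maternal = any(c == 1 and m == 1 and f == 0 for c, m, f in gene_sites)
    paternal = any(c == 1 and m == 0 and f == 1 for c, m, f in gene_sites)
    return maternal and paternal


if __name__ == "__main__":
    print(classify_trio_site(1, 0, 0))
    print(classify_trio_site(2, 1, 1))
    print(has_compound_het([(1, 1, 0), (1, 0, 1)]))
```

Hard genotype rules like these are sensitive to miscalled genotypes in any family member, which is one reason to pair each inheritance mode with its own error model.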

Protein Structure and Function to Interpret Variation

Interestingly, many of the variants that we discover in our program reside in conserved protein domains with very specific functions, such as ion-binding catalytic domains (as shown in the figure adapted from V. Lee et al., PNAS, 2016), proteolytic cleavage sites, or ligand-binding sites. This is not surprising, given the small mutational target expected for the rare phenotypes in our cases. We are developing tools that use protein evolution, structure, and function information to prioritize candidate genes at the variant level.
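One simple form of variant-level prioritization is to check whether the affected residue falls inside an annotated functional domain. The sketch below shows that lookup only; it is an assumption-laden illustration rather than the tools under development. The `Domain` record, the domain names, and the coordinates are hypothetical, and residue positions are taken to be 1-based and inclusive.

```python
from typing import NamedTuple, Optional, Sequence


class Domain(NamedTuple):
    name: str
    start: int  # first residue of the domain, 1-based inclusive
    end: int    # last residue of the domain, 1-based inclusive


def domain_for_residue(position: int, domains: Sequence[Domain]) -> Optional[str]:
    """Return the name of the first domain containing the residue, or None."""
    for domain in domains:
        if domain.start <= position <= domain.end:
            return domain.name
    return None


if __name__ == "__main__":
    # Hypothetical annotation for a single protein.
    annotation = [
        Domain("ion_binding_catalytic", 120, 310),
        Domain("ligand_binding", 400, 455),
    ]
    print(domain_for_residue(150, annotation))
    print(domain_for_residue(350, annotation))
```

A variant that maps to a domain could then be ranked above one that does not, with conservation and structural information refining the ranking further.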