RareIBD Software

Download

The latest version is rareibd-v1-2-tar.gz. (September 5, 2017)

The previous version v1.1 is rareibd-v1-1-tar.

Please use tar -xvzf rareIBD.v1.2.tar.gz to untar the files. This creates a directory named “rareIBD.v1.2” that contains all necessary files.

New in v1.2: A new option, -r, removes a variant from the analysis of a family if its allele frequency within that family is below a given threshold. For example, if only one individual carries a rare variant in a family of 100 individuals, the variant is not analyzed for that family (it is still analyzed in other families where its frequency is above the threshold). In very large families (> 50 individuals), we found that such variants may inflate test statistics, and since they provide little segregation information, we decided to ignore them. If your families are not very large (< 50 individuals), you can use -r 0 to turn off this option.
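To get a feel for which variants a within-family frequency threshold might affect, one can tabulate the minor allele frequency of every variant in every family. The following is only a rough sketch, not part of RareIBD: the function name family_allele_freq is hypothetical, it assumes a whitespace-delimited genotype file in the format described under Step 1, and the exact frequency definition that -r uses is not stated here, so treat the numbers as a guide only.

```shell
#!/bin/sh
# Hypothetical helper (not part of RareIBD).
# family_allele_freq GENO: for every family and variant, print
#   family  variant  number_of_genotyped_members  minor_allele_frequency
# where the frequency is (sum of minor allele counts) / (2 * members).
family_allele_freq() {
    awk '
        NR == 1 {                       # header: ped person variant1 variant2 ...
            for (i = 3; i <= NF; i++) name[i] = $i
            ncol = NF
            next
        }
        NF == 0 { next }                # skip blank lines
        {
            if (!($1 in n)) order[++nfam] = $1   # remember first-seen order
            n[$1]++
            for (i = 3; i <= ncol; i++) count[$1, i] += $i
        }
        END {
            for (f = 1; f <= nfam; f++) {
                fam = order[f]
                for (i = 3; i <= ncol; i++)
                    printf "%s %s %d %.4f\n", fam, name[i], n[fam], count[fam, i] / (2 * n[fam])
            }
        }
    ' "$1"
}
```

For example, family_allele_freq example/gene1.geno prints one line per family and variant. Only genotyped individuals are counted, since ungenotyped family members are absent from the genotype file.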

Note1: This version of the software has not been tested extensively and may throw exceptions if the input files are incorrectly formatted. Contact me (jaehoonsul at mednet dot ucla dot edu) if you encounter an error or exception.

Note2: Only rare variants should be analyzed using this software; my method assumes that only one founder in a family carries a mutation for a specific rare variant. If this assumption is violated, you may observe inflation of test statistics. Hence, all common variants must be filtered out before creating the input files for RareIBD.

Note3: If families are from different populations, they need to be analyzed separately because the minor allele frequency (MAF) of a variant can differ between populations; a variant can be rare in one population but common in another. As we showed in our manuscript, you can divide your families by population and analyze each population separately, although this may reduce power because of the smaller sample size. Analyzing them together is possible, but it requires complicated pre-processing steps and significant changes to the RareIBD code, which are currently under development. Please contact me if you are interested in this code.

Note4: Currently, RareIBD supports only binary traits. Support for quantitative traits is under development and will be available as soon as it is tested. In the meantime, I recommend using variance component approaches designed for quantitative traits in families (such as famSKAT).

Installation

Step 1. Preparation of input files (all example input files are stored in the example/ directory). RareIBD requires at least 3 input files for each gene. In this version, RareIBD can only be applied to one gene at a time. If you want to apply RareIBD to multiple genes, create the following set of files for each gene.

1. Genotype file (gene1.geno file): The first row is a header in which the first column is “ped”, the second column is “person”, and the remaining columns are the RSIDs of the variants in the gene. In each subsequent row, the first column is the family ID, the second column is the individual ID, and each remaining column gives the number of minor alleles (0/1/2) that the individual carries for the corresponding variant. Missing genotypes are not allowed, so they must be imputed before running RareIBD. Individual IDs must also be unique, so it is recommended that each individual ID start with the family ID (e.g. fam1:[individual ID]). Individuals in a family who are not genotyped do not need to be present in the genotype file.

2. Pedigree file (pedigree.txt): The first row is a header. The remaining rows follow the traditional pedigree file format, where the first column is the family ID, the second column is the individual ID, the third column is the father ID, the fourth column is the mother ID, the fifth column is sex (1 for male, 2 for female), and the last column is the trait. The current version of RareIBD only supports a binary trait: affected individuals are 1, unaffected individuals are 0, and individuals with a missing trait are -9. Please specify the full pedigree structure even if some individuals are not genotyped, as RareIBD uses the full pedigree structure information. One important note about the individual ID: the individual ID (in the second column) needs to be in the format "[Family ID]:[Individual ID]" (there must be a colon separating the family ID and the individual ID). The kinship file (below) needs to be generated using this format.

3. Kinship coefficient file (kinship.txt): This is generated from the kinship2 R package. You can use the following R code to generate this kinship file.
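# Load the kinship2 package and read the pedigree file (the first row is a header).
library(kinship2)
ped = read.table("pedigree.txt",header=TRUE)
# Columns: 2 = individual ID, 3 = father ID, 4 = mother ID, 5 = sex.
kinshipmatrix = kinship(id=ped[,2],dadid=ped[,3],momid=ped[,4],sex=ped[,5],chrtype="autosome")
# Write the kinship matrix without quoting the IDs.
write.table(kinshipmatrix,"kinship.txt",quote=FALSE)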

4. (Optional) Weight file (gene1.weight): A weight for each SNV can be specified using the weight file. In this file, each line gives the weight of one SNV in the gene.
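Since incorrectly formatted input may cause exceptions, it can help to sanity-check the genotype and pedigree files against the layouts described in items 1 and 2 before moving on. The following is a minimal sketch, not part of RareIBD: the function names check_geno and check_ped are hypothetical, and it assumes whitespace-delimited files with at least six pedigree columns (the delimiter is not stated above).

```shell
#!/bin/sh
# Hypothetical helpers (not part of RareIBD). Each prints the problems it
# finds and returns a non-zero status if there is at least one.

# check_geno FILE: header must start with "ped person", every row must have
# as many columns as the header, individual IDs must be unique, and every
# genotype must be 0, 1 or 2.
check_geno() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 {
            if ($1 != "ped" || $2 != "person") { print "line 1: header must start with ped person"; bad = 1 }
            ncol = NF
            next
        }
        NF == 0 { next }                # skip blank lines
        NF != ncol { print "line " NR ": expected " ncol " columns, found " NF; bad = 1 }
        seen[$2]++ { print "line " NR ": duplicate individual ID " $2; bad = 1 }
        {
            for (i = 3; i <= NF; i++)
                if ($i != "0" && $i != "1" && $i != "2") {
                    print "line " NR ": invalid genotype " $i
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}

# check_ped FILE: individual IDs must start with "<family ID>:" and the
# trait (last column) must be 0, 1 or -9.
check_ped() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 || NF == 0 { next }     # skip header and blank lines
        NF < 6 { print "line " NR ": expected at least 6 columns, found " NF; bad = 1; next }
        index($2, $1 ":") != 1 { print "line " NR ": individual ID " $2 " is not of the form " $1 ":<ID>"; bad = 1 }
        $NF != "0" && $NF != "1" && $NF != "-9" { print "line " NR ": invalid trait " $NF; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

With the example files, this would be called as check_geno example/gene1.geno and check_ped example/pedigree.txt. The checks cover only the rules stated above; they do not verify that the pedigree structure itself is consistent.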

Step 2. Pre-computing RareIBD statistics for founders. For computational efficiency, the mean and SD of the RareIBD statistic for each founder must be pre-computed before running RareIBD. Please use the RareIBDPrecompute.jar program as follows.

Usage: java -jar RareIBDPrecompute.jar [-s seed] [-m max_#_IV_sampling] [-g genotype file] [-p pedigree file] [-f family ID] [-o output dir] [-h]

Required parameters:
[-m max_#_IV_sampling]: maximum number of IV sampling to perform (100000 recommended)
[-g genotype file]: full path to the genotype file.
[-p pedigree file]: full path to the pedigree file.
[-f family ID]: family ID for which mean and standard deviation of RareIBD statistics will be computed.
[-o output dir]: full path to directory where the output file will be stored ($famID.mean_SD.txt).

Optional parameters:
[-s seed]: random seed. This will be added to the current time for the final seed.

You need to run RareIBDPrecompute.jar for each family (using its family ID with -f option). For example, there are 5 families in the example (fam1, fam2, fam3, fam4, fam5). Please use the following commands.

java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g example/gene1.geno -p example/pedigree.txt -f fam1 -o mean_sd/

java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g example/gene1.geno -p example/pedigree.txt -f fam2 -o mean_sd/

java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g example/gene1.geno -p example/pedigree.txt -f fam3 -o mean_sd/

java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g example/gene1.geno -p example/pedigree.txt -f fam4 -o mean_sd/

java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g example/gene1.geno -p example/pedigree.txt -f fam5 -o mean_sd/

You can execute these commands on a high-performance computing (HPC) cluster by submitting each family as a separate job. The output files will be stored in the directory given with the -o option (mean_sd/ in the commands above). Unless the family is very big (> 100 individuals), this program is usually very fast.
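With many families, typing one command per family becomes tedious. As a sketch, the per-family commands can be generated from the family IDs in the first column of the pedigree file. The function names list_families and precompute_commands are hypothetical, the file is assumed to be whitespace-delimited, and paths are assumed to contain no spaces. The flags and values are the ones used in the example commands above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of RareIBD).

# list_families PEDIGREE: print each distinct family ID (column 1) once,
# in order of first appearance, skipping the header row.
list_families() {
    awk 'NR > 1 && NF > 0 && !seen[$1]++ { print $1 }' "$1"
}

# precompute_commands PEDIGREE GENO OUTDIR: print (do not run) one
# RareIBDPrecompute.jar command per family.
precompute_commands() {
    list_families "$1" | while read -r fam; do
        printf 'java -jar RareIBDPrecompute.jar -s 100 -m 100000 -g %s -p %s -f %s -o %s\n' \
            "$2" "$1" "$fam" "$3"
    done
}
```

Running precompute_commands example/pedigree.txt example/gene1.geno mean_sd/ prints the commands so they can be inspected first. The output can then be piped to sh to run them one after another, or each line can be placed in its own job script for a cluster scheduler.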

Step 3. Running RareIBD. Once the mean and SD of RareIBD statistics for all founders are estimated, you can run RareIBD as follows.

Usage: java -jar RareIBD.jar [-s seed] [-m # of gene dropping permutations] [-g genotype file] [-p pedigree file] [-k kinship file] [-d mean_SD dir] [-n gene name] [-o output file] [-w weight file] [-h]

Required parameters:
[-m # of gene dropping permutations]: # of permutations to perform to estimate a p-value.
[-g genotype file]: full path to the genotype file.
[-p pedigree file]: full path to the pedigree file.
[-k kinship file]: full path to the kinship file.
[-d mean_SD dir]: full path to the directory that contains precomputed values (see RareIBDPrecompute.jar).
[-n gene name]: the name of a gene being tested.
[-o output file]: full path to the output file.

Optional parameters:
[-s seed]: random seed. This will be added to the current time for the final seed.
[-w weight file]: full path to the external weight file that specifies weight of each SNV.

Here’s a sample command for the example.

java -jar RareIBD.jar -s 100 -m 10000 -g example/gene1.geno -p example/pedigree.txt -k example/kinship.txt -d mean_sd/ -w example/gene1.weight -n gene1 -o output/rareibd.weight.txt

The output file contains two columns: the first column is the gene name and the second column is the p-value.
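When several genes have been tested, the two-column output makes it easy to pull out the genes below a chosen p-value cutoff. The following is a small sketch under the assumption that the columns are whitespace-delimited; the function name significant_genes and the cutoff are hypothetical and not part of RareIBD. Lines whose second column is not a number (for instance a header, if one is present) are skipped.

```shell
#!/bin/sh
# Hypothetical helper (not part of RareIBD).
# significant_genes FILE ALPHA: print "gene p-value" for every line whose
# second column is a number smaller than ALPHA.
significant_genes() {
    awk -v alpha="$2" '
        NF >= 2 && $2 ~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && ($2 + 0) < (alpha + 0) {
            print $1, $2
        }
    ' "$1"
}
```

For example, if each gene was run with its own output file, the results could be combined and filtered with cat output/*.txt > all_genes.txt followed by significant_genes all_genes.txt 0.001. The cutoff should account for the number of genes tested.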

FAQ

Any FAQ will be listed here.