Home / p/bbmix

bbmix

Bayesian Model for Genotyping using RNA-Seq

The method models RNA-seq reads using a mixture of 3 beta-binomial distributions to generate posterior probabilities for genotyping bi-allelic single nucleotide polymorphisms. Elena Vigorito, Anne Barton, Costantino Pitzalis, Myles J. Lewis and Chris Wallace (2023) <doi:10.1093/bioinformatics/btad393> "BBmix: a Bayesian beta-binomial mixture model for accurate genotyping from RNA-sequencing."

README

<!-- README.md is generated from README.Rmd. Please edit that file -->
The goal of bbmix is to genotype variants using RNA-seq reads. We
developed a 2-step approach. First, we selected RNA-seq reads that
overlap known exonic bi-allelic SNPs to learn the parameters for a
mixture of three Beta-Binomial distributions underlying the three
possible genotypes. In the second step we obtained posterior
probabilities for each of the genotypes using the parameters learnt in
step 1

### Install bbmix from [GitLab](https://about.gitlab.com/):

Installation has been tested on R 3.5.1. Installation time is estimated
as 2 minutes.

    ## Within R:
    install.packages("devtools") # if you don't already have the package
    library(devtools)
    devtools::install_git(url = "https://gitlab.com/evigorito/bbmix.git/")

Running bbmix
-------------

Preparation of the required input files are described in
[bbmix\_pipeline](https://gitlab.com/evigorito/bbmix_pipeline/).

We have shown that training the model with the same sample that we want
to learn genotypes, with a pool of reads from all the study samples or
with an external sample did not affect much the accuracy of genotype
calls. Here we provide examples on how to call genotypes in this 3
scenarios.

### Calling genotypes with the model trained with external default sample (NA12878)

The function to call is **call\_gt**. We provide an example for calling
genotypes on chromosome 22 for the genome in a bottle NA12878 sample.
Genome coordinates are in built 37.

**Arguments**

*allele\_counts\_f*: file name with allele counts for SNPs (details on
how to generate this file are in the pipeline.)

*depth* : minumun number of reads overlapping a SNP for genotypes to be
called, defaults to 10

*stan\_f* : full name to stan object with model fit to extract mean of
parameters. Defaults to NULL, using the model trained with genome in a
bottle reads. Otherwise this object can be generated with fit\_bb
function.

*legend\_f* : name for legend file with allele frequencies to apply for
prior. Requited columns are “position a0 a1” and the name for the column
to extract allele frequencies. The file need to be compressed and
compatible with gunzip.

*pop* : column name from legend\_f for the population to extract allele
frequencies, defaults to EUR

*prob* : threshold for the posterior probability to make hard calls,
defaults to 0.99

*fisher\_f* : file with SNP annotations with Fisher exact test for
detecting strand bias

*fisher* : cut-off for Fisher strand bias, defaults to 30

*cluster\_f* : file with SNP annotation of whether there are in clusters
of at least 3 within 35bp

*out* : file name to save output

    ## Retrive input files for running call_gt
    counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
            package = "bbmix",
            mustWork = TRUE)

    legend <- system.file("extdata/input", "1000GP_Phase3_chr22.legend.gz",
           package = "bbmix", mustWork = TRUE)

    fisher_f <- system.file("extdata/input", "chr22.FS.Q20.alleleCounts.txt",
         package = "bbmix", mustWork = TRUE)

    cluster_f <- system.file("extdata/input", "fSNPs_22_RP_maf0_01_cluster3window35.txt",
          package = "bbmix", mustWork = TRUE)

    ## Choose your output directory
    out <- "path/to/output_dir/NA12878.chrom22.gt.txt"

    ## Run call_gt:
    call_gt(allele_counts_f = counts_f,
        legend_f = legend,
        fisher_f = fisher_f,
        cluster_f = cluster_f,
        out = out)

### Calling genotypes with the model trained with a sample of choice

1- We first call the function **fit\_bb** and then **call\_gt** as
follows:

**Arguments for fit\_bb**

*counts\_f*: file name with allele counts for SNPs (details on how to
generate this file are in the pipeline, same format as for
allele\_counts\_f in call\_gt function.)

*N* : Number of SNPs to train the model, defaults to 1000

*prefix* : prefix to add to files when saving (suffix:
\_stan\_beta\_binom\_fit.rds), suggested prefix is sample name, defaults
to NULL (stan\_beta\_binom\_fit.rds)

*alpha\_p* : Beta prior parameter for model training, default
\[1,10,499\]

*beta\_p* : Beta prior parameter for model training, default \[499, 10,
1\]

*out* : directory to save output from model fitting

    ## Retrive input files for running call_gt
    counts_f <- system.file("extdata/input", "NA12878.chr22.Q20.allelicCounts.txt",
            package = "bbmix",
            mustWork = TRUE)

    ## Choose your output directory
    out <- "path/to/output_dir"

    ## Run fit_bb:
    fit_bb(counts_f = counts_f,
        out = out)

2- Call genotypes using **call\_gt**

We use the same arguments as above except that now we use the argument
“stan\_f” with the output file generated by the function fit\_bb:

    ## Run call_gt:
    call_gt(allele_counts_f = counts_f,
        stan_f = output_from_fit_bb,
        legend_f = legend,
        fisher_f = fisher_f,
        cluster_f = cluster_f,
        out = out)

### Calling genotypes with a pool of samples

The function **poolreads** can be use for preparing the input file
‘counts\_fit’ for **bbmix**. **poolreads** takes the following
arguments: \* files: vector with the name of the files to pool the reads
from \* N : number of SNPs to select for each sample, defaults to 1000
\* d : read depth for a SNP to be included in the pool, defaults to 10
\* out : name to write output file

Then call **fit\_bb** for model fitting as in previous example using the
output file from **poolreads** as input for ‘counts\_f’ argument. Last,
call **cal\_gt** with stan\_f argument generated by fit\_bb.

### Calling genotypes output

    ## Inspecting output files
    gt <- data.table::fread(system.file("extdata/output", "gt.NA12878.chr22.txt",
                                      package = "bbmix",
                                  mustWork = TRUE))

    head(gt)
    #>    CHROM      POS REF ALT NREF NALT total        EUR        p0           p1
    #> 1:    22 17538189   T   C   14    0    14 0.07952286 0.9996466 0.0003534366
    #> 2:    22 17538439   C   T   15    0    15 0.50994036 0.9968645 0.0031355337
    #> 3:    22 17538808   T   C   23    0    23 0.79522863 0.9985178 0.0014822119
    #> 4:    22 17538980   C   G   12    0    12 0.59244533 0.9888215 0.0111784524
    #> 5:    22 17539427   G   T   18    0    18 0.79522863 0.9949742 0.0050258204
    #> 6:    22 17539439   T   C   13    0    13 0.79522863 0.9787282 0.0212717969
    #>              p2  expected_GT  expected_sd GT SNPcluster        FS
    #> 1: 1.271118e-29 0.0003534366 0.0001142123  0      FALSE  3.802112
    #> 2: 5.602803e-29 0.0031355337 0.0010943268  0      FALSE  2.178861
    #> 3: 4.304680e-39 0.0014822119 0.0008511031  0      FALSE  9.150132
    #> 4: 4.721349e-24 0.0111784524 0.0029755483 NA      FALSE  0.000000
    #> 5: 3.173322e-32 0.0050258204 0.0021620213  0      FALSE  4.615766
    #> 6: 8.830021e-25 0.0212717969 0.0061388626 NA      FALSE 19.158656

The table list the chromosome (CHROM), position (POS), reference (REF)
and alternative alleles (ALT), the number of reads overlapping the
reference allele (NREF), the alternative allele (NALT) and total. The
next column is the effector allele frequency for the selected ethnicity,
followed by the posterior probabilities for genotypes homozygous
reference (p0), heterozygous (p1) or hom alternative (p2). Then, the
expected genotype (expected\_GT, dosage), its the standard deviation
(expected\_sd) and hard calls based on parameter “prob” follows in
column GT. Last, the quality control measures for SNP in clusters or
Fisher strand bias phred scaled p-value are labelled as SNPclusters and
FS.

Versions across snapshots

Version Repository File Size

1.0.0 rolling linux/jammy R-4.5 bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 rolling linux/noble R-4.5 bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 rolling source/ R- bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 latest linux/jammy R-4.5 bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 latest linux/noble R-4.5 bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 latest source/ R- bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 2026-04-26 source/ R- bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 2026-04-23 source/ R- bbmix_1.0.0.tar.gz 267.8 KiB

1.0.0 2025-04-20 source/ R- bbmix_1.0.0.tar.gz 267.8 KiB

Version	Repository	File	Size
`1.0.0`	rolling linux/jammy R-4.5	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	rolling linux/noble R-4.5	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	rolling source/ R-	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	latest linux/jammy R-4.5	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	latest linux/noble R-4.5	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	latest source/ R-	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	2026-04-26 source/ R-	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	2026-04-23 source/ R-	`bbmix_1.0.0.tar.gz`	267.8 KiB
`1.0.0`	2025-04-20 source/ R-	`bbmix_1.0.0.tar.gz`	267.8 KiB

Dependencies (latest)

Imports

methods
Rcpp (>= 0.12.0)
rstan (>= 2.18.1)
R.utils
data.table
rmutil

LinkingTo