Skip to content

This is the official development repository for BaseVar, which call variants for large-scale ultra low-pass (<1.0x) WGS data, especially for NIPT data

License

Notifications You must be signed in to change notification settings

ShujiaHuang/BaseVar2

Repository files navigation

BaseVar: Call variants from ultra low-pass WGS data

BaseVar is a specialized tool tailored for variant calling using ultra low-depth (<1x) sequencing data, particularly catering to non-invasive prenatal test (NIPT) data in human genetic studies. Leveraging maximum likelihood and likelihood ratio models, BaseVar accurately identifies polymorphisms at genomic positions and calculates allele frequencies. For in-depth mathematical explanations, refer to the comprehensive documentation available here.

Now, BaseVar has been fully implemented by C++. BaseVar showcases significant enhancements over its original Python counterpart. The C++ implementation delivers a computing speed exceeding 10 times that of the Python version, all while demanding substantially less memory. Typically, each thread (-t/--thread) consumes merely 3GB to 4GB of memory when the -B (--batch-count) option is configured to 200, a stark contrast to the Python version's requirement of over 20GB.

Citation

Please cite the following papers if you use BaseVar in your published projects or papers.

  • Liu, S., Liu, Y., Gu, Y., Lin, X., Zhu, H., Liu, H., Xu, Z., Cheng, S., Lan, X., Li, L., Huang, M., Li, H., Nielsen, R., Davies, RW., Albrechtsen, A., Chen, GB., Qiu, X., Jin, X., Huang, S., (2024). Utilizing non-invasive prenatal test sequencing data for human genetic investigation. Cell Genomics 4(10), 100669 doi:10.1016/j.xgen.2024.100669

Installation

BaseVar requires C++17 or above. Compile basevar from source codes step-by-step.

You can install basevar using either of the following two methods.

Method 1. Install basevar by using cmake (Recommend)

This is the simplest way of installing basevar by cmake

git clone https://github.com/ShujiaHuang/basevar2.git
cd basevar
mkdir build
cmake ..
make 

If you have problems downloading, please try several times.

If everything is smooth, you'll find an exectutable file named basevar in bin/ folder.

Method 2. Manually install processes (Optional)

Sometimes, you may need to complete the installation manually if the above command fails to install BaseVar.

1. Download BaseVar from github

BaseVar is hosted on Github and can be downloaded with the following command:

git clone https://github.com/ShujiaHuang/basevar2.git

WARNING: Please try several times if fail to clone the data causing by the network problem.

2. Navigate into htslib folder and run the following commands

After cloing, navigate into the basevar folder and execute the following:

cd htslib
autoreconf -i
./configure
make

Note: If you encounter an error message similar to the following during the compilation of htslib, you can safely disregard it as the code should continue to function properly:

test/test_khash.c: In function 'write_stats_str2int':
test/test_khash.c:53:9: warning: implicit declaration of function 'kh_stats' [-Wimplicit-function-declaration]
   53 |     if (kh_stats(str2int, h, &empty, &deleted, &hist_size, &hist) == 0) {
      |         ^~~~~~~~
test/test_khash.c:53:18: error: 'str2int' undeclared (first use in this function)
   53 |     if (kh_stats(str2int, h, &empty, &deleted, &hist_size, &hist) == 0) {
      |                  ^~~~~~~
test/test_khash.c:53:18: note: each undeclared identifier is reported only once for each function it appears in
make: *** [test/test_khash.o] Error 1

Feel free to proceed with your installation tasks despite encountering this error during the compilation process.

3. Go back to the upper directory and install basevar by running the commands below

Navigate into bin/ folder firstly and execute the following commands:

Manually install in Linux

cd bin/
g++ -O3 -fPIC ../src/*.cpp ../src/io/*.cpp ../htslib/libhts.a -I ../htslib -lz -lbz2 -lm -llzma -lpthread -lcurl -lssl -lcrypto -o basevar

Manually install in MacOS

cd bin/
g++ -O3 -fPIC ../src/*.cpp ../src/io/*.cpp ../htslib/libhts.a -I ../htslib -lz -lbz2 -lm -llzma -lpthread -lcurl -o basevar

To review each of the parameters, you can type basevar caller -h in the Linux/MacOS Terminal.

$ /path/to/basevar caller -h

About: Call variants and estimate allele frequency by BaseVar.
Usage: basevar caller [options] <-f Fasta> <--output-vcf output_file> [-L bam.list] in1.bam [in2.bam ...] ...

Required arguments:
  -f, --reference FILE         Input reference fasta file.
  -o, --output    FILE         Output VCF file.

Optional options:
  -L, --align-file-list=FILE   BAM/CRAM files list, one file per row.
  -r, --regions=REG[,...]      Skip positions which not in these regions. This parameter could be a list
                               of comma deleimited genome regions(e.g.: chr:start-end).
  -G, --pop-group=FILE         Calculating the allele frequency for specific population.

  -m, --min-af=float           Setting prior precision of MAF and skip ineffective caller positions,
                               a typical approach involves setting it to min(0.001000, 100/x), where x
                               represents the number of input BAM files min(0.001000, 100/x). In most
                               cases, users need not be overly concerned about this parameter, as it
                               is generally handled automatically by the program.
  -Q, --min-BQ INT             Skip bases with base quality < INT [10]
  -q, --mapq=INT               Skip reads with mapping quality < INT [5]
  -B, --batch-count=INT        INT simples per batchfile. [500]
  -t, --thread=INT             Number of threads. [14]

  --filename-has-samplename    If the prefix name of bamfiles/cramfiles start with 'Sample ID', something like 'SampleID.bam', set this
                               argrument could save a lot of time during get the sample id from BAMfile.
  --smart-rerun                Rerun process by checking batchfiles.
  -h, --help                   Show this help message and exit.

Example usage:
  [1]. basevar caller -f reference.fasta -o output.vcf.gz -Q 20 -q 30 -B 500 --filename-has-samplename -L bam.list
  [2]. basevar caller -f reference.fasta -o output.vcf.gz -Q 20 -q 30 -B 500 --filename-has-samplename -L bam.list sample1.bam sample2.bam
  [3]. basevar caller -f reference.fasta -o output.vcf.gz -Q 20 -q 30 -B 500 --filename-has-samplename -L bam.list -r chr1 sample1.bam sample2.bam

This command will provide detailed information about parameters of basevar.

Quick start

Call variants from several bamfiles

basevar caller -f reference.fasta \
    -Q 20 -q 30 -B 500 -t 24 \
    --pop-group=sample_group.info \
    --regions=chr11:5246595-5248428,chr17:41197764-41276135 \
    --output test.vcf.gz 00alzqq6jw.bam 09t3r9n2rg.bam 0fkpl1p55b.bam ...

The format of sample_group.info could be found here.

Or call variants from bamlist

basevar caller -f reference.fasta \
    -Q 20 -q 30 -B 500 -t 24 \
    -L bamfile.list \ 
    --regions=chr11:5246595-5248428,chr17:41197764-41276135 \
    --pop-group=sample_group.info \
    --output test.vcf.gz 

For stramlinened variant calling across the entire genome, you can use the pipeline generator create_pipeline.py, which distributes the computational tasks based on the --delta parameter across a specific chromosome defined by the -c parameter.

python create_pipeline.py -Q 20 -q 30 -f reference.fa --ref_fai reference_fa.fai -c chr20 --delta 5000000 -t 24 -L bamfile.list -o outdir > basevar.chr20.sh

BaseVar is under active development. Obtain the newest version by pulling the newest version and compilling again.

About

This is the official development repository for BaseVar, which call variants for large-scale ultra low-pass (<1.0x) WGS data, especially for NIPT data

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published