Copyright (C) University of Tartu 2018-2021 Please cite: Kaplinski L, Möls M, Puurand T, Pajuste F-D, Remm M. (2021). KATK: fast genotyping of rare variants directly from unmapped sequencing reads.
-
Working principle KATK is a fast and accurate genotype caller from sequencing data. It is designed to discover all variants in personal genomes, including rare variants and de novo mutations. It uses region-specific k-mers to collect sequencing reads of interest, makes local alignment and calls the variants. For sake of speed, the current version of KATK is designed to detect variants not from the entire genome, but only from protein-coding exons and their flanking regions.
-
Compilation KATK package utilizes two binaries: gmer_counter and gassembler. gmer_counter creates the index which associates k-mers and reads from FASTQ file. gassembler retrieves the reads, aligns them using Smoith-Waterman algorithm and calls genotypes at each position. For downloading and compiling gmer_counter and gassembler proceed as follows:
git clone https://github.com/bioinfo-ut/GenomeTester4.git
cd GenomeTester4/
cd src
make gmer_counter
make gassembler
- Usage To use KATK for variant calling you need two files: text file with regions of interest and binary file with region-specific k-mers. The binary file is in FastGT format. You can download pre-made exome-specific files from our webpage http://bioinfo.ut.ee/KATK/
wget http://bioinfo.ut.ee/KATK/downloads/KATK_db_20200401.tar.gz
tar zxvf KATK_db_20200401.tar.gz
For every individual ID you need to build a k-mer index file from raw reads in FASTQ file using program gmer_counter:
gmer_counter -dbb cmd_20190410.dbb --compile_index ID.index --silent FASTQ_FILE(S)
Then the genotypes can be called using gassembler:
gassembler -dbi ID.index --file cmd_20191031.txt > ID.calls
This should give you a text file with variant genotype calls.
- Performance adjustments The performance of gmer_counter can be adjusted with arguments '--num_threads' and '--prefetch'. The performance of gassembler can be adjusted with arguments '--num_threads' and '--prefetch_seq'. In both cases, '--num_threads' should not exceed the number of CPU cores that you have access to (default is 24).
Prefetching of data should be switched on if you have more than 64GB RAM available. This gives significant boost to the performance of KATK, particularly if you plan to call more than one individual. For example:
gmer_counter -dbb cmd_20190410.dbb --compile_index ID.index --prefetch FASTQ_FILE(S)
gassembler -dbi ID.index --file cmd_20191031.txt --prefetch_seq > ID.calls
Another way to speed up the calling by gassembler is to define sex of each individual on command line. This saves time by skipping automated prediction of sex of the individual from the data. For example:
gassembler -dbi ID.index --file cmd_20191031.txt --prefetch_seq --sex male > ID.calls