ConSTRain is a short tandem repeat (STR) variant caller that can account for copy number variants (CNVs), aneuploidies, and polyploid genomes. It is accurate and fast, needing less than 20 minutes to genotype >1.7 million STRs in a 100X alignment of human whole-genome sequencing reads on 32 threads. For more detailed information on ConSTRain, please refer to the preprint.
- Method overview
- Installation
- Running ConSTRain
- Input files and their formats
- Command line arguments
- INFO and FORMAT fields
In case you encounter any problems with ConSTRain or have a question, feel free to open an issue or send an email to max.verbiest@zhaw.ch.
To infer STR genotypes, ConSTRain generates all possible allele length distributions for each locus and returns the one that best matches the observed allele length distribution. An overview of how ConSTRain works internally is shown below.
ConSTRain overview and example.
(1) An STR locus are loaded from the input files.
The locus reference information is parsed from the STR panel.
The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNV.
(2) Reads overlapping the STR region are extracted from the alignment file, and the length of the STR region in each read is determined.
(3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept.
(4) This yields the final observed allele length distribution.
(5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G.
(6) From G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR copy number (3 in the example).
Each row in D corresponds to the expected allele length distribution of one of the genotypes in G.
(7) The expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row in D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value.
(9) The genotype in G with the lowest error is selected (10) and reported in the output.
The inferred genotype of the STR locus in this example consists of an allele of 4 CAG units (present once), an allele of 5 CAG units (present once), and an allele of 8 CAG units (also present once).
ConSTRain releases ship with pre-compiled binaries for MacOS (Intel) and x86_64 Linux. You can find them here. Please notify us of any issues or if you'd like us to include pre-compiled binaries for other targets.
In order to build ConSTRain from the source code, you will need to install the Rust programming language. Then, clone this repository and build ConSTRain as follows:
git clone https://github.com/acg-team/ConSTRain.git
cd ConSTRain/ConSTRain
cargo build --release --bin ConSTRainThe binary will be created at ./target/release/ConSTRain (alternatively, cargo install --path . from within the ConSTRain/ConSTRain directory will create the binary and put it in ~/.cargo/bin ).
Building hts-sys fails with the following error:
Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*' 'libclang-*.so.*'], set the LIBCLANG_PATH environment variable to a path where one of these files can be found (invalid: [])"This is caused because Bindgen requires libclang. The problem can be solved by installing clang (see instructions here) or, if it is already installed, providing the location of the directory containing libclang.so through the LIBCLANG_PATH environment variable:
LIBCLANG_PATH=/path/to/clang/lib cargo build --release --bin ConSTRainBuilding hts-sys fails with an error like this:
./htslib/htslib/hts.h:31:10: fatal error: 'stddef.h' file not foundCaused because some system header file (e.g., stddef.h) is not found. Address this by providing the location of the directory containing system header files through the C_INCLUDE_PATH environment variable:
C_INCLUDE_PATH=/path/to/include cargo build --release --bin ConSTRainSome basic ConSTRain invocations are shown here.
For details on all command line arguments, run ConSTRain --help or go to the command line arguments section below.
Calling STR variants from aligned sequencing reads is done via the ConSTRain alignment subcommand.
To run, ConSTRain alignment needs at least a BED file specifiying locations of STRs in the reference genome, a JSON file encoding the target chromosomes and their ploidies, and an alignment in SAM/BAM/CRAM format.
For some species, input files are provided in resources directory (currently, only Homo sapiens and Musa acuminata).
A basic ConSTRain alignment invocation looks like this:
ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>By default, ConSTRain prints logging information up to and including 'info' level to stderr.
To enable more in-depth logging information, set the RUST_LOG environment variable, e.g.:
RUST_LOG=debug ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>If you know there are CNVs in the sample you're analysing, ConSTRain can account for these if you supply the affected regions as a BED file via the --cnvs flag:
ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS> --cnvs <CNVs>Once an alignment has been analysed using ConSTRain alignment the generated VCF file can be used to re-run ConSTRain.
This is useful if you decide you want to use different filtering parameters, or maybe you have access to new CNV information that was not available when you first analysed a sample.
Rather than having to analyse the alignment again, you can use the ConSTRain vcf subcommand to re-estimate STR genotypes.
This is possible because ConSTRain includes the observed allele length distribution for each STR as a FORMAT field in the VCF file.
Running ConSTRain vcf is typically a matter of a few seconds.
The basic invocation of ConSTRain vcf is:
ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE>Again, you can add CNV information via the --cnv flag:
ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE> --cnvs <CNVs>Some mock input files are located in the ConSTRain/tests/data folder of this repository. We will use these files to give some small examples of how to run ConSTRain.
ConSTRain alignment \
-a ConSTRain/tests/data/duplication_cn3_out1.bam \
-r ConSTRain/tests/data/APC_repeats.bed \
-k resources/h_sapiens/h_sapiens_female.json | \
grep "^#CHROM" -A 1 | \
cut -f 1,2,4,5,9,10This will run ConSTRain on the mock input files.
Some logging information will be printed to stderr, followed by a VCF header line and a single STR locus (we extract only a few columns using cut here.
#CHROM POS REF ALT FORMAT duplication_cn3_out1
chr5 106700 AAAAAAAAAAAAAAA AAAAAAAAAA,AAAAAAAAAAAAAAAA GT:FT:CN:DP:FREQS:REPLEN 1/2:PASS:2:124:10,44|15,39|16,41:10,16In the above output, you can see that ConSTRain estimates a biallelic genotype since we specify chr2 to be represented by two homologues in the karyotype of this sample.
However, we have observed three distinct repeat lengths (REPLEN).
If we now run ConSTRain on the same input samples, but with the additional of CNV information indicating that this locus is actually located in an amplified region (with copy number three), we see the following:
ConSTRain alignment \
-a ConSTRain/tests/data/duplication_cn3_out1.bam \
-r ConSTRain/tests/data/APC_repeats.bed \
-k resources/h_sapiens/h_sapiens_female.json \
--cnvs ConSTRain/tests/data/cnv_test.bed | \
grep "^#CHROM" -A 1 | \
cut -f 1,2,4,5,9,10#CHROM POS REF ALT FORMAT duplication_cn3_out1
chr5 106700 AAAAAAAAAAAAAAA AAAAAAAAAA,AAAAAAAAAAAAAAAA GT:FT:CN:DP:FREQS:REPLEN 0/1/2:PASS:3:124:10,44|15,39|16,41:10,15,16We see that the same allele length distribution for this locus is observed as before (which makes sense), but that the interpretation is different! Since we have now told ConSTRain that this locus is located in an amplified region with copy number three, this same allele length distribution is now interpreted differently. ConSTRain now estimates that the most likely genotype for this STR consists of three different allele lengths.
NOTE: we also could have piped the output of the initial ConSTRain run to a VCF file, and then run ConSTRain vcf on it with the additional CNV information for the same result.
ConSTRain expects alignments to adhere to file specifications maintained by the GA4GH, and outputs variant calls in VCF format. This means that coordinates in BED files must be 0-based and open ended, in accordance with the BED file specification. However, specific formats are expected for other input files, which are outlined here.
- Karyotype: the organism's karyotype should be specified as a JSON file mapping the names of chromosomes to their ploidies.
E.g.,
{"chr1": 2, ... "chrX": 2, "chrY: 0"}for a human female sample,{"chr1": 2, ... "chrX": 1, "chrY: 1"}for a human male. It is critical that chromosome names in this file match chromosome names in the alignment file exactly. - Repeats: the location of repeats in the reference genome should be provided as a BED3+2 file, with the two extra columns indicating the repeat unit length and the repeat unit.
E.g.,
chr5 21004 21014 2 ATspecifies a dinucleotide repeat with sequenceATATATATATstarting at position 21004 on chromosome 5. Please note: ConSTRain currently only considers perfect repeat loci (i.e., the repeat unit is exactly the same each time, no interruptions or mismatches). If you provide a custom STR reference panel, please ensure that you only specify repeats of this nature, otherwise the ConSTRain output may not accurately reflect the genotype in the samples you're analysing. - Copy number variants: CNVs can be specified as a BED3+1 file, with the extra column indicating the (absolute!) copy number of the region affected by the CNV.
E.g.,
chr5 106680 106750 3. It is important to note that ConSTRain makes no effort to estimate the copy number of loci itself. To generate CNV calls, you will need to run one of the many available CNV calling tools on an alignment (or some orthogonal data modality) of the samples you're analysing. Alternatively, large sequencing efforts such as the 1000 Genomes project or GTEx will provide CNV calls for the samples in their cohort. Such cohorts could be a good starting point if you're interested in using ConSTRain to analyse the interplay between STR variablity and CNVs.
All of ConSTRain's command line arguments are listed here. For details on expected formats of the CNV, karyotype, and repeat file, see the section on input file formats below.
| long | short | subcommand | required | default | description |
|---|---|---|---|---|---|
| --alignment | -a | alignment | y | Input file to extract repeat allele lengths from. Can be SAM/BAM/CRAM | |
| --cnvs | both | n | Copy number variants for this individual. Expected format is BED3+1 | ||
| --flanksize | alignment | y | 5 | Size of flanking region around the target repeat that reads need to cover to be considered (both sides) | |
| --karyotype | -k | alignment | y | File containing chromosome names and their base ploidies. Expected format is JSON | |
| --max-cn | both | y | 20 | Maximum copy number to consider | |
| --max-norm-depth | both | n | Maximum normalised depth of coverage to perform allele length estimation. E.g., max-norm-depth of 30. means loci with more than 60 reads for a locus with copy number 2 will be skipped, 90 for copy number 3 etc. |
||
| --min-norm-depth | both | y | 1.0 | Minimum normalised depth of coverage to perform allele length estimation. E.g., min-norm-depth of 10. means at least 20 reads are needed for a locus with copy number 2, 30 for copy number 3 etc. (must be at least 1.) |
|
| --reference | alignment | n | Reference genome. Expected format is FASTA (NOT gzipped), index file should exist right next to FASTA. Required if alignment is in CRAM format | ||
| --repeats | -r | alignment | y | File specifying target repeat regions. Expected format is BED3+2 | |
| --sample | both | n (y for vcf) | infer from file name | Sample name | |
| --threads | both | y | 1 | Number of threads to use | |
| --vcf | -v | vcf | y | Input file to extract repeat allele lengths from (VCF) |
ConSTRain provides several INFO and FORMAT fields in its VCF output to describe repeat loci and genotyping results.
| INFO field | Description |
|---|---|
| END | End position of reference allele |
| RU | Repeat unit |
| PERIOD | Repeat period (length of unit) |
| REF | Repeat allele length in reference |
| FORMAT field | Description |
|---|---|
| GT | Genotype |
| FT | Filter tag. Contains PASS if all filters passed, otherwise reason for filter |
| CN | Copy number |
| DP | Number of fully spanning reads mapped to locus |
| FREQS | Frequencies observed for each allele length. Keys are allele lengths and values are the number of reads with that allele length |
| REPLEN | Genotype given in the number of times the unit is repeated for each allele |
The FT FORMAT field may hold different values. An overview of the meaning of different FT values is given here:
| FT value | Description |
|---|---|
| PASS | All filters passed |
| UNDEF | Undefined ConSTRain filter |
| DPZERO | No reads were mapped to locus |
| DPOOR | Normalised depth of coverage at locus was out of range specified by --min-norm-depth and --max-norm-depth command line arguments |
| CNZERO | Copy number was zero |
| CNOOR | Copy number was out of range specified by --max-cn command line argument |
| CNMISSING | No copy number set for locus. Can happen if contig is missing from karyotype or if only a part of the STR is affected by a CNA |
| AMBGT | Multiple genotypes are equally likely |
