ConSTRain: copy number guided STR allele inference

ConSTRain is a short tandem repeat (STR) variant caller that can account for copy number variants (CNVs), aneuploidies, and polyploid genomes. It is accurate and fast, needing less than 20 minutes to genotype >1.7 million STRs in a 100X alignment of human whole-genome sequencing reads on 32 threads. For more detailed information on ConSTRain, please refer to the preprint.

Method overview
Installation
Running ConSTRain
Input files and their formats
Command line arguments
INFO and FORMAT fields

In case you encounter any problems with ConSTRain or have a question, feel free to open an issue or send an email to max.verbiest@zhaw.ch.

Method overview

To infer STR genotypes, ConSTRain generates all possible allele length distributions for each locus and returns the one that best matches the observed allele length distribution. An overview of how ConSTRain works internally is shown below.

ConSTRain overview and example. (1) An STR locus are loaded from the input files. The locus reference information is parsed from the STR panel. The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNV. (2) Reads overlapping the STR region are extracted from the alignment file, and the length of the STR region in each read is determined. (3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept. (4) This yields the final observed allele length distribution. (5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G. (6) From G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR copy number (3 in the example). Each row in D corresponds to the expected allele length distribution of one of the genotypes in G. (7) The expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row in D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value. (9) The genotype in G with the lowest error is selected (10) and reported in the output. The inferred genotype of the STR locus in this example consists of an allele of 4 CAG units (present once), an allele of 5 CAG units (present once), and an allele of 8 CAG units (also present once).

Installation

Using the pre-compiled binaries

ConSTRain releases ship with pre-compiled binaries for MacOS (Intel) and x86_64 Linux. You can find them here. Please notify us of any issues or if you'd like us to include pre-compiled binaries for other targets.

Building from source

In order to build ConSTRain from the source code, you will need to install the Rust programming language. Then, clone this repository and build ConSTRain as follows:

git clone https://github.com/acg-team/ConSTRain.git
cd ConSTRain/ConSTRain
cargo build --release --bin ConSTRain

The binary will be created at ./target/release/ConSTRain (alternatively, cargo install --path . from within the ConSTRain/ConSTRain directory will create the binary and put it in ~/.cargo/bin ).

Possible issues

libclang is not found

Building hts-sys fails with the following error:

Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*' 'libclang-*.so.*'], set the LIBCLANG_PATH environment variable to a path where one of these files can be found (invalid: [])"

This is caused because Bindgen requires libclang. The problem can be solved by installing clang (see instructions here) or, if it is already installed, providing the location of the directory containing libclang.so through the LIBCLANG_PATH environment variable:

LIBCLANG_PATH=/path/to/clang/lib cargo build --release --bin ConSTRain

System header files are not found

Building hts-sys fails with an error like this:

./htslib/htslib/hts.h:31:10: fatal error: 'stddef.h' file not found

Caused because some system header file (e.g., stddef.h) is not found. Address this by providing the location of the directory containing system header files through the C_INCLUDE_PATH environment variable:

C_INCLUDE_PATH=/path/to/include cargo build --release --bin ConSTRain

Running ConSTRain

Some basic ConSTRain invocations are shown here. For details on all command line arguments, run ConSTRain --help or go to the command line arguments section below.

Calling STR variants from aligned sequencing reads is done via the ConSTRain alignment subcommand. To run, ConSTRain alignment needs at least a BED file specifiying locations of STRs in the reference genome, a JSON file encoding the target chromosomes and their ploidies, and an alignment in SAM/BAM/CRAM format. For some species, input files are provided in resources directory (currently, only Homo sapiens and Musa acuminata). A basic ConSTRain alignment invocation looks like this:

ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>

By default, ConSTRain prints logging information up to and including 'info' level to stderr. To enable more in-depth logging information, set the RUST_LOG environment variable, e.g.:

RUST_LOG=debug ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>

If you know there are CNVs in the sample you're analysing, ConSTRain can account for these if you supply the affected regions as a BED file via the --cnvs flag:

ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS> --cnvs <CNVs>

Re-running ConSTRain on a VCF file

Once an alignment has been analysed using ConSTRain alignment the generated VCF file can be used to re-run ConSTRain. This is useful if you decide you want to use different filtering parameters, or maybe you have access to new CNV information that was not available when you first analysed a sample. Rather than having to analyse the alignment again, you can use the ConSTRain vcf subcommand to re-estimate STR genotypes. This is possible because ConSTRain includes the observed allele length distribution for each STR as a FORMAT field in the VCF file. Running ConSTRain vcf is typically a matter of a few seconds.

The basic invocation of ConSTRain vcf is:

ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE>

Again, you can add CNV information via the --cnv flag:

ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE> --cnvs <CNVs>

Examples

Some mock input files are located in the ConSTRain/tests/data folder of this repository. We will use these files to give some small examples of how to run ConSTRain.

ConSTRain alignment \
    -a ConSTRain/tests/data/duplication_cn3_out1.bam \
    -r ConSTRain/tests/data/APC_repeats.bed \
    -k resources/h_sapiens/h_sapiens_female.json | \
    grep "^#CHROM" -A 1 | \
    cut -f 1,2,4,5,9,10

This will run ConSTRain on the mock input files. Some logging information will be printed to stderr, followed by a VCF header line and a single STR locus (we extract only a few columns using cut here.

#CHROM  POS     REF     ALT     FORMAT  duplication_cn3_out1
chr5    106700  AAAAAAAAAAAAAAA AAAAAAAAAA,AAAAAAAAAAAAAAAA     GT:FT:CN:DP:FREQS:REPLEN        1/2:PASS:2:124:10,44|15,39|16,41:10,16

In the above output, you can see that ConSTRain estimates a biallelic genotype since we specify chr2 to be represented by two homologues in the karyotype of this sample. However, we have observed three distinct repeat lengths (REPLEN). If we now run ConSTRain on the same input samples, but with the additional of CNV information indicating that this locus is actually located in an amplified region (with copy number three), we see the following:

ConSTRain alignment \
    -a ConSTRain/tests/data/duplication_cn3_out1.bam \
    -r ConSTRain/tests/data/APC_repeats.bed \
    -k resources/h_sapiens/h_sapiens_female.json \
    --cnvs ConSTRain/tests/data/cnv_test.bed | \
    grep "^#CHROM" -A 1 | \
    cut -f 1,2,4,5,9,10

#CHROM  POS     REF     ALT     FORMAT  duplication_cn3_out1
chr5    106700  AAAAAAAAAAAAAAA AAAAAAAAAA,AAAAAAAAAAAAAAAA     GT:FT:CN:DP:FREQS:REPLEN        0/1/2:PASS:3:124:10,44|15,39|16,41:10,15,16

We see that the same allele length distribution for this locus is observed as before (which makes sense), but that the interpretation is different! Since we have now told ConSTRain that this locus is located in an amplified region with copy number three, this same allele length distribution is now interpreted differently. ConSTRain now estimates that the most likely genotype for this STR consists of three different allele lengths.

NOTE: we also could have piped the output of the initial ConSTRain run to a VCF file, and then run ConSTRain vcf on it with the additional CNV information for the same result.

Input file formats

ConSTRain expects alignments to adhere to file specifications maintained by the GA4GH, and outputs variant calls in VCF format. This means that coordinates in BED files must be 0-based and open ended, in accordance with the BED file specification. However, specific formats are expected for other input files, which are outlined here.

Karyotype: the organism's karyotype should be specified as a JSON file mapping the names of chromosomes to their ploidies. E.g., {"chr1": 2, ... "chrX": 2, "chrY: 0"} for a human female sample, {"chr1": 2, ... "chrX": 1, "chrY: 1"} for a human male. It is critical that chromosome names in this file match chromosome names in the alignment file exactly.
Repeats: the location of repeats in the reference genome should be provided as a BED3+2 file, with the two extra columns indicating the repeat unit length and the repeat unit. E.g., chr5 21004 21014 2 AT specifies a dinucleotide repeat with sequence ATATATATAT starting at position 21004 on chromosome 5. Please note: ConSTRain currently only considers perfect repeat loci (i.e., the repeat unit is exactly the same each time, no interruptions or mismatches). If you provide a custom STR reference panel, please ensure that you only specify repeats of this nature, otherwise the ConSTRain output may not accurately reflect the genotype in the samples you're analysing.
Copy number variants: CNVs can be specified as a BED3+1 file, with the extra column indicating the (absolute!) copy number of the region affected by the CNV. E.g., chr5 106680 106750 3. It is important to note that ConSTRain makes no effort to estimate the copy number of loci itself. To generate CNV calls, you will need to run one of the many available CNV calling tools on an alignment (or some orthogonal data modality) of the samples you're analysing. Alternatively, large sequencing efforts such as the 1000 Genomes project or GTEx will provide CNV calls for the samples in their cohort. Such cohorts could be a good starting point if you're interested in using ConSTRain to analyse the interplay between STR variablity and CNVs.

Command line arguments

All of ConSTRain's command line arguments are listed here. For details on expected formats of the CNV, karyotype, and repeat file, see the section on input file formats below.

long	short	subcommand	required	default	description
--alignment	-a	alignment	y		Input file to extract repeat allele lengths from. Can be SAM/BAM/CRAM
--cnvs		both	n		Copy number variants for this individual. Expected format is BED3+1
--flanksize		alignment	y	5	Size of flanking region around the target repeat that reads need to cover to be considered (both sides)
--karyotype	-k	alignment	y		File containing chromosome names and their base ploidies. Expected format is JSON
--max-cn		both	y	20	Maximum copy number to consider
--max-norm-depth		both	n		Maximum normalised depth of coverage to perform allele length estimation. E.g., `max-norm-depth` of 30. means loci with more than 60 reads for a locus with copy number 2 will be skipped, 90 for copy number 3 etc.
--min-norm-depth		both	y	1.0	Minimum normalised depth of coverage to perform allele length estimation. E.g., `min-norm-depth` of 10. means at least 20 reads are needed for a locus with copy number 2, 30 for copy number 3 etc. (must be at least 1.)
--reference		alignment	n		Reference genome. Expected format is FASTA (NOT gzipped), index file should exist right next to FASTA. Required if alignment is in CRAM format
--repeats	-r	alignment	y		File specifying target repeat regions. Expected format is BED3+2
--sample		both	n (y for vcf)	infer from file name	Sample name
--threads		both	y	1	Number of threads to use
--vcf	-v	vcf	y		Input file to extract repeat allele lengths from (VCF)

ConSTRain INFO and FORMAT fields

ConSTRain provides several INFO and FORMAT fields in its VCF output to describe repeat loci and genotyping results.

INFO fields

INFO field	Description
END	End position of reference allele
RU	Repeat unit
PERIOD	Repeat period (length of unit)
REF	Repeat allele length in reference

FORMAT fields

FORMAT field	Description
GT	Genotype
FT	Filter tag. Contains PASS if all filters passed, otherwise reason for filter
CN	Copy number
DP	Number of fully spanning reads mapped to locus
FREQS	Frequencies observed for each allele length. Keys are allele lengths and values are the number of reads with that allele length
REPLEN	Genotype given in the number of times the unit is repeated for each allele

Filter tags

The FT FORMAT field may hold different values. An overview of the meaning of different FT values is given here:

FT value	Description
PASS	All filters passed
UNDEF	Undefined ConSTRain filter
DPZERO	No reads were mapped to locus
DPOOR	Normalised depth of coverage at locus was out of range specified by --min-norm-depth and --max-norm-depth command line arguments
CNZERO	Copy number was zero
CNOOR	Copy number was out of range specified by --max-cn command line argument
CNMISSING	No copy number set for locus. Can happen if contig is missing from karyotype or if only a part of the STR is affected by a CNA
AMBGT	Multiple genotypes are equally likely

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
ConSTRain		ConSTRain
constrain_utils		constrain_utils
images		images
resources		resources
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ConSTRain: copy number guided STR allele inference

Method overview

Installation

Using the pre-compiled binaries

Building from source

Possible issues

libclang is not found

System header files are not found

Running ConSTRain

Re-running ConSTRain on a VCF file

Examples

Input file formats

Command line arguments

ConSTRain INFO and FORMAT fields

INFO fields

FORMAT fields

Filter tags

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

License

acg-team/ConSTRain

Folders and files

Latest commit

History

Repository files navigation

ConSTRain: copy number guided STR allele inference

Method overview

Installation

Using the pre-compiled binaries

Building from source

Possible issues

libclang is not found

System header files are not found

Running ConSTRain

Re-running ConSTRain on a VCF file

Examples

Input file formats

Command line arguments

ConSTRain INFO and FORMAT fields

INFO fields

FORMAT fields

Filter tags

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages