Skip to content

Subcommand: afs heatmap

Lucas Czech edited this page Jun 21, 2023 · 22 revisions

Synopsis: Create a per-window heatmap of the allele frequency spectrum along the genome.

Usage: grenedalf afs-heatmap [options]

Required options:

  • --window-type

Documentation for grenedalf v0.2.0

Options

Input SAM/BAM/CRAM

--sam-path
TEXT:PATH(existing)=[] ...
List of sam/bam/cram files or directories to process. For directories, only files with the extension .sam[.gz]|.bam|.cram are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--sam-min-map-qual
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled mapping quality score [0-90] for a read in sam/bam/cram files to be considered. Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account. Default is 0, meaning no filtering by base quality.
--sam-min-base-qual
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled quality score [0-90] for a base in sam/bam/cram files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by base quality.
--sam-split-by-rg
FLAG Needs: --sam-path
Instead of considering the whole sam/bam/cram file as one large colletion of reads, use the @RG read group tag to split reads. Each read group is then considered a sample. Reads with an invalid (not in the header) read group tag or without a tag are ignored.
--sam-flags-include-all
TEXT Needs: --sam-path
Only use reads with all bits in the given value present in the FLAG field of the read. This is equivalent to the -f / --require-flags setting in samtools view, and uses the same flag names and their corresponding binary values. The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, space-, or vertiacal-bar-separated list of flag names as specified by samtools. We are more lenient in parsing flag names than samtools, and allow different capitalization and delimiteres such as dashes and underscores in the flag names as well.
--sam-flags-include-any
TEXT Needs: --sam-path
Only use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the --rf / --incl-flags / --include-flags setting in samtools view. See --sam-flags-include-all above for how to specify the value.
--sam-flags-exclude-all
TEXT Needs: --sam-path
Do not use reads with all bits set in the given value present in the FLAG field of the read. This is equivalent to the -G setting in samtools view. See --sam-flags-include-all above for how to specify the value.
--sam-flags-exclude-any
TEXT Needs: --sam-path
Do not use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the -F / --excl-flags / --exclude-flags setting in samtools view. See --sam-flags-include-all above for how to specify the value.

Input (m)pileup

--pileup-path
TEXT:PATH(existing)=[] ...
List of (m)pileup files or directories to process. For directories, only files with the extension .(plp|mplp|pileup|mpileup)[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--pileup-min-base-qual
UINT:UINT in [0 - 90]=0 Needs: --pileup-path
Minimum phred quality score [0-90] for a base in (m)pileup files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by phred quality score.
--pileup-quality-encoding
TEXT:{sanger,illumina-1.3,illumina-1.5,illumina-1.8,solexa}=sanger Needs: --pileup-path
Encoding of the quality scores of the bases in (m)pileup files, when using --pileup-min-base-qual. Default is "sanger", which seems to be the most common these days. Both "sanger" and "illumina-1.8" are identical and use an ASCII offset of 33, while "illumina-1.3" and "illumina-1.5" are identical with an ASCII offset of 64 (we provide different names for completeness). Lastly, "solexa" has an offset of 64, but uses a different equation (not phred score) for the encoding.

Input sync

--sync-path
TEXT:PATH(existing)=[] ...
List of sync (as specified by PoPoolation2) files or directories to process. For directories, only files with the extension .sync[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.

Input VCF/BCF

--vcf-path
TEXT:PATH(existing)=[] ...
List of vcf/bcf files or directories to process. For directories, only files with the extension .vcf[.gz]|.bcf are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. This expects that the input file has the per-sample VCF FORMAT field AD (alleleic depth) given, containing the counts of the reference and alternative base. This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We then interpret the AD field as the allele counts of each pool of individuals.

Input frequency table

--frequency-table-path
TEXT:PATH(existing)=[] ...
List of frequency table files or directories to process. For directories, only files with the extension .(csv|tsv)[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--frequency-table-separator-char
TEXT:{comma,tab,space,semicolon}=comma Needs: --frequency-table-path
Separator char between fields of the frequency table input.
--frequency-table-int-factor
FLOAT:POSITIVE=0 Needs: --frequency-table-path
For frequency table input that only contains allele frequencies, without any information on coverage, we need to transform those frequencies into counts for our internal processing. This number is multiplied by the frequency to obtain these pseudo-counts. By default, we use the largest number that is possible with our numerical types.
--frequency-table-freq-is-ref
FLAG Needs: --frequency-table-path
For frequency table input that contains allele frequencies, we need to decide whether those frequencies represent the reference or the alternative allele. By default, we assume the latter, i.e., values are interpreted as alternative allele frequencies. Use this flag to instead interpret them as reference allele frequencies.
--frequency-table-chr-column
TEXT Needs: --frequency-table-path
Specify the name of the chromosome column in the header, case sensitive. By default, we look for columns named "chromosome", "chrom", "chr", or "contig", case insensitive.
--frequency-table-pos-column
TEXT Needs: --frequency-table-path
Specify the name of the position column in the header, case sensitive. By default, we look for columns named "position" or "pos", case insensitive.
--frequency-table-ref-base-column
TEXT Needs: --frequency-table-path
Specify the name of the reference base column in the header, case sensitive. By default, we look for columns named "reference", "referencebase", "ref", or "refbase", case insensitive, and ignoring any extra punctuation marks.
--frequency-table-alt-base-column
TEXT Needs: --frequency-table-path
Specify the name of the alternative base column in the header, case sensitive. By default, we look for columns named "alternative", "alternativebase", "alt", or "altbase", case insensitive, and ignoring any extra punctuation marks.
--frequency-table-sample-ref-count-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample reference count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "reference", "referencebase", "ref", or "refbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.ref_cnt" indicates the reference count column for sample "S1".
--frequency-table-sample-alt-count-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample alternative count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "alternative", "alternativebase", "alt", or "altbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.alt_cnt" indicates the alternative count column for sample "S1".
--frequency-table-sample-freq-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample frequency columns in the header, case sensitive. By default, we look for column names having "frequency", "freq", "maf", "af", or "allelefrequency", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.freq" indicates the frequency column for sample "S1". Note that when the input data contains frequencies, but no reference or alternative base columns, such as HAF-pipe output tables, we cannot know the bases, and will hence guess. To properly set the reference bases, consider providing the --reference-genome-file option.
--frequency-table-sample-cov-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample coverage (i.e., depth) columns in the header, case sensitive. By default, we look for column names having "coverage", "cov", "depth", or "ad", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.cov" indicates the coverage column for sample "S1".

Input Settings

--multi-file-locus-set
TEXT:{union,intersection}=union
When multiple input files are provided, select whether the union of all their loci is used, or their intersection. For their union, input files that do not have data at a particular locus are considered to have zero coverage at that locus. Note that we allow to use multiple input files even with different file types.
--reference-genome-fasta-file
TEXT:FILE Excludes: --reference-genome-dict-file --reference-genome-fai-file
Provide a reference genome in .fasta[.gz] format. This allows to correctly assign the reference bases in file formats that do not store them, and serves as an integrity check in those that do. It further is used as a sequence dictionary to determine the chromosome order and length, on behalf of a dict or fai file.
--reference-genome-dict-file
TEXT:FILE Excludes: --reference-genome-fasta-file --reference-genome-fai-file
Provide a reference genome sequence dictionary in .dict format. It is used to determine the chromosome order and length, without having to provide the full reference genome.
--reference-genome-fai-file
TEXT:FILE Excludes: --reference-genome-fasta-file --reference-genome-dict-file
Provide a reference genome sequence dictionary in .fai format. It is used to determine the chromosome order and length, without having to provide the full reference genome.

Sample Names and Filters

--sample-name-list
TEXT Excludes: --sample-name-prefix
Some file types do not contain sample names, such as (m)pileup or sync files. For such file types, sample names can here be provided as either (1) a comma- or tab-separated list, or (2) as a file with one sample name per line, in the same order as samples are in the actual input file. We then use these names in the output and the --filter-samples-include and --filter-samples-exclude options. If not provided, we simply use numbers 1..n as sample names for these files types. Note that this option can only be used if a single file is given as input. Alternatively, use --sample-name-prefix to provide a prefix for this sample numbering.
--sample-name-prefix
TEXT Excludes: --sample-name-list
Some file types do not contain sample names, such as (m)pileup or sync files. For such file types, this prefix followed by indices 1..n can be used instead to provide unique names per sample that we use in the output and the --filter-samples-include and --filter-samples-exclude options. For example, use "Sample_" as a prefix. If not provided, we simply use numbers 1..n as sample names for these files types. This prefix also works if multiple files are given as input. Alternatively, use --sample-name-list to directly provide a list of sample names.
--filter-samples-include
TEXT Excludes: --filter-samples-exclude
Sample names to include (all other samples are excluded); either (1) a comma- or tab-separated list, or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option considers --sample-name-list or --sample-name-prefix for file types that do not contain sample names. Note that this option can only be used if a single file is given as input.
--filter-samples-exclude
TEXT Excludes: --filter-samples-include
Sample names to exclude (all other samples are included); either (1) a comma- or tab-separated list, or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option considers --sample-name-list or --sample-name-prefix for file types that do not contain sample names. Note that this option can only be used if a single file is given as input.

Region Filters

--filter-region
TEXT=[] ...
Genomic region to filter for, in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", or "chr:start..end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided, see also --filter-region-set.
--filter-region-list
TEXT:FILE=[] ...
Genomic regions to filter for, as a file with one region per line, either in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", "chr:start..end", or tab- or space-delimited "chr position" or "chr start end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided, see also --filter-region-set.
--filter-region-bed
TEXT:FILE=[] ...
Genomic regions to filter for, as a BED file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open [) interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. Multiple region options can be provided, see also --filter-region-set.
--filter-region-gff
TEXT:FILE=[] ...
Genomic regions to filter for, as a GFF2/GFF3/GTF file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Multiple region options can be provided, see also --filter-region-set.
--filter-region-map-bim
TEXT:FILE=[] ...
Genomic positions to filter for, as a MAP or BIM file as used in PLINK. This only uses the chromosome and coordinate per line, and ignores everything else in the file. Multiple region options can be provided, see also --filter-region-set.
--filter-region-vcf
TEXT:FILE=[] ...
Genomic positions to filter for, as a VCF/BCF file (such as a known-variants file). This only uses the chromosome and position per line, and ignores everything else in the file. Multiple region options can be provided, see also --filter-region-set.
--filter-region-set
TEXT:{union,intersection}=union
If multiple genomic region filters are set, decide on how to combine the loci of these filters.

Window

--window-type
TEXT:{sliding,queue,single,regions}=sliding
Required. Type of window to use. Depending on the type, additional options might need to be provided.
(1) sliding: Typical sliding window over intervals of fixed length (in bases) along the genome.
(2) queue: Sliding window, but instead of using a fixed length of bases along the genome, it uses a fixed number of positions of the input data. Typically used for windowing over variant positions such as (biallelic) SNPs, and useful for example when SNPs are sparse in the genome.
(3) single: Treat each position of the input data as an individual window of size 1. This is typically used to process single SNPs, and equivalent to sliding or queue with a width/count of 1.
(4) regions: Windows corresponding to some regions of interest, such as genes. The regions need to be provided, and can be overlapping or nested as needed.
--window-sliding-width
UINT=0
Required when using --window-type sliding: Width of each window along the chromosome, in bases.
--window-sliding-stride
UINT=0
When using --window-type sliding: Stride between windows along the chromosome, that is how far to move to get to the next window. If set to 0 (default), this is set to the same value as the --window-sliding-width.
--window-queue-count
UINT=0
Required when using --window-type queue: Number of positions in the genome in each window. This is most commonly used when also filtering for variant positions such as (biallelic) SNPs (which most commands do implicitly), so that each window of the analysis conists of a fixed number of SNPs, instead of a fixed length alogn the genome.
--window-queue-stride
UINT=0
When using --window-type queue: Stride of positions between windows along the chromosome, that is how many positions does the window move forward each time. If set to 0 (default), this is set to the same value as the --window-queue-count, meaning that each new window consists of new positions.
--window-region
TEXT=[] ...
When using --window-type regions: Genomic region to process as windows, in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", or "chr:start..end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided to add region windows to be processed.
--window-region-list
TEXT:FILE=[] ...
When using --window-type regions: Genomic regions to process as windows, as a file with one region per line, either in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", "chr:start..end", or tab- or space-delimited "chr position" or "chr start end". Positions are 1-based and inclusive (closed intervals). Multiple region options can be provided to add region windows to be processed.
--window-region-bed
TEXT:FILE=[] ...
When using --window-type regions: Genomic regions to process as windows, as a BED file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open [) interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. Multiple region options can be provided to add region windows to be processed.
--window-region-gff
TEXT:FILE=[] ...
When using --window-type regions: Genomic regions to process as windows, as a GFF2/GFF3/GTF file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Multiple region options can be provided to add region windows to be processed.
--window-region-skip-empty
FLAG
When using --window-type regions: In cases where there is no data in the input files for a region window, by default, we produce some "empty" or NaN output. With this option however, regions without data are skipped in the output.

Settings

--resolution
UINT:POSITIVE=100
Resolution of the spectrum histogram, that is, the number of bins of frequencies. This is hence also the vertical resolution (in pixels) of the resulting images.
--max-frequency
FLOAT:(FLOAT in [0 - 1]) AND (POSITIVE)=1
Maximum frequency, that is, the y-axis cutoff; default is 1.0. Frequencies above the maximum will be assinged to the highest bin. When using --spectrum-type folded, consider setting this option to 0.5, as that is the maximum of the folded spectrum.
--spectrum-type
TEXT:{folded,unfolded}=unfolded
Type of spectrum to compute. That is, which frequencies do the values in the heat map represent: Either the frequency of the alternative allele (unfolded, default; uses the highest count/frequency allele that is not the reference allele as given in the input file, with frequencies in range [0, 1]), or the frequency of the minor allele (folded; uses the allele with the second highest count/frequency after the major (most common) allele, with frequencies in range [0, 0.5]).
--average-method
TEXT:{arithmetic,geometric,harmonic,counts}=harmonic
If multiple samples are used (present in the file, and not filtered out), either compute the average of the frequencies per sample (using either arithmetic, geometric, or harmonic mean), or sum up the allele counts per sample, and then compute the frequency from that. The former gives each sample the same weight, while in the latter case, samples with more counts (more sequence reads) have a higher influence. If harmonic mean is used, we apply a correction for frequencies of zero, which happens in samples that do not have any alternative/minor allele counts.
--fold-undetermined-positions
FLAG
Only relevant if --spectrum-type unfolded is set. By default, positions with undetermined reference alleles ('N') are skipped in the unfolded spectrum. If however this option is set, these positions are instead folded; that is, we then assume the major allele with the highest count/frequency to be the reference allele.
--write-individual-bmps
FLAG
If set, write individual heatmap files for each chromosome, in bitmap format, in addition to the svg heatmap that contains all (not filtered) chromosomes.

Output

--out-dir
TEXT=.
Directory to write files to
--file-prefix
TEXT
File prefix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix
TEXT
File suffix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--compress
FLAG
If set, compress the output files using gzip. Output file extensions are automatically extended by .gz.

Global Options

--allow-file-overwriting
FLAG
Allow to overwrite existing output files instead of aborting the command.
--verbose
FLAG
Produce more verbose output.
--threads
UINT
Number of threads to use for calculations. If not set, we guess a reasonable number of threads, by looking at the environmental variables (1) OMP_NUM_THREADS (OpenMP) and (2) SLURM_CPUS_PER_TASK (slurm), as well as (3) the hardware concurrency, taking hyperthreads into account, in the given order of precedence.
--log-file
TEXT
Write all output to a log file, in addition to standard output to the terminal.

Citation

When using this method, please do not forget to cite

Lucas Czech, Jeffrey Spence, Moises Exposito-Alonso. grenedalf: population genetic statistics for the next generation of pool sequencing. arXiv, 2023. doi:10.48550/arXiv.2306.11622