Skip to content

Subcommand: fst cathedral

Lucas Czech edited this page Oct 10, 2024 · 13 revisions

Synopsis: Compute the data for an FST cathedral plot.

Usage: grenedalf fst-cathedral [options]

Required options:

  • --method
  • --pool-sizes

Documentation for grenedalf v0.6.2

Table of contents:

Description

Compute the data for an FST cathedral plot. This command does the main part of the computation: reading the input sample data, computing per-position FST values along the chromosomes, and computing the full matrix of values that make the plot (i.e., that are then turned into the pixels of the plot). For each sample pair and each chromosome, the output of the command are two files: a csv file containing the per-pixel value matrix for the plot, and a json file describing the plot and its meta-data.

The matrix simply contains the FST values for each pixel of the plot. Each row corresponds to one window size, and each cell along the columns is the FST value for a window centered on the position along the genome that the column corresponds to.

Afterwards, use the cathedral-plot command to create plots from these files, in a plain bmp format, as well as an svg image that also contains axis, a legend, etc. Making a cathedral plot is split into these two commands: The computation (this command here) can take a while, but one might want to iterate quickly over several color schemes and other plot-specific settings. Having this as an independent command hence avoids recomputing the value matrix each time.

Note that the command processes all given pairs of samples at the same time. We split the computation per chromosome, but for each chromosome, we need to keep data memory in the order of the number of variant positions. Hence, for a large number of samples, this can result in high memory usage. If that is the case, simply run this command individually on subsets of sample pairs.

Also, we here do not account for missing data in the windows. That is, the pi values involved in the computation of FST are not normalized per window, and instead simply their sum is used to compute the Nei/Hudson estimators for each window. Properly accounting for missing data would entail to keep information on that during the computation per pixel as well, which we have opted not to implement at this time. As cathedral plots are meant as a visual explorative tool anyway, this should be okay. It will not make a large difference unless there is extremely uneven coverage between the samples. If you need this though, please let us know.

FST Cathedral Plots

FST Cathedral Plot.

The above shows what we call a "cathedral plot" of FST on chromosome 1 between two pool sequencing samples of A. thaliana, which reveals the differentiation structure between two populations across different scales, with darker regions having a higher FST between the two populations.

The x-axis corresponds to chromosome 1 of A. thaliana, which is ≈30mio bases long. The y-axis position (rows of pixels) determines the window size to be used for that row. From top to bottom, windows get smaller, with an exponentially decaying size. The top row corresponds to windows of the full length of the chromosome, and towards the bottom, windows get smaller until their size matches the resolution of the image. Here, we plotted the image with a width of 1500 pixels, so that each pixel in the bottom row represents ≈20k positions in the genome. For a given image width, this is the highest resolution that can be displayed; at this point, each window is exactly as wide as the pixel it corresponds to, and windows between pixels in the row are not overlapping with each other any more. Windows are centered around their pixel, meaning that towards the left and right of the plot (and in particular towards the top of the image, with larger window sizes), the actual number of positions within a window is truncated.

In other words, for each pixel, the plot involves accumulating values in a window centered on that pixel, with the width of the window determined by the row in the image. Each pixel hence corresponds to a distinct window width and position, and shows the FST value in that window. Our implementation computes intermediate per-SNP values once; for FST for example, we compute π within, π between, and π total for each SNP, as explained in our equations document. Per window, these values are then accumulated the resulting FST is computed.

The plot hence "zooms in" from larger to smaller windows, with increasing resolution along the genome towards the bottom. This can help to visualize the scale of a statistic across different window sizes, showing its broad and fine structure, and allows picking appropriate window sizes for further analyses. Note that we chose exponential decay for the window size, as it offers a good balance between revealing both broad structures and fine details. This results in the y-axis being log-scaled.

Options

Input SAM/BAM/CRAM

--sam-path
TEXT:PATH(existing)=[] ...
List of sam/bam/cram files or directories to process. For directories, only files with the extension .sam[.gz]|.bam|.cram are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--sam-min-map-qual
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled mapping quality score [0-90] for a read in sam/bam/cram files to be considered. Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account. Default is 0, meaning no filtering by base quality.
--sam-min-base-qual
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled quality score [0-90] for a base in sam/bam/cram files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by base quality.
--sam-split-by-rg
FLAG Needs: --sam-path
Instead of considering the whole sam/bam/cram file as one large colletion of reads, use the @RG read group tag to split reads. Each read group is then considered a sample. Reads with an invalid (not in the header) read group tag or without a tag are ignored.
--sam-flags-include-all
TEXT Needs: --sam-path
Only use reads with all bits in the given value present in the FLAG field of the read. This is equivalent to the -f / --require-flags setting in samtools view, and uses the same flag names and their corresponding binary values. The value can be specified in hex by beginning with 0x (i.e., /^0x[0-9A-F]+/), in octal by beginning with 0 (i.e., /^0[0-7]+/), as a decimal number not beginning with '0', or as a comma-, plus-, space-, or vertiacal-bar-separated list of flag names as specified by samtools. We are more lenient in parsing flag names than samtools, and allow different capitalization and delimiteres such as dashes and underscores in the flag names as well.
--sam-flags-include-any
TEXT Needs: --sam-path
Only use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the --rf / --incl-flags / --include-flags setting in samtools view. See --sam-flags-include-all above for how to specify the value.
--sam-flags-exclude-all
TEXT Needs: --sam-path
Do not use reads with all bits set in the given value present in the FLAG field of the read. This is equivalent to the -G setting in samtools view. See --sam-flags-include-all above for how to specify the value.
--sam-flags-exclude-any
TEXT Needs: --sam-path
Do not use reads with any bits set in the given value present in the FLAG field of the read. This is equivalent to the -F / --excl-flags / --exclude-flags setting in samtools view. See --sam-flags-include-all above for how to specify the value.

Input (m)pileup

--pileup-path
TEXT:PATH(existing)=[] ...
List of (m)pileup files or directories to process. For directories, only files with the extension .(plp|mplp|pileup|mpileup)[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--pileup-min-base-qual
UINT:UINT in [0 - 90]=0 Needs: --pileup-path
Minimum phred quality score [0-90] for a base in (m)pileup files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by phred quality score.
--pileup-quality-encoding
TEXT:{sanger,illumina-1.3,illumina-1.5,illumina-1.8,solexa}=sanger Needs: --pileup-path
Encoding of the quality scores of the bases in (m)pileup files, when using --pileup-min-base-qual. Default is "sanger", which seems to be the most common these days. Both "sanger" and "illumina-1.8" are identical and use an ASCII offset of 33, while "illumina-1.3" and "illumina-1.5" are identical with an ASCII offset of 64 (we provide different names for completeness). Lastly, "solexa" has an offset of 64, but uses a different equation (not phred score) for the encoding.

Input sync

--sync-path
TEXT:PATH(existing)=[] ...
List of sync (as specified by PoPoolation2) files or directories to process. For directories, only files with the extension .sync[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.

Input VCF/BCF

--vcf-path
TEXT:PATH(existing)=[] ...
List of vcf/bcf files or directories to process. For directories, only files with the extension .vcf[.gz]|.bcf are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. This expects that the input file has the per-sample VCF FORMAT field AD (alleleic depth) given, containing the counts of the reference and alternative base. This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We then interpret the AD field as the allele counts of each pool of individuals. Note that only SNP positions are used; positions that contain indels and other non-SNP variants are skipped.

Input frequency table

--frequency-table-path
TEXT:PATH(existing)=[] ...
List of frequency table files or directories to process. For directories, only files with the extension .(csv|tsv)[.gz] are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times.
--frequency-table-separator-char
TEXT:{comma,tab,space,semicolon}=comma Needs: --frequency-table-path
Separator char between fields of the frequency table input.
--frequency-table-missing-value
TEXT Needs: --frequency-table-path
Marker for denoting missing values in the table. By default, we use ., nan, and na.
--frequency-table-depth-factor
FLOAT:POSITIVE=0 Needs: --frequency-table-path
For frequency table input that only contains allele frequencies, without any information on read depth, we need to transform those frequencies into counts for our internal processing. This number is multiplied by the frequency to obtain these pseudo-counts. By default, we use 1000000, to get a reasonable interger approximation of the floating point frequency. This is of course above any typical read depth, but allows for more accurate counts when using for instance haplotype-corrected frequencies such as those from HAF-pipe.
--frequency-table-freq-is-ref
FLAG Needs: --frequency-table-path
For frequency table input that contains allele frequencies, we need to decide whether those frequencies represent the reference or the alternative allele. By default, we assume the latter, i.e., values are interpreted as alternative allele frequencies. Use this flag to instead interpret them as reference allele frequencies.
--frequency-table-chr-column
TEXT Needs: --frequency-table-path
Specify the name of the chromosome column in the header, case sensitive. By default, we look for columns named "chromosome", "chrom", "chr", or "contig", case insensitive.
--frequency-table-pos-column
TEXT Needs: --frequency-table-path
Specify the name of the position column in the header, case sensitive. By default, we look for columns named "position" or "pos", case insensitive.
--frequency-table-ref-base-column
TEXT Needs: --frequency-table-path
Specify the name of the reference base column in the header, case sensitive. By default, we look for columns named "reference", "referencebase", "ref", or "refbase", case insensitive, and ignoring any extra punctuation marks.
--frequency-table-alt-base-column
TEXT Needs: --frequency-table-path
Specify the name of the alternative base column in the header, case sensitive. By default, we look for columns named "alternative", "alternativebase", "alt", or "altbase", case insensitive, and ignoring any extra punctuation marks.
--frequency-table-sample-ref-count-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample reference count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "reference", "referencebase", "ref", or "refbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.ref_cnt" indicates the reference count column for sample "S1".
--frequency-table-sample-alt-count-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample alternative count columns in the header, case sensitive. By default, we look leniently for column names that combine any of "alternative", "alternativebase", "alt", or "altbase" with any of "counts", "count", "cnt", or "ct", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.alt_cnt" indicates the alternative count column for sample "S1".
--frequency-table-sample-freq-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample frequency columns in the header, case sensitive. By default, we look for column names having "frequency", "freq", "maf", "af", or "allelefrequency", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.freq" indicates the frequency column for sample "S1". Note that when the input data contains frequencies, but no reference or alternative base columns, such as HAF-pipe output tables, we cannot know the bases, and will hence guess. To properly set the reference bases, consider providing the --reference-genome-fasta option.
--frequency-table-sample-depth-column
TEXT Needs: --frequency-table-path
Specify the exact prefix or suffix of the per-sample read depth columns in the header, case sensitive. By default, we look for column names having "readdepth", "depth", "coverage", "cov", or "ad", case insensitive, and ignoring any extra punctuation marks, as a prefix or suffix, with the remainder of the column name used as the sample name. For example, "S1.read-depth" indicates the read depth column for sample "S1".

Input Settings

--multi-file-locus-set
TEXT:{union,intersection}=union
When multiple input files are provided, select whether the union of all their loci is used (outer join), or their intersection (inner join). For their union, input files that do not have data at a particular locus are considered as missing at that locus. Note that we allow to use multiple input files even with different file types.
--make-gapless
FLAG
By default, we only operate on the positions for which there is data. In particular, positions that are absent in the input are completely ignored; they do not even show up in the missing column of output tables. This is because for the statistics, data being absend or (marked as) missing is merely a sementic distinction, but it does not change the results. However, it might make processing with downstream tools easier if the output contains all positions, for instance when using single windows. With this option, all absent positions are filled in as missing data, so that they show up in the missing column and as entries in single windows. If a referene genome or dictionary is given, this might also include positions beyond where there is input data, up until the length of each chromosome. Note that this can lead to large ouput tables when processing single positions.
--reference-genome-fasta
TEXT:FILE Excludes: --reference-genome-dict --reference-genome-fai
Provide a reference genome in .fasta[.gz] format. This allows to correctly assign the reference bases in file formats that do not store them, and serves as an integrity check in those that do. It further is used as a sequence dictionary to determine the chromosome order and length, on behalf of a dict or fai file.
--reference-genome-dict
TEXT:FILE Excludes: --reference-genome-fasta --reference-genome-fai
Provide a reference genome sequence dictionary in .dict format. It is used to determine the chromosome order and length, without having to provide the full reference genome.
--reference-genome-fai
TEXT:FILE Excludes: --reference-genome-fasta --reference-genome-dict
Provide a reference genome sequence dictionary in .fai format. It is used to determine the chromosome order and length, without having to provide the full reference genome.

Sample Names, Groups, and Filters

--rename-samples-list
TEXT:FILE
Allows to rename samples, by providing a file that lists the old and new sample names, one per line, separating old and new names by a tab.
By default, we use sample names as provided in the input files. Some file types however do not contain sample names, such as (m)pileup or sync files (unless the non-standard sync header line is provided). For such file types, sample names are automatically assigned by using their input file base name (without path and extension), followed by a dot and numbers 1..n for all samples in that file. For instance, samples in /path/to/sample.sync are named sample.1, sample.2, etc.
Using this option, those names can be renamed as needed. Use verbose output (--verbose) to show a list of all sample names. We then use these names in the output as well as in the --filter-samples-include and --filter-samples-exclude options.
--filter-samples-include
TEXT Excludes: --filter-samples-exclude
Sample names to include (all other samples are excluded); either (1) a comma- or tab-separated list given on the command line (in a typical shell, this list has to be enclosed in quotation marks), or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option is applied after potentially renaming the samples with --rename-samples-list.
--filter-samples-exclude
TEXT Excludes: --filter-samples-include
Sample names to exclude (all other samples are included); either (1) a comma- or tab-separated list given on the command line (in a typical shell, this list has to be enclosed in quotation marks), or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option is applied after potentially renaming the samples with --rename-samples-list.
--sample-group-merge-table
TEXT:FILE
When the input contains multiple samples (either within a single input file, or by providing multiple input files), these can be merged into new samples, by summing up their nucleotide base counts at each position. This has essentially the same effect as having merged the raw fastq files or the mapped sam/bam files of the samples, that is, all reads from those samples are treated as if they were a single sample. For this grouping, the option takes a simple table file (comma- or tab-separated), with the sample names (after the above renaming, if provided) in the first column, and their assigned group names in the second column. All samples in the same group are then merged into a grouped sample, and the group names are used as the new sample names for the output. Note that the --pool-sizes option then need to contain the summed up pool sizes for each group, using the group names.

Region Filters

--filter-region
TEXT=[] ...
Genomic region to filter for, in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", or "chr:start..end". Positions are 1-based and inclusive (closed intervals). The filter keeps all listed positions, and removes all that are not listed. Multiple region options can be provided, see also --filter-region-set.
--filter-region-list
TEXT:FILE=[] ...
Genomic regions to filter for, as a file with one region per line, either in the format "chr" (for whole chromosomes), "chr:position", "chr:start-end", "chr:start..end", or tab- or space-delimited "chr position" or "chr start end". Positions are 1-based and inclusive (closed intervals). The filter keeps all listed positions, and removes all that are not listed. Multiple region options can be provided, see also --filter-region-set.
--filter-region-bed
TEXT:FILE=[] ...
Genomic regions to filter for, as a BED file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open [) interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work. The filter keeps all listed positions, and removes all that are not listed.
--filter-region-gff
TEXT:FILE=[] ...
Genomic regions to filter for, as a GFF2/GFF3/GTF file. This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed.
--filter-region-map-bim
TEXT:FILE=[] ...
Genomic positions to filter for, as a MAP or BIM file as used in PLINK. This only uses the chromosome and coordinate per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed.
--filter-region-vcf
TEXT:FILE=[] ...
Genomic positions to filter for, as a VCF/BCF file (such as a known-variants file). This only uses the chromosome and position per line, and ignores everything else in the file. The filter keeps all listed positions, and removes all that are not listed.
--filter-region-fasta
TEXT:FILE=[] ...
Genomic positions to filter for, as a FASTA-like mask file (such as used by vcftools). The file contains a sequence of integer digits [0-9], one for each position on the chromosomes, which specify if the position should be filtered out or not. Any positions with digits above the --filter-region-fasta-min value are removed. Note that this conceptually differs from a mask file, and merely uses the same format.
--filter-region-fasta-min
UINT:INT in [0 - 9]=0 Needs: --filter-region-fasta
When using --filter-region-mask-fasta, set the cutoff threshold for the filtered digits. Only positions with that value or lower will be kept. The default is 0, meaning that all positions with digits greater than 0 will be removed.
--filter-region-fasta-invert
:INT in [0 - 9] Needs: --filter-region-fasta
When using --filter-region-mask-fasta, invert the mask. This option has the same effect as the equivalent in vcftools, but instead of specifying the file, this here is a flag. When it is set, the mask specified above is inverted.
--filter-region-set
TEXT:{union,intersection}=union
It is possible to provide multiple of the above region filter options, even of different types. In that case, decide on how to combine the loci of these filters.

Masking Filters

--filter-mask-samples-bed-list
TEXT:FILE Excludes: --filter-mask-samples-fasta-list
For each sample, genomic positions to mask (mark as missing), as a set of BED files.
See the below --filter-mask-total-bed for details. Here, individual BED files can be provided for each sample, for fine-grained control over the masking. The option takes a path to a file that contains a comma- or tab-separated list of sample names and BED file paths, with one name/path pair per line, in any order of lines.
--filter-mask-samples-bed-invert
FLAG Needs: --filter-mask-samples-bed-list
When using --filter-mask-samples-bed-list, set this flag to invert the specified mask. Needs one of --reference-genome-fasta, --reference-genome-dict, --reference-genome-fai to determine chromosome lengths.
--filter-mask-samples-fasta-list
TEXT:FILE Excludes: --filter-mask-samples-bed-list
For each sample, genomic positions to mask, as a FASTA-like mask file.
See the below --filter-mask-total-fasta for details. Here, individual FASTA files can be provided for each sample, for fine-grained control over the masking. The option takes a path to a file that contains a comma- or tab-separated list of sample names and FASTA file paths, with one name/path pair per line, in any order of lines.
--filter-mask-samples-fasta-min
UINT:INT in [0 - 9]=0 Needs: --filter-mask-samples-fasta-list
When using --filter-mask-samples-fasta-list, set the cutoff threshold for the masked digits. All positions above that value are masked. The default is 0, meaning that only exactly the positons with value 0 will not be masked.
--filter-mask-samples-fasta-invert
FLAG Needs: --filter-mask-samples-fasta-list
When using --filter-mask-samples-fasta-list, invert the mask. When this flag is set, the mask specified above is inverted.
--filter-mask-total-bed
TEXT:FILE Excludes: --filter-mask-total-fasta
Genomic positions to mask (mark as missing), as a BED file.
The regions listed in the BED file are masked; this is in line with, e.g., smcpp, but is the inverse of the above usage of a BED file for selection regions, where instead the listed regions are kept. Note that this also conceptually differs from the region BED above. We here do not remove the masked positions, but instead just mark them as masked, so that they can still contribute to, e.g., denominators in the statistics for certain settings.
This only uses the chromosome, as well as start and end information per line, and ignores everything else in the file. Note that BED uses 0-based positions, and a half-open [) interval for the end position; simply using columns extracted from other file formats (such as vcf or gff) will not work.
--filter-mask-total-bed-invert
FLAG Needs: --filter-mask-total-bed
When using --filter-mask-total-bed, set this flag to invert the specified mask. Needs one of --reference-genome-fasta, --reference-genome-dict, --reference-genome-fai to determine chromosome lengths.
--filter-mask-total-fasta
TEXT:FILE Excludes: --filter-mask-total-bed
Genomic positions to mask, as a FASTA-like mask file (such as used by vcftools).
The file contains a sequence of integer digits [0-9], one for each position on the chromosomes, which specify if the position should be masked or not. Any positions with digits above the --filter-mask-total-fasta-min value are tagged as being masked. Note that this conceptually differs from the region fasta above. We here do not remove the the masked positions, but instead just mark them as masked, so that they can still contribute to, e.g., denominators in the statistics for certain settings.
--filter-mask-total-fasta-min
UINT:INT in [0 - 9]=0 Needs: --filter-mask-total-fasta
When using --filter-mask-total-fasta, set the cutoff threshold for the masked digits. All positions above that value are masked. The default is 0, meaning that only exactly the positons with value 0 will not be masked.
--filter-mask-total-fasta-invert
FLAG Needs: --filter-mask-total-fasta
When using --filter-mask-total-fasta, invert the mask. This option has the same effect as the equivalent in vcftools, but instead of specifying the file, this here is a flag. When it is set, the mask specified above is inverted.

Numerical Filters

--filter-sample-min-count
UINT=0
Minimum base count for a nucleotide (in ACGT) to be considered as an allele. Counts below that are set to zero, and hence ignored as an allele/variant. For example, singleton read sequencing errors can be filtered out this way.
--filter-sample-max-count
UINT=0
Maximum base count for a nucleotide (in ACGT) to be considered as an allele. Counts above that are set to zero, and hence ignored as an allele/variant. For example, spuriously high read counts can be filtered out this way.
--filter-sample-deletions-limit
FLAG
Maximum number of deletions at a position before being filtered out. If this is set to a value greater than 0, and the number of deletions at the position is equal to or greater than this value, the sample is filtered out.
--filter-sample-min-read-depth
UINT=0
Minimum read depth expected for a position in a sample to be considered covered. If the sum of nucleotide counts (in ACGT) at a given position in a sample is less than the provided value, the sample is ignored at this position.
--filter-sample-max-read-depth
UINT=0
Maximum read depth expected for a position in a sample to be considered covered. If the sum of nucleotide counts (in ACGT) at a given position in a sample is greater than the provided value, the sample is ignored at this position. This can for example be used to filter spuriously high read depth positions.
--filter-total-min-read-depth
UINT=0
Minimum read depth expected for a position in total to be considered covered. If the sum of nucleotide counts (in ACGT) at a given position in total (across all samples) is less than the provided value, the position is ignored.
--filter-total-max-read-depth
UINT=0
Maximum read depth expected for a position in total to be considered covered. If the sum of nucleotide counts (in ACGT) at a given position in total (across all samples) is greater than the provided value, the position is ignored. This can for example be used to filter spuriously high read depth positions.
--filter-total-deletions-limit
FLAG
Maximum number of deletions at a position before being filtered out, summed across all samples. If this is set to a value greater than 0, and the number of deletions at the position is equal to or greater than this value, the position is filtered out.
--filter-total-only-biallelic-snps
FLAG
Filter out any positions that do not have exactly two alleles across all samples. That is, after applying all previous filters, if not exactly two counts (in ACGT) are non-zero in total across all samples, the position is not considered a biallelic SNP, and ignored.
By default, we already filter out invariant sites, so that only SNPs remain. With this option however, this is further reduced to only biallelic (pure) SNPs. Note that with --method karlsson, we implicitly also activate to filter for biallelic SNPs only, as the method requires it.
--filter-total-snp-min-count
UINT=0
When filtering for positions that are SNPs, use this minimum count (summed across all samples) to identify what is considered a SNP. Positions where the counts are below this are filtered out.
--filter-total-snp-max-count
UINT=0
When filtering for positions that are SNPs, use this maximum count (summed across all samples) to identify what is considered a SNP. Positions where the counts are above this are filtered out; probably not relevant in practice, but offered for completeness.
--filter-total-snp-min-frequency
FLOAT=0
Minimum allele frequency that needs to be reached for a position to be used. Positions where the allele frequency af across all samples, or 1 - af, is below this value, are ignored. If both the reference and alternative base are known, allele frequencies are computed based on those; if only the reference base is known, the most frequent non-reference base is used as the alternative; if neither is known, the first and second most frequent bases are used to compute the frequency.

Settings

--method
TEXT:{unbiased-nei,unbiased-hudson}=unbiased-nei
Required. FST method to use for the computation: The unbiased pool-sequencing statistic in two variants, following the definition of Nei, and the definition of Hudson.
--pool-sizes
TEXT
Required. Pool sizes for all samples that are used (not filtered out). These are the number of haploids, so 100 diploid individuals correspond to a pool size of 200. Either
(1) a single pool size that is used for all samples, specified on the command line, or
(2) a path to a file that contains a comma- or tab-separated list of sample names and pool sizes, with one name/size pair per line, in any order of lines.
--comparand
TEXT Excludes: --comparand-list
By default, statistics between all pairs of samples (that are not filtered) are computed. If this option is given a sample name however, only the pairwise statistics between that sample and all others (that are not filtered) are computed.
--second-comparand
TEXT Needs: --comparand Excludes: --comparand-list
If in addition to --comparand, this option is also given a (second) sample name, only the statistics between those two samples are computed.
--comparand-list
TEXT:FILE Excludes: --comparand --second-comparand
By default, statistics between all pairs of samples are computed. If this option is given a file containing comma- or tab-separated pairs of sample names (one pair per line) however, only these pairwise statistics are computed.
--cathedral-width
UINT=1500
Width of the plot, in pixels. In particular, this determines the resolution of the last row of the plot, where each pixel corresponds to a window of the size genome length / plot width.
--cathedral-height
UINT=500
Height of the plot, in pixels. This determines the number of different window sizes displayed in the plot, from whole chromosome at the top to finest resolution at the bottom.

Output

--out-dir
TEXT=.
Directory to write files to
--file-prefix
TEXT
File prefix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
--file-suffix
TEXT
File suffix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.

Global Options

--allow-file-overwriting
FLAG
Allow to overwrite existing output files instead of aborting the command. By default, we abort if any output file already exists, to avoid overwriting by mistake.
--verbose
FLAG
Produce more verbose output.
--threads
UINT
Number of threads to use for calculations. If not set, we guess a reasonable number of threads, by looking at the environmental variables (1) OMP_NUM_THREADS (OpenMP) and (2) SLURM_CPUS_PER_TASK (slurm), as well as (3) the hardware concurrency (number of CPU cores), taking hyperthreads into account, in the given order of precedence.
--log-file
TEXT
Write all output to a log file, in addition to standard output to the terminal.

Citation

When using this method, please do not forget to cite

Lucas Czech, Jeffrey Spence, Moises Exposito-Alonso. grenedalf: population genetic statistics for the next generation of pool sequencing. Bioinformatics, vol. 40, no. 8, 2024. doi:10.1093/bioinformatics/btae508