Description: perform pooled-sample core-genome SNP calling. For each species, this script will pool data across multiple samples. Based on this pooled data, it will identify core genomic sites present in most samples. Next it will determine if any of these core-sites are bi-allelic SNPs (i.e. there are two alleles at >5% frequency). Take a close look at the options; there are different ways of doing this step.
The pipeline can be broken down into the following steps:
- Take MIDAS output files from multiple samples
- Identify species to process (based on user criterea, e.g. min # of samples)
- Scan across the representative genome of each species
- Pool nucleotide variants from all metagenomic samples
- Determine if genomic site is in the core-genome (e.g. non-zero depth in >95% of samples)
- If yes, identify the major and minor alleles among the pooled reads
- Determine if there is a SNP present (e.g. minor allele frequency >1%)
- Annotate genomic site (e.g. CDS/RNA, gene identifier, SYN/NS, etc.)
- Write core-genome SNPs to matrix files
Usage: merge_midas.py snps outdir [options]
positional arguments:
outdir Directory for output files. a subdirectory will be created for each species_id
Input/Output:
-i INPUT Input to sample directories output by run_midas.py; see '-t' for details
-t INPUT_TYPE Specify one of the following:
list: -i is a comma-separated list (ex: /samples/sample_1,/samples/sample_2)
dir: -i is a directory containing all samples (ex: /samples)
file: -i is a file of paths to samples (ex: /sample_paths.txt)
-d DB Path to reference database
By default, the MIDAS_DB environmental variable is used
Presets:
--core_snps Same as: --snp_type bi --site_depth 1 --site_ratio 2.0 --site_prev 0.95 (default)
--core_sites Same as: --snp_type any --site_depth 1 --site_ratio 2.0 --site_prev 0.95
--all_snps Same as: --snp_type bi --site_prev 0.0
--all_sites Same as: --snp_type any --site_prev 0.0
Species filters (select subset of species from INPUT):
--min_samples INT All species with >= MIN_SAMPLES (1)
--species_id CHAR Comma-separated list of species ids
--max_species INT Maximum number of species to call SNPs for (all with >= 1 sample)
Sample filters (select subset of samples from INPUT):
--sample_depth FLOAT Minimum average read depth per sample (5.0)
--fract_cov FLOAT Fraction of reference sites covered by at least 1 read (0.4)
--max_samples INT Maximum number of samples to process. useful for quick tests (use all)
--all_samples Include all samples in output
Site filters (select subset of genomic sites from INPUT):
--snp_type [ ...] Specify one or more of the following:
mono: keep sites with 1 allele > ALLELE_FREQ
bi: keep sites with 2 alleles > ALLELE_FREQ (default)
tri: keep sites with 3 alleles > ALLELE_FREQ
quad: keep sites with 4 alleles > ALLELE_FREQ
any: keep sites regardless of observed alleles
--allele_freq FLOAT Minimum frequency for calling an allele present (0.01)
Values > 0.0 and < 0.5 are accepted.
Ex: --snp_type=bi --allele_freq=0.01 keeps bi-allelic SNPs with a minimum frequency of 1%
--site_depth INT Minimum number of reads mapped to genomic site (1)
Used in combination with --site_prev to determine if site is in core-genome
--site_ratio FLOAT Maximum ratio of site depth to genome depth (2.0)
This filter helps to eliminate genomic sites with abnormally high read depth
--site_prev FLOAT Minimum fraction of sample where genomic site is >= SITE_DEPTH and <= SITE_RATIO (0.95)
A high value selects for sites in the core-genome (i.e. present in nearly all strains).
A low value includes sites in variable regions and/or with abnormally high read depth
--max_sites INT Maximum number of sites to include in output (use all). Useful for quick tests
-
Call core-genome SNPs all species using defaults:
merge_midas.py snps /path/to/outdir -i sample_1,sample_2 -t list
-
Call core-genome SNPs for one species:
merge_midas.py snps /path/to/outdir --species_id Bacteroides_vulgatus_57955 -i sample_1,sample_2 -t list
-
Include all core-genome sites in output files (useful for quantifying SNP density or comparing SNP levels between species): merge_midas.py snps /path/to/outdir -i /path/to/samples -t dir --core-sites
-
Include ALL sites in output files, regardless of read-depth or variation: merge_midas.py snps /path/to/outdir -i /path/to/samples -t dir --core-sites
-
Only use samples with >15x average depth and only use sites covered by >=10 reads in at least >=95% of samples:
merge_midas.py snps /path/to/outdir -i /path/to/samples -t dir --sample_depth 15 --site_depth 10 --site_prev 0.95
-
Run a quick test:
merge_midas.py snps /path/to/outdir -i /path/to/samples -t dir --max_species 1 --max_samples 10 --max_sites 1000
- snps_freq.txt: frequency of minor allele per genomic site and per sample; a value of 1.0 indicates that all reads matched the minor allele for site-sample; the major (most common) and minor allele (2nd most common) are determined from pooled reads across ALL samples; see:
snps_info.txt
for details on the major, minor, and reference alleles - snps_depth.txt: number of reads mapped to genomic site per sample; only accounts for reads matching either major or minor allele
- snps_info.txt:; metadata for genomic site; see below for more information
- snps_summary.txt: alignment summary statistics per sample; see below for more information
- snps_log.txt: log file containing parameters used
snps_freq.txt and snps_depth.txt: tab-delimited matrix files; field names are sample ids; row names are genome site ids; see: snps_info.txt
for details on each genomic site
snps_info.txt
- site_id: incrementing integer field
- ref_id: identifier of scaffold in representative genome
- ref_pos: position of site on scaffold
- ref_allele: allele in reference genome
- major_allele: most common allele in metagenomes
- minor_allele: second most common allele in metagenomes
- count_samples: number of metagenomes where site was found
- count_a: count of A allele in pooled metagenomes
- count_c: count of C allele in pooled metagenomes
- count_g: count of G allele in pooled metagenomes
- count_t: count of T allele in pooled metagenomes
- locus_type: CDS (site in coding gene), RNA (site in non-coding gene), IGR (site in intergenic region)
- gene_id: gene identified if locus type is CDS, or RNA
- snp_type: indicates the number of alleles observed at site (mono,bi,tri,quad); observed allele are determined by
--snp_maf
flag - site_type: indicates degeneracy: 1D, 2D, 3D, 4D
- amino_acids: amino acids encoded by 4 possible alleles
snps_summary.txt
- sample_id: sample identifier
- genome_length: number of base pairs in representative genome
- covered_bases: number of reference sites with at least 1 mapped read
- fraction_covered: proportion of reference sites with at least 1 mapped read
- mean_coverage: average read-depth across reference sites with at least 1 mapped read
- aligned_reads: number of reads that aligned BEFORE quality filtering
- mapped_reads: number of reads that aligned AFTER quality filtering
- This step takes an insignificant amount of memory