TORCH-Consortium · abhi18av · Jun 6, 2025 · May 8, 2025 · May 8, 2025 · May 14, 2025
diff --git a/README.md b/README.md
@@ -202,9 +202,11 @@ The `conda` environments are expected by the `conda_local` profile of the pipeli
 
 ```sh
 $ conda env create -n magma-env-1 --file magma-env-1.yml
-
 $ conda env create -n magma-env-2 --file magma-env-2.yml
+$ conda env create -n magma-tbprofiler-env --file magma-tbprofiler-env.yml
+$ conda env create -n magma-ntmprofiler-env --file magma-ntmprofiler-env.yml
 ```
+Alternatively, you can run `bash ./conda_envs/setup_conda_envs.sh` to build all the necessary conda environments.
 
 Once the environments are created, you can make use of the pipeline parameter `conda_envs_location` to inform the pipeline of the names and location of the conda envs.
 
@@ -248,7 +250,7 @@ Success, would look like this
 
 We provide [two docker containers](https://github.com/orgs/TORCH-Consortium/packages?repo_name=MAGMA) with the pipeline so that you could just download and run the pipeline with them. There is **NO** need to create any docker containers, just download and enable the `docker` profile.
 
-> 🚧 **Container build script**: The script used to build these containers is provided [here](./containers/build.sh).
+> 🚧 **Container build script**: The script used to build these containers is provided [here](https://github.com/TORCH-Consortium/MAGMA/tree/master/containers).
 
 Although, you don't need to pull the containers manually, but should you need to, you could use the following commands to pull the pre-built and provided containers
 

diff --git a/conda_envs/setup_conda_envs.sh b/conda_envs/setup_conda_envs.sh
@@ -12,10 +12,12 @@ resolverCondaBinary="conda" # pick either conda OR mamba
 
 $resolverCondaBinary env create -p magma-env-1 --file magma-env-1.yml 
 
-$resolverCondaBinary env create -p magma-env-1 --file magma-env-1.yml 
+$resolverCondaBinary env create -p magma-env-2 --file magma-env-2.yml 
 
 $resolverCondaBinary env create -p magma-ntmprofiler-env --file magma-ntmprofiler-env.yml
 
+$resolverCondaBinary env create -p magma-tbprofiler-env --file magma-tbprofiler-env.yml
+
 #===========================================================
 
 #NOTE: Setup the tbprofiler env with WHO v2 Database

diff --git a/docs/.gitignore b/docs/.gitignore
@@ -0,0 +1 @@
+/.quarto/
diff --git a/docs/_quarto.yml b/docs/_quarto.yml
@@ -0,0 +1,23 @@
+project:
+  type: website
+
+website:
+  title: "MAGMA"
+
+  sidebar:
+      style: "docked"
+      search: true
+      contents:
+        - text: "Usage"
+          href: usage.qmd
+        - text: "Customizable Parameters"
+          href: customizable-parameters.qmd
+        - text: "Output"
+          href: output.qmd
+
+format:
+  html:
+    theme: cosmo
+    css: styles.css
+    toc: true
+
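A minimal sketch, not part of this change, of how the sidebar entries above could be checked against the files actually present in `docs/`. The helper name `missing_hrefs` is hypothetical, and the `sed` pattern assumes every `href:` value is an unquoted relative path on its own line, as in the `_quarto.yml` above.

```shell
# Print every sidebar href in <docs_dir>/_quarto.yml that has no matching file.
# Assumption: hrefs are unquoted relative paths, one per "href:" line.
missing_hrefs() {
  docs_dir="$1"
  # Strip the leading "href:" key and keep only the path that follows it.
  sed -n 's/^[[:space:]]*href:[[:space:]]*//p' "$docs_dir/_quarto.yml" |
  while IFS= read -r href; do
    # Report the href only when the referenced file is absent.
    [ -f "$docs_dir/$href" ] || printf '%s\n' "$href"
  done
}
```

Running `missing_hrefs docs` from the repository root would print nothing when every sidebar target exists.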
diff --git a/docs/cloud_batch_execution.qmd b/docs/cloud_batch_execution.qmd
diff --git a/docs/customizable-parameters.qmd b/docs/customizable-parameters.qmd
@@ -0,0 +1,88 @@
+# MAGMA Customizable Parameters
+
+
+This document provides an overview of the customizable parameters for the MAGMA pipeline. Each parameter is listed with its default value and a description.
+
+> 💡 **Hint**: For the complete list, see the full parameters [reference file](https://github.com/TORCH-Consortium/MAGMA/blob/master/params/params.yaml).
+
+---
+
+## Common Parameters
+
+### Input Samplesheet
+| Parameter             | Default Value              | Description                                                                                     |
+|-----------------------|----------------------------|-------------------------------------------------------------------------------------------------|
+| `input_samplesheet`   | `"samplesheet.magma.csv"`  | The input CSV file containing sample information. The study ID cannot start with `XBS_REF_`.   |
+
+> 💡 **Hint**: The samplesheet should include the fields `[Sample, R1, R2]`. Optionally, you can add `[study, library, attempt, flowcell, lane, index_sequence]`.
+
+---
+
+### Output Directory
+| Parameter   | Default Value         | Description                                                                 |
+|-------------|-----------------------|-----------------------------------------------------------------------------|
+| `outdir`    | `"magma-results"`     | The directory where all output files will be written.                      |
+| `vcf_name`  | `"joint"`             | The name of the output folder for results. Used to derive `JOINT_NAME`.    |
+
+> 💡 **Note**: The `vcf_name` parameter is critical for naming conventions in downstream processes.
+
+---
+
+## Including Additional Samples
+
+| Parameter         | Default Value                                                                 | Description                                                                                     |
+|-------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
+| `use_ref_gvcf`    | `true`                                                                      | Whether to use a reference GVCF file to include additional samples.                            |
+| `ref_gvcf`        | `"${projectDir}/resources/ref_gvcfs/LineagesAndOutgroupV2.g.vcf.gz"`         | Path to the reference GVCF file.                                                              |
+| `ref_gvcf_tbi`    | `"${projectDir}/resources/ref_gvcfs/LineagesAndOutgroupV2.g.vcf.gz.tbi"`     | Path to the index file for the reference GVCF.                                                |
+
+> 💡 **Hint**: Use this feature if your dataset has low genetic diversity (e.g., clonal or fewer than 20 samples).
+
+---
+
+## Quality Control Parameters
+
+| Parameter                     | Default Value | Description                                                                                     |
+|-------------------------------|---------------|-------------------------------------------------------------------------------------------------|
+| `cutoff_median_coverage`      | `10`          | The minimal median coverage required to process the sample.                                    |
+| `cutoff_breadth_of_coverage`  | `0.90`        | The minimal breadth of coverage required to process the sample.                                |
+| `cutoff_rel_abundance`        | `0.70`        | The minimal relative abundance of the majority strain required to process the sample.          |
+| `cutoff_ntm_fraction`         | `0.20`        | The maximum fraction of NTM DNA allowed to process the sample.                                 |
+
+> ⚠️ **Attention**: Ensure these values are adjusted based on the quality of your input data to avoid processing errors.
+
+---
+
+## Skipping Pipeline Steps
+
+| Parameter                        | Default Value | Description                                                                                     |
+|----------------------------------|---------------|-------------------------------------------------------------------------------------------------|
+| `only_validate_fastqs`           | `false`       | Set to `true` to only validate input FASTQs and check their FASTQC reports.                    |
+| `skip_merge_analysis`            | `false`       | Skip the final merge analysis step.                                                            |
+| `skip_variant_recalibration`     | `false`       | Skip variant quality score recalibration (VQSR).                                               |
+| `skip_base_recalibration`        | `true`        | Skip base quality score recalibration (BQSR); BQSR is not suitable for low-coverage Mtb genomes. |
+| `skip_minor_variants_gatk`       | `true`        | Skip minor variants detection with GATK. LoFreq is recommended for most purposes.              |
+| `skip_phylogeny_and_clustering`  | `false`       | Disable downstream phylogenetic analysis of merged GVCF.                                       |
+| `skip_complex_regions`           | `false`       | Disable downstream complex region analysis of merged GVCF.                                     |
+| `skip_ntmprofiler`               | `false`       | Disable execution of `ntmprofiler` on FASTQ files.                                             |
+| `skip_tbprofiler_fastq`          | `true`        | Disable `tbprofiler` analysis on FASTQ files.                                                  |
+| `skip_spotyping`                 | `false`       | Disable spoligotyping analysis.                                                                |
+
+> 💡 **Hint**: Use these flags to customize the pipeline execution based on your specific requirements.
+
+---
+
+## Reference Files
+
+| Parameter               | Default Value                                              | Description                                                                                     |
+|-------------------------|------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
+| `ref_fasta_basename`    | `"NC-000962-3-H37Rv"`                                      | Basename of the reference FASTA file.                                                          |
+| `ref_fasta_dir`         | `"${projectDir}/resources/genome"`                         | Directory containing the reference FASTA file.                                                 |
+| `ref_fasta`             | `"${params.ref_fasta_dir}/${params.ref_fasta_basename}.fa"`| Full path to the reference FASTA file.                                                         |
+| `ref_fasta_dict`        | `"${params.ref_fasta_dir}/${params.ref_fasta_basename}.dict"`| Path to the reference FASTA dictionary file.                                                  |
+| `ref_fasta_gb`          | `"${params.ref_fasta_dir}/${params.ref_fasta_basename}.gb"`| Path to the reference GenBank file.                                                            |
+
+> ⚠️ **Warning**: It is recommended to use the provided reference files to ensure compatibility with the pipeline.
+
+---
+
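A minimal sketch, not taken from the pipeline itself, of how the four quality control cutoffs documented above combine into a single pass/fail decision for a sample. The function name `qc_verdict`, the `CUTOFF_*` environment variables and the inclusive comparisons (`>=` and `<=`) are assumptions made for illustration; only the default values come from the table.

```shell
# Decide whether a sample passes the documented QC cutoffs.
# Arguments: median_coverage breadth_of_coverage rel_abundance ntm_fraction
# Assumption: boundaries are inclusive; MAGMA's own checks may differ.
qc_verdict() {
  awk -v cov="$1" -v breadth="$2" -v rel="$3" -v ntm="$4" \
      -v min_cov="${CUTOFF_MEDIAN_COVERAGE:-10}" \
      -v min_breadth="${CUTOFF_BREADTH_OF_COVERAGE:-0.90}" \
      -v min_rel="${CUTOFF_REL_ABUNDANCE:-0.70}" \
      -v max_ntm="${CUTOFF_NTM_FRACTION:-0.20}" 'BEGIN {
    # A sample must clear all three minimums and stay under the NTM maximum.
    ok = (cov >= min_cov) && (breadth >= min_breadth) && (rel >= min_rel) && (ntm <= max_ntm)
    print (ok ? "pass" : "fail")
  }'
}
```

For example, `qc_verdict 35 0.98 0.95 0.01` describes a sample that clears every default cutoff, while lowering the first argument to `8` falls below `cutoff_median_coverage`.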
diff --git a/docs/hpc_execution.qmd b/docs/hpc_execution.qmd
diff --git a/docs/output.qmd b/docs/output.qmd
@@ -0,0 +1,169 @@
+---
+title: "MAGMA"
+---
+
+Here we briefly introduce the main outputs of a MAGMA pipeline run. Please note that some outputs are optional and are only generated when specific parameters are set.
+
+- [Interpretation](#interpretation)
+
+# Tutorials and Presentations
+
+Tim Huepink and Lennert Verboven created an in-depth tutorial of the features of the variant calling in MAGMA:
+
+[![Video](https://img.youtube.com/vi/Kic2ItrJHj0/maxresdefault.jpg)](https://www.youtube.com/watch?v=Kic2ItrJHj0)
+
+
+We have also included a presentation (in PDF format) of the logic and workflow of the MAGMA pipeline, as well as posters that have been presented at conferences. Please refer to the [docs](./docs) folder.
+
+# Interpretation
+
+The results directory produced by MAGMA is as follows:
+
+```bash
+/path/to/results_dir/
+.
+├── QC_statistics
+├── analyses
+└── vcf_files
+```
+
+## QC Statistics Directory
+
+In this directory you will find files related to the quality control carried out by the MAGMA pipeline. The structure is as follows:
+
+```bash
+/path/to/results_dir/QC_statistics
+├── cohort
+│   ├── fastq_validation
+│   └── multiqc
+│       └── multiqc_data
+└── per_sample
+    ├── coverage
+    ├── fastqc
+    └── mapping
+
+```
+
+- **cohort**
+
+Here you will find `joint.merged_cohort_stats.tsv`, which contains the QC statistics for all samples in the samplesheet and allows users to determine why certain samples were not incorporated in the cohort analysis steps.
+
+In addition, you'll find the cohort-level MultiQC report generated from the `per_sample/fastqc` analysis and the FASTQ validation report in `json` format.
+
+- **per_sample/coverage**
+
+Contains the GATK WGSMetrics outputs for each of the samples in the samplesheet
+
+- **per_sample/mapping**
+
+> Contains the FlagStat and samtools stats for each of the samples in the samplesheet
+
+## Analysis Directory
+
+```bash
+/path/to/results_dir/analysis
+├── cluster_analysis
+├── drug_resistance
+├── non-tuberculous_mycobacteria
+├── phylogeny
+├── spotyping
+└── snp_distances
+```
+
+- **Cluster Analysis**
+
+> Contains files related to clustering based on 5-SNP and 12-SNP cutoffs, both including and excluding complex regions.
+> The **.figtree files** can be imported directly into FigTree for visualisation.
+
+- **Drug Resistance**
+
+Organised based on the different types of variants as well as combined results:
+
+```bash
+/path/to/results_dir/analysis/drug_resistance
+├── combined_resistance_summaries
+├── combined_resistance_summaries_mixed_infection_samples
+├── major_variants_xbs
+├── minor_variants_lofreq
+├── structural_variants_delly
+└── tbprofiler_fastq
+```
+
+Each of the directories containing results for the different variant types (major | minor | structural) has text files that can be used to annotate the .treefiles produced by MAGMA in iTOL (https://itol.embl.de).
+
+The combined resistance results file contains a per-sample drug resistance summary based on the WHO Catalogue of *Mtb* mutations (https://www.who.int/publications/i/item/9789240082410).
+
+MAGMA also notes the presence of all variants in tier 1 and tier 2 drug resistance genes.
+
+MAGMA also generates mixed infection reports and can optionally run tbprofiler on the FASTQ files for comparison purposes.
+
+- **Non-Tuberculous Mycobacteria (NTM)**
+
+Contains a brief report of NTM presence in the submitted samples, organised in a cohort and per_sample structure.
+
+- **Phylogeny**
+
+Contains the outputs of the IQTree phylogenetic tree construction.
+
+> :memo: By default we recommend that you use the **ExDRIncComplex** files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the *Mtb* genome
+
+- **SNP distances**
+
+Contains the SNP distance tables in tsv format.
+
+> :memo: By default we recommend that you use the **ExDRIncComplex** files as MAGMA was optimized to be able to accurately call positions on the edges of complex regions in the *Mtb* genome
+
+
+- **Spotyping**
+
+Contains a spoligotyping pattern prediction using [SpoTyping](https://github.com/xiaeryu/SpoTyping-v2.0/tree/master/SpoTyping-v2.0-commandLine).
+
+
+## `vcf_files` Directory
+
+```bash
+/path/to/results_dir/vcf_files
+├── cohort
+│   ├── combined_variant_files
+│   ├── minor_variants
+│   ├── multiple_alignment_files
+│   ├── raw_variant_files
+│   ├── snp_variant_files
+│   └── structural_variants
+└── per_sample
+    ├── minor_variants
+    ├── raw_variant_files
+    └── structural_variants
+```
+
+- **Combined variant files**
+
+> Contains the cohort gvcfs based on major variants detected by the MAGMA pipeline
+
+- **Minor variants**
+
+> Merged vcfs of all samples, generated by LoFreq
+
+- **Multiple alignment files**
+
+> FASTA files for the generation of phylogenetic trees by IQTree
+
+- **Raw variant files**
+
+> Unfiltered indels and SNPs detected by the MAGMA pipeline
+
+- **SNP variant files**
+
+> Filtered SNPs detected by the MAGMA pipeline
+
+- **Structural variant files**
+
+> Unfiltered structural variants detected by the MAGMA pipeline
+
+## Libraries Directory
+
+> Contains files related to FASTQ validation and FASTQC analysis
+
+## Samples Directory
+
+> Contains vcf files for major|minor|structural variants for each individual sample
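A minimal sketch, assuming a finished run, of how the top-level layout described in `output.qmd` could be checked after the pipeline completes. The helper name `missing_outputs` is hypothetical; the directory names are passed as arguments rather than hard-coded, since some outputs are optional.

```shell
# Print each expected subdirectory that is absent from a results directory.
# Usage: missing_outputs <results_dir> <name> [<name> ...]
missing_outputs() {
  results_dir="$1"
  shift
  for name in "$@"; do
    # Report the name only when the directory does not exist.
    [ -d "$results_dir/$name" ] || printf '%s\n' "$name"
  done
}
```

With the top-level names from the results tree, this would be called as `missing_outputs /path/to/results_dir QC_statistics analyses vcf_files`, printing nothing when all three are present.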
diff --git a/docs/styles.css b/docs/styles.css
@@ -0,0 +1 @@
+/* css styles */