Pipeline Tutorial
This page walks through a complete first-time setup on a fresh Linux machine, from installing prerequisites through running the NosoGraph bacterial-assembly pipeline on a real long-read sample and producing the per-sample kg/ knowledge-graph CSVs.
Tested on: Ubuntu 22.04
Nextflow requires Java 17 or later. Installing it via SDKMAN! is recommended.
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java
java -version
Nextflow is a single self-contained launcher script. No root required.
curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p "$HOME/.local/bin"
mv nextflow "$HOME/.local/bin/"
nextflow -version
# Expected: nextflow version 26.x.x ...
Minimum version: NosoGraph requires Nextflow ≥ 23.04. If nextflow -version shows an older release, run nextflow self-update.
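The minimum-version check can also be scripted. The following is a minimal sketch, not part of NosoGraph: the function names version_ge and check_nextflow_version are hypothetical, the comparison relies on GNU sort -V, and the parsing assumes that nextflow -version prints a line of the form "version X.Y.Z build N".

```shell
# Return 0 when dotted version $1 is >= dotted version $2.
# Relies on `sort -V` (version sort, available in GNU coreutils).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: `nextflow -version` prints a line such as "version 23.10.1 build 5891".
check_nextflow_version() {
  local min="${1:-23.04}" installed
  installed="$(nextflow -version 2>/dev/null | awk '$1 == "version" { print $2; exit }')"
  if [ -n "$installed" ] && version_ge "$installed" "$min"; then
    echo "Nextflow $installed satisfies >= $min"
  else
    echo "Nextflow ${installed:-not found} is below $min; try: nextflow self-update" >&2
    return 1
  fi
}
```

Calling check_nextflow_version with no argument compares against 23.04; pass another version string to test a different floor.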
The pipeline resolves all tool environments automatically via conda/micromamba. Micromamba is lighter than full Anaconda or Miniconda.
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
source ~/.bashrc # or ~/.zshrc
micromamba --version
Alternative managers: plain conda and mamba can also be used if preferred. Conda is enabled in nextflow.config; environments are created on first run, so no manual micromamba create is needed.
git clone https://github.com/STTLab/NosoGraph.git
cd NosoGraph
The repository keeps the assembly/polish/QC module under modules/vendor/bacterial-assembly/, which owns its own conda environment. Additional environments live under conda/.
Before running real data, confirm the workflow DAG compiles end-to-end. The test profile disables conda so this runs in seconds with no tools installed, and input paths are not checked for existence:
nextflow run main.nf -stub-run -profile test \
--sample_id BAC_S001 \
--assembler canu \
--tech nanopore \
--genome_size 5.5m \
--long_reads dummy.fastq.gz \
--read1 dummy_R1.fastq.gz \
--read2 dummy_R2.fastq.gz \
--racon_iter 2 \
--pilon_iter 2 \
--checkm2_db dummy.dmnd \
--outdir /tmp/nf_test
Expected:
[PROCESS] BACTERIAL_ASSEMBLY:ASSEMBLY_CANU (1)
[PROCESS] BACTERIAL_ASSEMBLY:RACON_POLISH (1)
[PROCESS] BACTERIAL_ASSEMBLY:PILON_POLISH (1)
[PROCESS] BACTERIAL_ASSEMBLY:CHECKM2 (1)
[PROCESS] KG_EXPORT (BAC_S001)
[SUCCESS] completed=5 failed=0 cached=0
Both --assembler canu and --assembler flye wire correctly. If you only want to confirm wiring, you can stop here.
CheckM2 estimates assembly completeness and contamination. It needs the UniRef100/KO DIAMOND database. CheckM2 runs only when --pilon_iter > 0; skip this section if you are doing a long-read-only assembly without QC.
# Install CheckM2 into a throwaway env to get its downloader
micromamba create -n checkm2 -c bioconda -c conda-forge checkm2
micromamba run -n checkm2 checkm2 database --download --path /data/checkm2_db
After download you will have:
/data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd
Pass that file path to --checkm2_db.
path/to/data/reads/
├── sample_1_long.fastq.gz # Oxford Nanopore long reads
├── sample_1_R1.fastq.gz # Illumina short reads R1 (for hybrid polishing)
└── sample_1_R2.fastq.gz # Illumina short reads R2
Use a sample_id that matches the Samples.csv you will load into the graph — here BAC_S001, which the bundled example/csv/Samples.csv already links to specimen SP003. The sample_id both scopes outputs to <outdir>/BAC_S001/ and namespaces contig IDs as BAC_S001:contig_1, BAC_S001:contig_2, … so they stay globally unique across samples.
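To make the namespacing concrete, here is a small illustration of how a sample_id prefix turns per-assembly contig names into globally unique IDs. It is a sketch only: namespace_contigs is a hypothetical helper, and the pipeline's own implementation may differ.

```shell
# Hypothetical helper (not part of NosoGraph): prefix every FASTA header
# with "<sample_id>:" so contig IDs stay unique across samples.
# Usage: namespace_contigs BAC_S001 < assembly.fasta
namespace_contigs() {
  awk -v sid="$1" '
    /^>/ { print ">" sid ":" substr($0, 2); next }  # header line: add the prefix
    { print }                                        # sequence line: unchanged
  '
}
```

Two samples that both contain a contig_1 then yield BAC_S001:contig_1 and, for example, BAC_S002:contig_1, which can coexist in one graph.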
Highest-quality consensus: Flye assembly, Racon long-read polishing, then Pilon short-read polishing, then CheckM2.
nextflow run main.nf \
--sample_id BAC_S001 \
--long_reads path/to/data/reads/sample_1_long.fastq.gz \
--read1 path/to/data/reads/sample_1_R1.fastq.gz \
--read2 path/to/data/reads/sample_1_R2.fastq.gz \
--assembler flye \
--tech nanopore \
--outdir results \
--threads 16 \
--racon_iter 2 \
--pilon_iter 2 \
--checkm2_db /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd
On the first run, Nextflow resolves and builds the bacterial-assembly conda environment (~10–15 min); subsequent runs reuse it.
No short reads, so Pilon and CheckM2 are skipped (--pilon_iter 0). Canu requires a genome-size estimate:
nextflow run main.nf \
--sample_id sample_1 \
--long_reads path/to/data/reads/sample_1_long.fastq.gz \
--assembler canu \
--tech nanopore \
--genome_size 5.5m \
--pilon_iter 0 \
--outdir results \
--threads 32
A separate pipeline (--pipeline metagenomics) classifies long reads against a pre-built Kraken2 database and exports a high-level, pathogen-ID knowledge graph in a single Nextflow run. The vendored kraken2-classify module (modules/vendor/kraken2-classify/) produces the Kraken2 report; META_KG_EXPORT turns it into the kg/ CSVs.
nextflow run main.nf \
--pipeline metagenomics \
--sample_id sample_1 \
--long_reads path/to/data/reads/sample_1_long.fastq.gz \
--kraken2_db /data/k2_standard \
--outdir results \
--threads 16
--kraken2_db must point at a Kraken2 DB directory containing hash.k2d, opts.k2d, and taxo.k2d. Kraken2 loads the whole DB into RAM, so size the memory request to the DB (--kraken2_mem, default 64 GB; raise it for the full Standard DB).
Outputs land in results/META_S001/: kraken2/sample_1.kraken2.report.txt plus kg/ (taxonomic_classification.csv, meta_reads.csv). Only species (rank S) and genus (rank G) rows are kept, and they are carried as a taxa_json QC blob on the TaxonomicClassification node rather than as Organism nodes, because Kraken2 output is an untrusted per-run classification and is not meant to be traversed in the graph.
The blob is sorted by abundance and pre-filtered by an adaptive z-score bucket: low-abundance taxa (z below --kraken2_z_min, default -1.0) fold into a single "Other" row; filtering is skipped when there are fewer than --kraken2_min_taxa (default 3) taxa. See the Knowledge graph Tutorial for loading the taxonomic-classification subgraph (LOAD DATA steps 17–18 and the Pathogens detected per sample query).
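The bucketing rule can be sketched in a few lines of shell and awk. This is an illustration under stated assumptions, not the pipeline's code: bucket_taxa is a hypothetical name, the input is assumed to be tab-separated taxon and abundance pairs, and the z-score is assumed to be (abundance - mean) / population standard deviation on raw abundances. The statistic META_KG_EXPORT actually uses may differ.

```shell
# Hypothetical sketch (not the pipeline's code): fold low-abundance taxa into "Other".
# stdin: tab-separated "taxon<TAB>abundance" rows.
# $1 = z-score cutoff (mirrors --kraken2_z_min), $2 = minimum taxa (mirrors --kraken2_min_taxa).
# Assumption: z = (abundance - mean) / population SD on raw abundances.
bucket_taxa() {
  sort -t "$(printf '\t')" -k2,2nr |
  awk -F '\t' -v zmin="${1:--1.0}" -v mintaxa="${2:-3}" '
    { name[NR] = $1; val[NR] = $2; sum += $2 }
    END {
      n = NR
      if (n == 0) exit
      mean = sum / n
      for (i = 1; i <= n; i++) { d = val[i] - mean; ss += d * d }
      sd = sqrt(ss / n)
      for (i = 1; i <= n; i++) {
        z = (sd > 0) ? (val[i] - mean) / sd : 0
        # Fold only when enough taxa are present and z falls below the cutoff.
        if (n >= mintaxa && z < zmin) { other += val[i]; folded++ }
        else printf "%s\t%s\n", name[i], val[i]
      }
      if (folded) printf "Other\t%s\n", other
    }'
}
```

Rows are sorted by abundance first, so the kept taxa come out in descending order with the "Other" row last. With fewer taxa than the minimum, every row passes through unchanged.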
| Parameter | Description | Default |
|---|---|---|
| --pipeline | bacterial-assembly, autocycler, or metagenomics | bacterial-assembly |
| --sample_id | Sample identifier; scopes outputs and namespaces contig IDs | required |
| --long_reads | Long-read FASTQ (gzipped or plain) | required |
| --assembler | canu or flye (bacterial-assembly only) | required |
| --tech | nanopore, nanopore-hq, or pacbio | required |
| --read1 / --read2 | Paired short reads (required when --pilon_iter > 0) | none |
| --genome_size | e.g. 5.5m, 2.6g (required for Canu) | none |
| --racon_iter | Racon polishing iterations | 1 |
| --pilon_iter | Pilon polishing iterations (0 = skip Pilon + CheckM2) | 1 |
| --checkm2_db | Path to uniref100.KO.1.dmnd | none |
| --kraken2_db | Kraken2 DB dir with hash.k2d/opts.k2d/taxo.k2d (metagenomics only) | none |
| --kraken2_mem | Memory request for Kraken2 (≈ DB size) | 64 GB |
| --kraken2_z_min | taxa_json z-score cutoff; taxa below fold into an "Other" bucket (metagenomics only) | -1.0 |
| --kraken2_min_taxa | Keep all taxa (no bucketing) below this count (metagenomics only) | 3 |
| --outdir | Output directory | results |
| --threads | Threads per process | 1 |
| --queue | SLURM partition (-profile slurm only) | none |
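The dependencies between parameters in the table (Pilon needs both short-read files, Canu needs a genome size) can be checked in a wrapper script before launching a long run. The sketch below is hypothetical and not part of NosoGraph; the workflow performs its own validation, and check_params only mirrors the documented rules.

```shell
# Hypothetical pre-flight check mirroring the documented parameter rules.
# Args: assembler pilon_iter read1 read2 genome_size
check_params() {
  local assembler="$1" pilon_iter="$2" read1="$3" read2="$4" genome_size="$5"
  case "$assembler" in
    canu|flye) ;;
    *) echo "assembler must be canu or flye" >&2; return 1 ;;
  esac
  # Pilon polishing needs both short-read files.
  if [ "$pilon_iter" -gt 0 ] && { [ -z "$read1" ] || [ -z "$read2" ]; }; then
    echo "pilon_iter > 0 requires --read1 and --read2" >&2
    return 1
  fi
  # Canu cannot run without a genome-size estimate.
  if [ "$assembler" = canu ] && [ -z "$genome_size" ]; then
    echo "canu requires --genome_size" >&2
    return 1
  fi
  return 0
}
```

For example, check_params canu 0 "" "" 5.5m succeeds, while check_params flye 1 "" "" "" fails because short reads are missing.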
Add -profile slurm to submit each process as an independent job, and name your partition with --queue:
nextflow run main.nf \
-profile slurm \
--queue normal \
--sample_id BAC_S001 \
--long_reads /data/reads/BAC_S001_long.fastq.gz \
--read1 /data/reads/BAC_S001_R1.fastq.gz \
--read2 /data/reads/BAC_S001_R2.fastq.gz \
--assembler flye \
--tech nanopore \
--outdir results \
--threads 16 \
--racon_iter 2 \
--pilon_iter 1 \
--checkm2_db /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd
Default per-process resource requests (override in modules/vendor/bacterial-assembly/nextflow.config):
| Process | CPUs | Memory | Time |
|---|---|---|---|
| Assembly (Flye / Canu) | --threads | 32 GB | 24 h |
| Racon iteration | --threads | 32 GB | 24 h |
| Pilon iteration | --threads | 28 GB | 12 h |
| CheckM2 | --threads | 32 GB | 12 h |
Failed processes retry once on common HPC kill signals (OOM 137, timeout 140/143, segfault 139).
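The retry policy is handled inside Nextflow, but its logic is easy to picture as a shell wrapper. The following sketch is illustrative only: run_with_retry is a hypothetical function, and it simply re-runs a command once when the exit code is one of those listed above.

```shell
# Hypothetical sketch of "retry once on kill-type exit codes".
# Nextflow applies this policy itself; the wrapper only illustrates it.
run_with_retry() {
  local status=0
  "$@" || status=$?
  case "$status" in
    137|139|140|143)                 # OOM kill, segfault, time limit, SIGTERM
      echo "exit $status: retrying once" >&2
      status=0
      "$@" || status=$?
      ;;
  esac
  return "$status"
}
```

Any other non-zero exit code is returned immediately without a second attempt, so ordinary tool errors still fail fast.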
NosoGraph is conda-first: tool environments are resolved with conda/micromamba on the
compute nodes (see Step 3), and each process declares only a conda directive — there are no
container directives. On a cluster where micromamba is available, the slurm profile plus
conda envs already cover scheduling and reproducibility, so no container runtime is required.
There is intentionally no Singularity profile at present. If a future cluster is air-gapped,
loses conda channels, or mandates containers, the planned path is to build .sif images from the
existing conda/*.yaml recipes via Seqera Wave
(wave.enabled + singularity.enabled), which reuses the conda recipes with the least
duplication. This is a documented future option, not a current dependency.
nextflow run main.nf -resume ... # add -profile slurm if applicable
Resume reuses cached completed steps; only the failed or remaining steps re-run.
After a successful run, results/ looks like:
results/
├── 01_assembly/
│ ├── assembly.contigs.fasta # de novo assembly
│ └── assembly_info.txt # contig length / coverage / circularity (Flye)
├── 02_polish/
│ ├── 01_racon/ # Racon-polished intermediates
│ └── 02_pilon/ # Pilon-polished assembly (hybrid only)
├── 03_qc/
│ └── quality_report.tsv # CheckM2 completeness / contamination
└── BAC_S001/
└── kg/ # Neo4j-ready CSVs ← import these
├── sample.csv
├── assembly.csv
├── biodata_files.csv
└── contigs.csv
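A quick way to confirm that the kg/ bundle is complete is to test for the four CSVs shown in the tree. This is a minimal sketch; check_kg_bundle is a hypothetical helper, and the file list is taken from the layout above.

```shell
# Hypothetical helper: confirm a sample's kg/ bundle holds the four CSVs listed above.
# Usage: check_kg_bundle results BAC_S001
check_kg_bundle() {
  local kg_dir="$1/$2/kg" missing=0 f
  for f in sample.csv assembly.csv biodata_files.csv contigs.csv; do
    if [ ! -s "$kg_dir/$f" ]; then   # -s: file exists and is not empty
      echo "missing or empty: $kg_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The function returns non-zero if any file is absent or empty, so it can gate the copy into Neo4j in a script.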
Spot-check the assembly and the KG bundle:
grep -c "^>" results/01_assembly/assembly.contigs.fasta # number of contigs
column -s, -t results/BAC_S001/kg/contigs.csv | head # contig table
cat results/03_qc/quality_report.tsv # CheckM2 metrics
The kg/ CSVs are what you copy into Neo4j in the next page. Continue to the Knowledge graph Tutorial.
Because each run is one sample, loop in the shell for a batch:
while IFS=, read -r sid long r1 r2; do
nextflow run main.nf \
--sample_id "$sid" \
--long_reads "$long" \
--read1 "$r1" \
--read2 "$r2" \
--assembler flye \
--tech nanopore \
--outdir "results" \
--threads 16 \
--racon_iter 2 --pilon_iter 2 \
--checkm2_db /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd
done < samples.csv
Each sample lands in its own results/<sample_id>/kg/, ready to copy into Neo4j's import directory under a matching <sample_id>/ subfolder.
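Staging the per-sample bundles can be scripted along these lines. The sketch is hypothetical: stage_kg is not part of NosoGraph, the import directory path is whatever your Neo4j installation uses, and it assumes the CSVs belong directly inside <import_dir>/<sample_id>/.

```shell
# Hypothetical helper: copy each <outdir>/<sample_id>/kg/*.csv into <import_dir>/<sample_id>/.
# Usage: stage_kg results /path/to/neo4j/import
stage_kg() {
  local outdir="$1" import_dir="$2" kg sid
  for kg in "$outdir"/*/kg; do
    [ -d "$kg" ] || continue                 # skip when the glob matched nothing
    sid="$(basename "$(dirname "$kg")")"     # parent directory name is the sample_id
    mkdir -p "$import_dir/$sid"
    cp "$kg"/*.csv "$import_dir/$sid"/
  done
}
```

Only directories that contain a kg/ subfolder are picked up, so intermediate folders such as 01_assembly/ are ignored.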
| Symptom | Cause / fix |
|---|---|
| Missing required param: --sample_id (or --long_reads, --assembler, --tech) | The workflow validates required params before running; supply the missing flag. |
| Pilon polishing (pilon_iter > 0) requires --read1 and --read2 | Either provide short reads or set --pilon_iter 0 for long-read-only assembly. |
| Canu errors about genome size | Canu needs --genome_size (e.g. 5.5m). |
| First run is slow | Conda environment creation is one-off; later runs reuse it. Build it the day before a demo. |
| Want to validate without data | Use -stub-run -profile test (Step 5). |