Pipeline Tutorial

Sara Wattanasombat edited this page Jul 2, 2026 · 6 revisions

This page walks through a complete first-time setup on a fresh Linux machine, from installing prerequisites through running the NosoGraph bacterial-assembly pipeline on a real long-read sample and producing the per-sample kg/ knowledge-graph CSVs.

Tested on: Ubuntu 22.04


1. Install Java (OpenJDK)

Nextflow requires Java 17 or later. Installing it through SDKMAN! is the recommended route.

Ubuntu / Debian

curl -s "https://get.sdkman.io" | bash

source "$HOME/.sdkman/bin/sdkman-init.sh"

sdk install java

Verify

java -version
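
To check the Java 17 requirement from a script rather than by eye, a small parser along the following lines may help. It is only a sketch: java_major is a hypothetical helper name, and it assumes the usual banner format in which the version appears in double quotes after the word "version".

```shell
# java_major LINE: extract the major version from one line of `java -version` output.
# Handles the modern scheme ("17.0.9" -> 17) and the legacy one ("1.8.0_292" -> 8).
# Hypothetical helper, not part of NosoGraph or Nextflow.
java_major() {
  local line="$1" ver major
  ver=$(printf '%s\n' "$line" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  [ -n "$ver" ] || return 1
  major=${ver%%.*}
  if [ "$major" = "1" ]; then
    # Legacy numbering: 1.8.x means Java 8
    ver=${ver#1.}
    major=${ver%%.*}
  fi
  # Drop any non-numeric suffix such as "-ea"
  printf '%s\n' "${major%%[!0-9]*}"
}

# Usage (java prints its version banner to stderr):
#   major=$(java_major "$(java -version 2>&1 | head -n 1)")
#   [ "$major" -ge 17 ] || echo "Java $major is too old for Nextflow" >&2
```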

2. Install Nextflow

Nextflow is a single self-contained launcher script. No root required.

curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p "$HOME/.local/bin"      # make sure the target directory exists
mv nextflow "$HOME/.local/bin/"
nextflow -version
# Expected: nextflow version 26.x.x ...

Minimum version: NosoGraph requires Nextflow ≥ 23.04. If nextflow -version shows an older release, run nextflow self-update.
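
The minimum-version rule above can also be checked in a script. The following is a minimal sketch, assuming GNU sort (for -V) and assuming the version appears in the nextflow -version banner as three dot-separated numbers. Both function names are hypothetical.

```shell
# version_ge A B: succeed when dotted version A is >= B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# nextflow_version_from: read `nextflow -version` text on stdin and print
# the first x.y.z token found in it.
nextflow_version_from() {
  grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage:
#   v=$(nextflow -version | nextflow_version_from)
#   version_ge "$v" 23.04 || nextflow self-update
```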


3. Install micromamba

The pipeline resolves all tool environments automatically via conda/micromamba. Micromamba is lighter than full Anaconda or Miniconda.

"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
source ~/.bashrc      # or ~/.zshrc
micromamba --version

Alternative managers: plain conda and mamba can also be used if preferred. Conda is enabled in nextflow.config; environments are created on first run, so no manual micromamba create is needed.


4. Clone NosoGraph

git clone https://github.com/STTLab/NosoGraph.git
cd NosoGraph

The repository keeps the assembly/polish/QC module under modules/vendor/bacterial-assembly/, which owns its own conda environment. Additional environments live under conda/.


5. Validate the wiring first (no data, no tools)

Before running real data, confirm the workflow DAG compiles end-to-end. The test profile disables conda so this runs in seconds with no tools installed, and input paths are not checked for existence:

nextflow run main.nf -stub-run -profile test \
    --sample_id BAC_S001 \
    --assembler canu \
    --tech nanopore \
    --genome_size 5.5m \
    --long_reads dummy.fastq.gz \
    --read1 dummy_R1.fastq.gz \
    --read2 dummy_R2.fastq.gz \
    --racon_iter 2 \
    --pilon_iter 2 \
    --checkm2_db dummy.dmnd \
    --outdir /tmp/nf_test

Expected:

[PROCESS] BACTERIAL_ASSEMBLY:ASSEMBLY_CANU (1)
[PROCESS] BACTERIAL_ASSEMBLY:RACON_POLISH (1)
[PROCESS] BACTERIAL_ASSEMBLY:PILON_POLISH (1)
[PROCESS] BACTERIAL_ASSEMBLY:CHECKM2 (1)
[PROCESS] KG_EXPORT (BAC_S001)

[SUCCESS] completed=5 failed=0 cached=0

Both --assembler canu and --assembler flye wire correctly. If you only want to confirm wiring, you can stop here.
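
To confirm both assemblers in one go, the Step 5 command can be wrapped in a loop. This is a sketch that reuses the flags shown above unchanged. The NEXTFLOW override and the function names are illustrative additions, not part of the pipeline.

```shell
# stub_check ASSEMBLER: run the Step 5 stub command for one assembler.
# NEXTFLOW may be overridden (e.g. NEXTFLOW=echo) to print the command
# instead of executing it; that override is an assumption of this sketch.
stub_check() {
  "${NEXTFLOW:-nextflow}" run main.nf -stub-run -profile test \
    --sample_id BAC_S001 \
    --assembler "$1" \
    --tech nanopore \
    --genome_size 5.5m \
    --long_reads dummy.fastq.gz \
    --read1 dummy_R1.fastq.gz \
    --read2 dummy_R2.fastq.gz \
    --racon_iter 2 \
    --pilon_iter 2 \
    --checkm2_db dummy.dmnd \
    --outdir /tmp/nf_test
}

# stub_check_all: stop at the first assembler whose stub run fails.
stub_check_all() {
  local a
  for a in canu flye; do
    stub_check "$a" || { echo "stub run failed for --assembler $a" >&2; return 1; }
  done
}
```

Run stub_check_all from the NosoGraph checkout; a non-zero exit status means one of the two wirings did not complete.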


6. Download the CheckM2 database (optional but recommended)

CheckM2 estimates assembly completeness and contamination. It needs the UniRef100/KO DIAMOND database. CheckM2 runs only when --pilon_iter > 0; skip this section if you are doing a long-read-only assembly without QC.

# Install CheckM2 into a throwaway env to get its downloader
micromamba create -n checkm2 -c bioconda -c conda-forge checkm2
micromamba run -n checkm2 checkm2 database --download --path /data/checkm2_db

After download you will have:

/data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd

Pass that file path to --checkm2_db.
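
If the exact file name differs on your system, the path can be looked up rather than typed. The helper below is a sketch with a hypothetical name; it assumes the download directory contains exactly one .dmnd file.

```shell
# find_checkm2_db DIR: print the path of the single .dmnd file below DIR,
# or fail if there are zero or several candidates.
find_checkm2_db() {
  local dir="$1" hits count
  hits=$(find "$dir" -type f -name '*.dmnd' 2>/dev/null)
  count=$(printf '%s' "$hits" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one .dmnd under $dir, found $count" >&2
    return 1
  fi
  printf '%s\n' "$hits"
}

# Usage:
#   db=$(find_checkm2_db /data/checkm2_db) && echo "--checkm2_db $db"
```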


7. Prepare your sample

path/to/data/reads/
├── sample_1_long.fastq.gz      # Oxford Nanopore long reads
├── sample_1_R1.fastq.gz        # Illumina short reads R1 (for hybrid polishing)
└── sample_1_R2.fastq.gz        # Illumina short reads R2

Use a sample_id that matches the Samples.csv you will load into the graph — here BAC_S001, which the bundled example/csv/Samples.csv already links to specimen SP003. The sample_id both scopes outputs to <outdir>/BAC_S001/ and namespaces contig IDs as BAC_S001:contig_1, BAC_S001:contig_2, … so they stay globally unique across samples.
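
To make the ID scheme concrete, the sketch below applies the same sample_id:contig_N pattern to FASTA headers. It illustrates the naming convention only; how the pipeline itself renames contigs is not shown on this page, and namespace_contigs is a hypothetical helper.

```shell
# namespace_contigs SAMPLE_ID: prefix every FASTA header on stdin with "SAMPLE_ID:".
# Sequence lines pass through unchanged.
namespace_contigs() {
  awk -v sid="$1" '/^>/ { print ">" sid ":" substr($0, 2); next } { print }'
}

# Usage:
#   namespace_contigs BAC_S001 < assembly.fasta | grep '^>' | head
```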


8. Run the pipeline

Option A — Hybrid assembly with Flye (long + short reads)

Highest-quality consensus: Flye assembly, Racon long-read polishing, then Pilon short-read polishing, then CheckM2.

nextflow run main.nf \
    --sample_id   BAC_S001 \
    --long_reads  path/to/data/reads/sample_1_long.fastq.gz \
    --read1       path/to/data/reads/sample_1_R1.fastq.gz \
    --read2       path/to/data/reads/sample_1_R2.fastq.gz \
    --assembler   flye \
    --tech        nanopore \
    --outdir      results \
    --threads     16 \
    --racon_iter  2 \
    --pilon_iter  2 \
    --checkm2_db  /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd

On the first run, Nextflow resolves and builds the bacterial-assembly conda environment (~10–15 min); subsequent runs reuse it.

Option B — Long-read-only assembly with Canu

No short reads, so Pilon and CheckM2 are skipped (--pilon_iter 0). Canu requires a genome-size estimate:

nextflow run main.nf \
    --sample_id   BAC_S001 \
    --long_reads  path/to/data/reads/sample_1_long.fastq.gz \
    --assembler   canu \
    --tech        nanopore \
    --genome_size 5.5m \
    --pilon_iter  0 \
    --outdir      results \
    --threads     32

Option C — Metagenomics (pathogen identification)

A separate pipeline (--pipeline metagenomics) classifies long reads against a pre-built Kraken2 database and exports a high-level, pathogen-ID knowledge graph — in a single Nextflow run. The vendored kraken2-classify module (modules/vendor/kraken2-classify/) produces the Kraken2 report; META_KG_EXPORT turns it into the kg/ CSVs.

nextflow run main.nf \
    --pipeline    metagenomics \
    --sample_id   sample_1 \
    --long_reads  path/to/data/reads/sample_1_long.fastq.gz \
    --kraken2_db  /data/k2_standard \
    --outdir      results \
    --threads     16

--kraken2_db must point at a Kraken2 DB directory containing hash.k2d, opts.k2d, and taxo.k2d. Kraken2 loads the whole DB into RAM, so size the memory request to the DB with --kraken2_mem (default 64 GB; raise it for the full Standard DB).

Outputs are scoped by sample ID, so the command above writes to results/sample_1/: kraken2/sample_1.kraken2.report.txt plus kg/ (taxonomic_classification.csv, meta_reads.csv).

Only species (rank S) and genus (rank G) rows are kept. They are carried as a taxa_json QC blob on the TaxonomicClassification node rather than as Organism nodes, because Kraken2 output is an untrusted per-run classification and is not meant to be traversed in the graph. The blob is sorted by abundance and pre-filtered by an adaptive z-score bucket: taxa whose z-score falls below --kraken2_z_min (default -1.0) fold into a single "Other" row, and filtering is skipped when there are fewer than --kraken2_min_taxa (default 3) taxa.

See the Knowledge graph Tutorial for loading the taxonomic-classification subgraph (LOAD DATA steps 17–18 and the Pathogens detected per sample query).
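
The z-score bucketing can be sketched in a few lines of awk. This illustrates the idea rather than META_KG_EXPORT's actual code: it assumes the z-score is computed on raw abundance values with a population standard deviation, and it reads hypothetical name,abundance rows. The real exporter may use a different statistic or input layout.

```shell
# bucket_taxa Z_MIN MIN_TAXA: read "name,abundance" rows on stdin, sort them by
# abundance (descending), and fold rows with z-score < Z_MIN into one "Other" row.
# With fewer than MIN_TAXA rows nothing is folded.
# Assumption: z = (abundance - mean) / population standard deviation.
bucket_taxa() {
  local z_min="$1" min_taxa="$2"
  sort -t, -k2,2nr | awk -F, -v zmin="$z_min" -v mintaxa="$min_taxa" '
    { name[NR] = $1; ab[NR] = $2; sum += $2; sumsq += $2 * $2 }
    END {
      n = NR
      if (n == 0) exit
      mean = sum / n
      var = sumsq / n - mean * mean
      sd = (var > 0) ? sqrt(var) : 0
      other = 0; folded = 0
      for (i = 1; i <= n; i++) {
        z = (sd > 0) ? (ab[i] - mean) / sd : 0
        if (n >= mintaxa && z < zmin) { other += ab[i]; folded++ }
        else print name[i] "," ab[i]
      }
      if (folded > 0) print "Other," other
    }'
}

# Usage:
#   printf 'A,50\nB,40\nC,30\nD,1\n' | bucket_taxa -1.0 3
```

In the usage example the mean is 30.25 and the standard deviation is about 18.3, so only D (z of roughly -1.6) falls below the cutoff and is folded into "Other".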

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --pipeline | bacterial-assembly, autocycler, or metagenomics | bacterial-assembly |
| --sample_id | Sample identifier; scopes outputs and namespaces contig IDs | required |
| --long_reads | Long-read FASTQ (gzipped or plain) | required |
| --assembler | canu or flye (bacterial-assembly only) | required |
| --tech | nanopore, nanopore-hq, or pacbio | required |
| --read1 / --read2 | Paired short reads (required when --pilon_iter > 0) | |
| --genome_size | e.g. 5.5m, 2.6g (required for Canu) | |
| --racon_iter | Racon polishing iterations | 1 |
| --pilon_iter | Pilon polishing iterations (0 = skip Pilon + CheckM2) | 1 |
| --checkm2_db | Path to uniref100.KO.1.dmnd | |
| --kraken2_db | Kraken2 DB dir with hash.k2d/opts.k2d/taxo.k2d (metagenomics only) | |
| --kraken2_mem | Memory request for Kraken2 (≈ DB size) | 64 GB |
| --kraken2_z_min | taxa_json z-score cutoff; taxa below fold into an "Other" bucket (metagenomics only) | -1.0 |
| --kraken2_min_taxa | Keep all taxa (no bucketing) below this count (metagenomics only) | 3 |
| --outdir | Output directory | results |
| --threads | Threads per process | 1 |
| --queue | SLURM partition (-profile slurm only) | |
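
The conditional requirements in the table (short reads when --pilon_iter > 0, a genome size for Canu) can be checked before launching a long run. The sketch below mirrors those rules in plain shell. It is a convenience wrapper with a hypothetical name, not the workflow's own validation, which runs inside Nextflow.

```shell
# preflight ASSEMBLER PILON_ITER READ1 READ2 GENOME_SIZE
# Pass an empty string for any flag you do not intend to supply.
preflight() {
  local assembler="$1" pilon_iter="$2" read1="$3" read2="$4" genome_size="$5"
  case "$assembler" in
    canu|flye) ;;
    *) echo "--assembler must be canu or flye" >&2; return 1 ;;
  esac
  # Pilon needs both mates of the short-read pair
  if [ "$pilon_iter" -gt 0 ] && { [ -z "$read1" ] || [ -z "$read2" ]; }; then
    echo "--pilon_iter > 0 requires --read1 and --read2" >&2
    return 1
  fi
  # Canu needs a genome-size estimate
  if [ "$assembler" = "canu" ] && [ -z "$genome_size" ]; then
    echo "canu requires --genome_size (e.g. 5.5m)" >&2
    return 1
  fi
}

# Usage:
#   preflight flye 2 reads_R1.fastq.gz reads_R2.fastq.gz "" && nextflow run main.nf ...
```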

9. Run on SLURM (HPC)

Add -profile slurm to submit each process as an independent job, and name your partition with --queue:

nextflow run main.nf \
    -profile slurm \
    --queue       normal \
    --sample_id   BAC_S001 \
    --long_reads  /data/reads/BAC_S001_long.fastq.gz \
    --read1       /data/reads/BAC_S001_R1.fastq.gz \
    --read2       /data/reads/BAC_S001_R2.fastq.gz \
    --assembler   flye \
    --tech        nanopore \
    --outdir      results \
    --threads     16 \
    --racon_iter  2 \
    --pilon_iter  1 \
    --checkm2_db  /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd

Default per-process resource requests (override in modules/vendor/bacterial-assembly/nextflow.config):

| Process | CPUs | Memory | Time |
| --- | --- | --- | --- |
| Assembly (Flye / Canu) | --threads | 32 GB | 24 h |
| Racon iteration | --threads | 32 GB | 24 h |
| Pilon iteration | --threads | 28 GB | 12 h |
| CheckM2 | --threads | 32 GB | 12 h |

Failed processes retry once when they exit with a code that typically means the job was killed on the cluster: 137 (out of memory), 140 or 143 (time limit), or 139 (segfault).
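
The retry rule is applied by Nextflow inside the pipeline, but its logic is easy to see in a shell analogue. The sketch below is an illustration only, with a hypothetical function name: it re-runs a command exactly once when the exit status is one of the four codes listed above.

```shell
# retry_once_on_kill CMD [ARGS...]: run CMD; if it exits with 137, 139, 140 or 143,
# run it one more time. Any other exit status is returned unchanged.
retry_once_on_kill() {
  local status=0
  "$@" || status=$?
  case "$status" in
    137|139|140|143)
      echo "exit $status looks like a scheduler or kernel kill; retrying once" >&2
      status=0
      "$@" || status=$?
      ;;
  esac
  return "$status"
}

# Usage:
#   retry_once_on_kill ./long_running_step.sh
```

Ordinary failures (for example exit status 1 from a bad argument) are not retried, since a second attempt would fail the same way.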

Containers (Singularity / Apptainer)

NosoGraph is conda-first: tool environments are resolved with conda/micromamba on the compute nodes (see Step 3), and each process declares only a conda directive — there are no container directives. On a cluster where micromamba is available, the slurm profile plus conda envs already cover scheduling and reproducibility, so no container runtime is required.

There is intentionally no Singularity profile at present. If a future cluster is air-gapped, loses conda channels, or mandates containers, the planned path is to build .sif images from the existing conda/*.yaml recipes via Seqera Wave (wave.enabled + singularity.enabled), which reuses the conda recipes with the least duplication. This is a documented future option, not a current dependency.

Resume after interruption

nextflow run main.nf -resume ...    # add -profile slurm if applicable

Resume reuses cached completed steps — only the failed/remaining steps re-run.


10. Inspect the outputs

After a successful run, results/ looks like:

results/
├── 01_assembly/
│   ├── assembly.contigs.fasta        # de novo assembly
│   └── assembly_info.txt             # contig length / coverage / circularity (Flye)
├── 02_polish/
│   ├── 01_racon/                     # Racon-polished intermediates
│   └── 02_pilon/                     # Pilon-polished assembly (hybrid only)
├── 03_qc/
│   └── quality_report.tsv            # CheckM2 completeness / contamination
└── BAC_S001/
    └── kg/                           # Neo4j-ready CSVs ← import these
        ├── sample.csv
        ├── assembly.csv
        ├── biodata_files.csv
        └── contigs.csv

Spot-check the assembly and the KG bundle:

grep -c "^>" results/01_assembly/assembly.contigs.fasta   # number of contigs
column -s, -t results/BAC_S001/kg/contigs.csv | head      # contig table
cat results/03_qc/quality_report.tsv                      # CheckM2 metrics
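
Before copying the bundle anywhere, it can help to confirm that all four CSVs from the tree above are present and non-empty. A minimal sketch, with a hypothetical helper name:

```shell
# check_kg_bundle KG_DIR: verify that the four kg/ CSVs exist and are non-empty.
check_kg_bundle() {
  local dir="$1" f missing=0
  for f in sample.csv assembly.csv biodata_files.csv contigs.csv; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_kg_bundle results/BAC_S001/kg && echo "kg bundle looks complete"
```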

The kg/ CSVs are what you copy into Neo4j on the next page. Continue to the Knowledge graph Tutorial.


11. Run your own samples (batch)

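Each run handles one sample, so loop in the shell for a batch. The loop below expects a headerless samples.csv with four comma-separated columns: sample ID, long reads, R1, and R2.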

while IFS=, read -r sid long r1 r2; do
  nextflow run main.nf \
    --sample_id  "$sid" \
    --long_reads "$long" \
    --read1      "$r1" \
    --read2      "$r2" \
    --assembler  flye \
    --tech       nanopore \
    --outdir     "results" \
    --threads    16 \
    --racon_iter 2 --pilon_iter 2 \
    --checkm2_db /data/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd
done < samples.csv

Each sample lands in its own results/<sample_id>/kg/, ready to copy into Neo4j's import directory under a matching <sample_id>/ subfolder.
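
A malformed row in samples.csv would only surface partway through a long batch, so it may be worth validating the file first. The sketch below checks the four-column layout that the loop's read statement implies and confirms that each read file exists. validate_samples is a hypothetical helper, and the no-header layout is an assumption drawn from that loop.

```shell
# validate_samples CSV: check that every row has exactly four non-empty
# comma-separated fields (sample_id, long reads, R1, R2) and that the
# three read files exist. Returns non-zero if any row is bad.
validate_samples() {
  local csv="$1" sid long r1 r2 extra f n=0 bad=0
  # `|| [ -n "$sid" ]` keeps a final line that lacks a trailing newline
  while IFS=, read -r sid long r1 r2 extra || [ -n "$sid" ]; do
    n=$((n + 1))
    if [ -z "$sid" ] || [ -z "$long" ] || [ -z "$r1" ] || [ -z "$r2" ] || [ -n "$extra" ]; then
      echo "line $n: expected 4 comma-separated fields" >&2
      bad=1
      continue
    fi
    for f in "$long" "$r1" "$r2"; do
      [ -f "$f" ] || { echo "line $n: no such file: $f" >&2; bad=1; }
    done
  done < "$csv"
  return "$bad"
}

# Usage:
#   validate_samples samples.csv && echo "samples.csv looks usable"
```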


Troubleshooting

| Symptom | Cause / fix |
| --- | --- |
| Missing required param: --sample_id (or --long_reads, --assembler, --tech) | The workflow validates required params before running; supply the missing flag. |
| Pilon polishing (pilon_iter > 0) requires --read1 and --read2 | Either provide short reads or set --pilon_iter 0 for long-read-only assembly. |
| Canu errors about genome size | Canu needs --genome_size (e.g. 5.5m). |
| First run is slow | Conda environment creation is one-off; later runs reuse it. Build it the day before a demo. |
| Want to validate without data | Use -stub-run -profile test (Step 5). |