Pipeline Tutorial

minaminii edited this page Jul 1, 2026 · 3 revisions

This page walks through a complete first-time setup on a fresh Linux machine, from installing prerequisites through running ViroWatch on the bundled test sample with clinical data attached.

Tested on: Ubuntu 22.04


1. Install Java (OpenJDK)

Nextflow requires Java 11 or later. OpenJDK 21 (LTS) is recommended.

Ubuntu / Debian

sudo apt update
sudo apt install -y openjdk-21-jdk

macOS

brew install openjdk@21
# Follow the brew post-install instructions to add java to PATH, e.g.:
echo 'export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Verify

java -version
# Expected: openjdk version "21.x.x" ...
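
To script this check rather than read the output by eye, a small parser can pull the major version out of the version line. This is a minimal sketch: java_major is a hypothetical helper name, and it assumes the first line of the output contains version "…" in double quotes, as OpenJDK prints it.

```shell
# Hypothetical helper (not part of ViroWatch or Nextflow): print the major
# Java version found in the first line of `java -version` output.
java_major() {
    local line="$1" ver
    # Pull out the quoted version string, e.g. 21.0.2 or 1.8.0_392.
    ver=$(printf '%s\n' "$line" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        1.*) ver="${ver#1.}" ;;   # legacy scheme: 1.8 means Java 8
    esac
    # Keep only the leading number, dropping anything after ".", "_" or "-".
    printf '%s\n' "${ver%%[._-]*}"
}

# Usage sketch: java -version writes to stderr, so redirect it before parsing.
# major=$(java_major "$(java -version 2>&1 | head -n 1)")
# [ "$major" -ge 11 ] || echo "Java 11 or later is required" >&2
```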

2. Install Nextflow

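Nextflow is distributed as a single launcher script. Installing it needs no root unless you move it to a system directory such as /usr/local/bin.

# Download the launcher into the current directory
curl -s https://get.nextflow.io | bash

# Make it executable and move it to a directory on your PATH
# (use ~/bin instead of /usr/local/bin to avoid sudo)
chmod +x nextflow
sudo mv nextflow /usr/local/bin/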

# Verify
nextflow -version
# Expected: nextflow version 24.x.x ...

Minimum version: ViroWatch requires Nextflow ≥ 23.10.0. If nextflow -version shows an older release, run nextflow self-update.
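
The minimum-version check can also be scripted. The sketch below uses a hypothetical helper named version_ge and assumes a sort that supports -V (GNU coreutils does; recent macOS versions of sort accept it too).

```shell
# Hypothetical helper, not part of Nextflow or ViroWatch.
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if $2 sorts first (or ties), then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage sketch (assumes the banner has a line like "version 24.04.2 build 5914"):
# installed=$(nextflow -version | sed -n 's/.*version \([0-9.]*\).*/\1/p')
# version_ge "$installed" 23.10.0 || nextflow self-update
```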


3. Install micromamba

ViroWatch resolves all tool environments automatically via conda/micromamba. Micromamba is faster and lighter than full Anaconda or Miniconda — it is the recommended option.

# Install micromamba to ~/micromamba (no root required)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

The installer will:

  • download the micromamba binary
  • configure your shell (.bashrc / .zshrc) with the mamba initialisation block
  • ask where to place the binary and base prefix (defaults are fine)

After installation, restart your shell or run:

source ~/.bashrc   # or ~/.zshrc on macOS/zsh

Verify

micromamba --version
# Expected: 1.x.x or 2.x.x

Alternative managers: Plain conda (from Miniconda) and mamba work too — pass -profile mamba when running the pipeline. The tutorial below uses micromamba with -profile mamba.


4. Clone ViroWatch

git clone https://github.com/STTLab/ViroWatch.git
cd ViroWatch

The repository is self-contained: reference genomes, test reads, and the pre-built LosAlamos BLAST database are all bundled under assets/.


5. Extract the LosAlamos BLAST database

A pre-built BLAST database of 15,471 HIV-1 sequences (indexed + taxdb) is bundled as a compressed tarball. Extract it once to a persistent location — it does not need to live inside the repo.

# Create a directory for BLAST databases (adjust path to your server's convention)
mkdir -p /data/blast_dbs

# Extract
tar -xzf assets/blast/LosAlamos_db.tar.gz -C /data/blast_dbs/

After extraction you will have:

/data/blast_dbs/LosAlamos_db/
  LosAlamos_db.nhr
  LosAlamos_db.nin
  LosAlamos_db.nsq
  LosAlamos_db.ntf
  LosAlamos_db.nto
  taxdb.btd
  taxdb.bti
  ...

Note the prefix path — you will pass it to --blast_db. The prefix is the base name without any extension:

--blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db
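
If you are unsure what the prefix is, it can be derived from the index files themselves. The following sketch uses a hypothetical helper, blast_prefixes, and assumes a single-volume nucleotide database like the one listed above (one .nin file per database); multi-volume databases are laid out differently.

```shell
# Hypothetical helper (not part of ViroWatch or BLAST+): print the BLAST
# prefix of each nucleotide database in a directory by stripping the
# extension from its .nin index file.
blast_prefixes() {
    local dir="${1%/}" f
    for f in "$dir"/*.nin; do
        [ -e "$f" ] || continue        # the glob matched nothing
        printf '%s\n' "${f%.nin}"
    done
}

# Usage sketch:
# blast_prefixes /data/blast_dbs/LosAlamos_db
```

The taxdb.btd and taxdb.bti files are ignored because they carry no .nin index, so only the database prefix is printed.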

Database contents

Subtype                     Count    TaxID
HIV-1 M:B                   10,096   505185
HIV-1 M:C                   2,402    505186
HIV-1 M:CRF01_AE            2,124    1345266
HIV-1 M:CRF02_AG            232      1287874
HIV-1 M:D                   201      505226
HIV-1 M:CRF07_BC            183      1385609
HIV-1 M:G                   101      505228
HIV-1 (root/unclassified)   115      11676
HIV-1 M:F2                  14       1392219
HIV-1 M:A                   3        505184

CRF01_AE (13.7%) is well represented for Southeast Asian surveillance.


6. Optional: NCBI core_nt database

The second BLAST step (--core_nt_db) runs BLAST against the full NCBI core_nt nucleotide database. It is disabled by default because the database is large (~500 GB compressed). Skip this section for the test run — come back once you have a storage location.

Download core_nt

# Requires BLAST+ tools; install via micromamba if not already present
# (bioconda packages depend on the conda-forge channel):
micromamba install -n base -c conda-forge -c bioconda blast

# Download to a dedicated directory
mkdir -p /data/blast_dbs/core_nt
cd /data/blast_dbs/core_nt

# Use update_blastdb.pl to fetch and verify all volumes
update_blastdb.pl --decompress core_nt

See the NCBI BLAST database documentation for full details and alternate download methods (aws s3, rsync, GCP).

Configure the taxdb

The taxdb files must be in the same directory as (or discoverable from) the core_nt index files, and the BLASTDB environment variable must point to that directory. In a Nextflow config file, set it through the env scope, which exports the variable to every task:

// In conf/site.config or ~/.nextflow/config
env {
    BLASTDB = '/data/blast_dbs/core_nt'
}

Pass the database prefix to the pipeline:

--core_nt_db /data/blast_dbs/core_nt/core_nt

6b. Optional: Kraken2 read-set taxonomy QC

When --kraken2_db is set, ViroWatch runs a Kraken2 classification of the filtered reads immediately after NanoStat, before assembly. This is a quick "what else is in my read set?" QC pass — it flags contamination or co-infection, and because it runs ahead of assembly it still produces useful output even when Flye yields no assembly. The step is skipped entirely when no DB is given.

Get a Kraken2 database

Any standard pre-built Kraken2 database works. A small Standard-8 (≈8 GB, capped) index is a good starting point; the full Standard index is larger. Pre-built indexes are published at https://benlangmead.github.io/aws-indexes/k2.

# Example: fetch and unpack a pre-built index
mkdir -p /data/kraken2_dbs/k2_standard_08gb
cd /data/kraken2_dbs/k2_standard_08gb
curl -sSL -O https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
tar -xzf k2_standard_08gb_20240605.tar.gz

The database directory must contain hash.k2d, opts.k2d, and taxo.k2d. Point --kraken2_db at the directory:

--kraken2_db /data/kraken2_dbs/k2_standard_08gb
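
A quick pre-flight check can confirm that the three required files are present before a run. This is a sketch with a hypothetical helper name, check_kraken2_db; it only tests that the files exist and are non-empty, not that the index is valid.

```shell
# Hypothetical helper (not part of ViroWatch or Kraken2): succeed only if the
# directory holds non-empty hash.k2d, opts.k2d and taxo.k2d files.
check_kraken2_db() {
    local db="${1%/}" f
    for f in hash.k2d opts.k2d taxo.k2d; do
        [ -s "${db}/${f}" ] || { echo "missing or empty: ${db}/${f}" >&2; return 1; }
    done
}

# Usage sketch:
# check_kraken2_db /data/kraken2_dbs/k2_standard_08gb && echo "database looks complete"
```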

Tuning the QC summary

The per-taxon results are folded into a single JSON property on the knowledge-graph TaxonomicClassification node (taxa_json), sorted by abundance. An adaptive filter collapses the long tail of trace taxa into a single "Other" bucket so the summary stays readable:

Parameter              Default   Effect
--kraken2_confidence   0.0       Kraken2 --confidence threshold
--kraken2_z_min        -1.0      Robust (median/MAD) log-abundance z-score cutoff; taxa below fold into "Other". Lower keeps more; higher groups more aggressively
--kraken2_min_taxa     3         Below this taxa count the adaptive filter is skipped and all taxa are kept
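
To make the adaptive filter concrete, here is one way a robust z-score fold could work. This is an illustration under stated assumptions, not ViroWatch's actual code: the helper name fold_trace_taxa, the tab-separated input, the natural logarithm, the 1.4826 MAD scaling constant, and the choice to keep everything when the MAD is zero are all assumptions.

```shell
# Hypothetical sketch of a median/MAD fold; ViroWatch's implementation may differ.
# stdin: "<taxon><TAB><read_count>" per line. Args: z_min (default -1.0), min_taxa (default 3).
fold_trace_taxa() {
    local z_min="${1:--1.0}" min_taxa="${2:-3}"
    awk -F'\t' -v z_min="$z_min" -v min_taxa="$min_taxa" '
        # Median of v[1..n], computed on an insertion-sorted copy.
        function median(v, n,    i, j, t, s) {
            for (i = 1; i <= n; i++) s[i] = v[i]
            for (i = 2; i <= n; i++) {
                t = s[i]
                for (j = i - 1; j >= 1 && s[j] > t; j--) s[j + 1] = s[j]
                s[j + 1] = t
            }
            return (n % 2) ? s[(n + 1) / 2] : (s[n / 2] + s[n / 2 + 1]) / 2
        }
        $2 + 0 > 0 { n++; name[n] = $1; count[n] = $2 + 0; x[n] = log($2) }
        END {
            if (n == 0) exit
            med = median(x, n)
            for (i = 1; i <= n; i++) dev[i] = (x[i] > med) ? x[i] - med : med - x[i]
            mad = median(dev, n)
            other = 0
            for (i = 1; i <= n; i++) {
                # Too few taxa or no spread: skip the filter and keep everything.
                if (n < min_taxa || mad == 0) z = z_min
                else z = (x[i] - med) / (1.4826 * mad)
                if (z >= z_min) printf "%s\t%d\n", name[i], count[i]
                else other += count[i]
            }
            if (other > 0) printf "Other\t%d\n", other
        }'
}
```

With counts of 10000, 5000, 2000, 1000 and 2 reads, only the 2-read taxon falls below a cutoff of -1.0 and is folded into "Other"; raising the cutoff towards 0 would fold more of the tail.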

Run with Kraken2 QC

nextflow run . -profile test,mamba \
    --kraken2_db /data/kraken2_dbs/k2_standard_08gb

Output lands in <sample>/kraken2/ (the raw report + output) and <sample>/kg/taxonomic_classification.csv (the QC summary for knowledge-graph import).


7. Run the test sample with clinical data

ViroWatch ships with:

  • assets/test_data/sample_test.fq.gz — ONT reads from a CRF01_AE HIV-1 isolate
  • report/example_vl.csv — viral load time-series for test_01
  • report/example_cd4.csv — CD4 count time-series for test_01

The bundled test profile wires these together automatically. Run:

nextflow run . -profile test,mamba

This single command will:

  1. Resolve and create the virowatch conda environment on first run (~5–10 min)
  2. Resolve and create the isolated medaka environment (~5 min)
  3. Process sample_test.fq.gz through all pipeline steps (Kraken2 QC is skipped unless --kraken2_db is given)
  4. Attach the viral load and CD4 time-series to the HTML report

With the LosAlamos BLAST database

Pass --blast_db on top of the test profile to also run HIV-1 subtyping:

nextflow run . -profile test,mamba \
    --blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db

Resume after interruption

If the run is interrupted (power loss, timeout, Ctrl-C), resume from the last completed step:

nextflow run . -profile test,mamba -resume

Resource limits

By default ViroWatch caps itself at 16 CPUs and 64 GB RAM. Override in conf/site.config or on the command line:

nextflow run . -profile test,mamba --max_cpus 8 --max_memory 32.GB
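
On a machine smaller than the defaults, the override values can be derived from the host instead of typed by hand. The sketch below assumes Linux (nproc and /proc/meminfo); min_int is a hypothetical helper name.

```shell
# Hypothetical helper: print the smaller of two non-negative integers.
min_int() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Usage sketch on Linux: cap at the ViroWatch defaults, or lower if the host is smaller.
# cpus=$(min_int "$(nproc)" 16)
# mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1048576)}' /proc/meminfo)   # kB to GiB
# nextflow run . -profile test,mamba \
#     --max_cpus "$cpus" --max_memory "$(min_int "$mem_gb" 64).GB"
```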

8. Inspect the outputs

After a successful run, results land in results_test/test_01/:

results_test/
└── test_01/
    ├── nanostat/                    # Read QC statistics
    ├── kraken2/                     # Kraken2 read-set QC (only if --kraken2_db)
    ├── aln.bam                      # Reference-mapped reads
    ├── qualimap/                    # Mapping QC report
    ├── flye/                        # De novo assembly
    │   └── assembly_info.txt        # Contig metadata (length, coverage, circular)
    ├── racon_iter_1.fa              # Polishing intermediate
    ├── racon_iter_2.fa
    ├── racon_iter_3.fa
    ├── medaka_consensus/
    │   └── consensus.fasta          # Final polished consensus
    ├── quast/                       # Assembly QC vs CRF01_AE reference
    ├── sierrapy.json                # Stanford HIVDB drug resistance
    ├── blast/
    │   ├── los_alamos.blast.json    # LosAlamos BLAST result (if --blast_db)
    │   └── core_nt.blast.json       # core_nt BLAST result (if --core_nt_db)
    ├── multiqc/                     # Aggregated MultiQC HTML report
    ├── test_01_report.html          # Per-sample surveillance report ← open this
    └── kg/                          # Neo4j-compatible CSVs
        ├── sample.csv
        ├── assembly.csv
        ├── biodata_files.csv
        ├── contigs.csv
        ├── stanford_alignments.csv
        ├── stanford_predictions.csv
        ├── mutations.csv
        ├── blast_hits.csv           # only if BLAST was enabled
        └── taxonomic_classification.csv  # only if --kraken2_db was enabled

Open results_test/test_01/test_01_report.html in a browser to see the full surveillance report including the viral load and CD4 trend charts.
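
For unattended runs, a short script can confirm that the key outputs from the tree above exist before anyone opens the report. This is a sketch: check_outputs is a hypothetical helper, and the four files it checks are simply a selection from the listing, not an official completeness test.

```shell
# Hypothetical helper (not part of ViroWatch): verify that a few key output
# files exist and are non-empty for one sample.
check_outputs() {
    local dir="${1%/}" sample="$2" rel missing=0
    for rel in \
        "medaka_consensus/consensus.fasta" \
        "sierrapy.json" \
        "${sample}_report.html" \
        "kg/sample.csv"
    do
        if [ ! -s "${dir}/${sample}/${rel}" ]; then
            echo "missing or empty: ${dir}/${sample}/${rel}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage sketch:
# check_outputs results_test test_01 && echo "core outputs present"
```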


9. Prepare clinical CSVs for your own samples

Viral load CSV

Required columns: sample_id, date, vl (copies/mL).
Optional column: log (log₁₀ VL — computed by the report renderer if absent).

sample_id,date,vl
sample_01,2024-03-01,245000
sample_01,2024-06-15,8700
sample_01,2024-09-20,320
sample_02,2024-04-10,150000
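
The optional log column is just log₁₀ of vl. The sketch below validates the three required columns and appends that column; add_log10_vl is a hypothetical helper, and the two-decimal rounding is an assumption. The report renderer computes the value itself when the column is absent, so this step is illustrative only.

```shell
# Hypothetical helper (not part of ViroWatch): check the header of a
# three-column viral load CSV and append a log10(vl) column.
add_log10_vl() {
    awk -F',' '
        NR == 1 {
            if (NF != 3 || $1 != "sample_id" || $2 != "date" || $3 != "vl") {
                print "unexpected header: " $0 > "/dev/stderr"
                exit 1
            }
            print $0 ",log"
            next
        }
        $3 + 0 <= 0 {
            print "line " NR ": vl must be a positive number" > "/dev/stderr"
            exit 1
        }
        # awk only has the natural log, so divide by log(10).
        { printf "%s,%s,%s,%.2f\n", $1, $2, $3, log($3) / log(10) }
    ' "$1"
}

# Usage sketch:
# add_log10_vl vl.csv > vl_with_log.csv
```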

CD4 count CSV

Required columns: sample_id, date, cd4_pct, cd4_count.

sample_id,date,cd4_pct,cd4_count
sample_01,2024-03-01,12,210
sample_01,2024-06-15,19,380
sample_01,2024-09-20,27,530
sample_02,2024-04-10,10,180

Samplesheet

One row per sample:

sample_id,fastq
sample_01,/data/ont/sample_01.fq.gz
sample_02,/data/ont/sample_02.fq.gz
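
Because the clinical CSVs and the samplesheet are joined on sample_id, a mismatch in IDs is an easy mistake to make. The following sketch lists IDs that appear in a clinical CSV but not in the samplesheet; unknown_sample_ids is a hypothetical helper and it assumes both files have a header row, sample_id in the first column, and no quoted commas.

```shell
# Hypothetical helper (not part of ViroWatch): print each sample_id that is
# present in a clinical CSV but missing from the samplesheet.
unknown_sample_ids() {
    local samplesheet="$1" clinical="$2"
    awk -F',' '
        FNR == 1 { next }                    # skip the header row of each file
        NR == FNR { known[$1] = 1; next }    # first file: remember samplesheet IDs
        !($1 in known) && !seen[$1]++ { print $1 }   # second file: report each unknown ID once
    ' "$samplesheet" "$clinical"
}

# Usage sketch:
# unknown_sample_ids samplesheet.csv vl.csv
# unknown_sample_ids samplesheet.csv cd4.csv
```

Empty output means every clinical record maps to a sample in the samplesheet.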

Full run with clinical data

nextflow run . -profile mamba \
    --input samplesheet.csv \
    --outdir ./results \
    --blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db \
    --vl_csv vl.csv \
    --cd4_csv cd4.csv

Site-specific config (recommended for repeated use)

Copy the template and fill it in once — no need to pass flags every run:

cp conf/site.config.template conf/site.config

// conf/site.config
params {
    blast_db = '/data/blast_dbs/LosAlamos_db/LosAlamos_db'
    vl_csv   = '/data/clinical/vl.csv'
    cd4_csv  = '/data/clinical/cd4.csv'
    outdir   = '/data/results/virowatch'
}
workDir = '/scratch/nextflow_work'

Then run:

nextflow run . -profile mamba -c conf/site.config --input samplesheet.csv

Alternatively, add the same block to ~/.nextflow/config — Nextflow loads it automatically on every run.


Hardware note — AVX2

Newer conda builds of Flye (≥ 2.9.6) and Racon (1.5.0) use AVX2 SIMD instructions and will crash with Illegal instruction (SIGILL) on CPUs that lack AVX2, such as Intel Xeon E5 v1/v2 (Sandy Bridge / Ivy Bridge) and other pre-Haswell processors.

The environment file pins flye=2.9.5 (the last pre-AVX2 build) automatically. For Racon, if you see SIGILL errors, apply this workaround after the conda environment is created:

# Check whether your CPU has AVX2
grep -c avx2 /proc/cpuinfo     # 0 = no AVX2; >0 = fine

# Replace racon binary with the non-AVX2 build (Linux x86_64 only)
VIROWATCH_ENV=$(micromamba env list | awk '/virowatch/{print $NF}')
wget -q https://conda.anaconda.org/bioconda/linux-64/racon-1.4.20-hd03093a_2.tar.bz2
tar xjf racon-1.4.20-hd03093a_2.tar.bz2 -C "${VIROWATCH_ENV}/" bin/racon bin/rampler
rm racon-1.4.20-hd03093a_2.tar.bz2

On Haswell (2013) and later no workaround is needed — you can also remove the flye=2.9.5 pin from envs/virowatch.yaml to get the latest Flye.