Pipeline Tutorial
This page walks through a complete first-time setup on a fresh Linux machine, from installing prerequisites through running ViroWatch on the bundled test sample with clinical data attached.
Tested on: Ubuntu 22.04
Nextflow requires Java 11 or later. OpenJDK 21 (LTS) is recommended.
sudo apt update
sudo apt install -y openjdk-21-jdk
# macOS (Homebrew) equivalent:
brew install openjdk@21
# Follow the brew post-install instructions to add java to PATH, e.g.:
echo 'export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version
# Expected: openjdk version "21.x.x" ...
Nextflow is a single self-contained JAR distributed as a launcher script. No root required.
# Download the installer
curl -s https://get.nextflow.io | bash
# Move to a directory on your PATH (adjust if you prefer ~/bin)
sudo mv nextflow /usr/local/bin/
# Verify
nextflow -version
# Expected: nextflow version 24.x.x ...
Minimum version: ViroWatch requires Nextflow ≥ 23.10.0. If nextflow -version shows an older release, run nextflow self-update.
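The minimum-version check can be scripted. The sketch below compares two dotted version strings with GNU sort -V. The helper name version_ge is illustrative, and the awk pattern in the usage comment assumes that nextflow -version prints a line of the form "version 24.04.2 build ...".

```shell
# version_ge A B: succeed when version A >= version B (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes the banner contains a "version X.Y.Z build N" line):
#   installed="$(nextflow -version | awk '$1 == "version" {print $2}')"
#   version_ge "$installed" 23.10.0 || nextflow self-update
```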
ViroWatch resolves all tool environments automatically via conda/micromamba. Micromamba is faster and lighter than full Anaconda or Miniconda — it is the recommended option.
# Install micromamba to ~/micromamba (no root required)
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
The installer will:
- download the micromamba binary
- configure your shell (.bashrc/.zshrc) with the mamba initialisation block
- ask where to place the binary and base prefix (defaults are fine)
After installation, restart your shell or run:
source ~/.bashrc # or ~/.zshrc on macOS/zsh
micromamba --version
# Expected: 1.x.x or 2.x.x
Alternative managers: Plain conda (from Miniconda) and mamba work too; pass -profile mamba when running the pipeline. The tutorial below uses micromamba with -profile mamba.
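Before moving on, it can help to confirm that every prerequisite is on PATH. This is a minimal sketch: missing_tools is a hypothetical helper name, and the tool list in the usage comment simply mirrors the programs installed above.

```shell
# missing_tools NAME...: print each NAME that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: an empty result means everything was found.
#   missing_tools java nextflow micromamba git
```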
git clone https://github.com/STTLab/ViroWatch.git
cd ViroWatch
The repository is self-contained: reference genomes, test reads, and the pre-built LosAlamos BLAST database are all bundled under assets/.
A pre-built BLAST database of 15,471 HIV-1 sequences (indexed + taxdb) is bundled as a compressed tarball. Extract it once to a persistent location — it does not need to live inside the repo.
# Create a directory for BLAST databases (adjust path to your server's convention)
mkdir -p /data/blast_dbs
# Extract
tar -xzf assets/blast/LosAlamos_db.tar.gz -C /data/blast_dbs/
After extraction you will have:
/data/blast_dbs/LosAlamos_db/
LosAlamos_db.nhr
LosAlamos_db.nin
LosAlamos_db.nsq
LosAlamos_db.ntf
LosAlamos_db.nto
taxdb.btd
taxdb.bti
...
Note the prefix path — you will pass it to --blast_db. The prefix is the base name without any extension:
--blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db
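The prefix rule can be expressed with shell parameter expansion. The sketch below strips the final extension from any one of the index files. blast_prefix is an illustrative name, and the approach assumes a single-volume database like the bundled one (multi-volume databases add a volume number before the extension).

```shell
# blast_prefix FILE: strip the final extension from a BLAST index file path.
blast_prefix() {
  printf '%s\n' "${1%.*}"
}

blast_prefix /data/blast_dbs/LosAlamos_db/LosAlamos_db.nhr
# prints /data/blast_dbs/LosAlamos_db/LosAlamos_db
```

If BLAST+ is installed, blastdbcmd -db PREFIX -info is a quick way to confirm that the prefix resolves to a readable database.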
| Subtype | Count | TaxID |
|---|---|---|
| HIV-1 M:B | 10,096 | 505185 |
| HIV-1 M:C | 2,402 | 505186 |
| HIV-1 M:CRF01_AE | 2,124 | 1345266 |
| HIV-1 M:CRF02_AG | 232 | 1287874 |
| HIV-1 M:D | 201 | 505226 |
| HIV-1 M:CRF07_BC | 183 | 1385609 |
| HIV-1 M:G | 101 | 505228 |
| HIV-1 (root/unclassified) | 115 | 11676 |
| HIV-1 M:F2 | 14 | 1392219 |
| HIV-1 M:A | 3 | 505184 |
CRF01_AE (13.7%) is well represented for Southeast Asian surveillance.
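The 13.7% figure follows from the table: 2,124 CRF01_AE sequences out of 15,471 in total. A small helper (pct is an illustrative name) reproduces it and can be reused for the other rows.

```shell
# pct COUNT TOTAL: COUNT as a percentage of TOTAL, one decimal place.
pct() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

pct 2124 15471   # CRF01_AE share of the bundled database: 13.7
```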
The second BLAST step (--core_nt_db) runs BLAST against the full NCBI core_nt nucleotide database. It is disabled by default because the database is large (~500 GB compressed). Skip this section for the test run — come back once you have a storage location.
# Requires BLAST+ tools; install via micromamba if not already present:
micromamba install -n base -c bioconda blast
# Download to a dedicated directory
mkdir -p /data/blast_dbs/core_nt
cd /data/blast_dbs/core_nt
# Use update_blastdb.pl to fetch and verify all volumes
update_blastdb.pl --decompress core_nt
See the NCBI BLAST database documentation for full details and alternate download methods (aws s3, rsync, GCP).
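Given the size of core_nt, it may be worth checking free disk space before running the download above. The following is a sketch only: has_free_gb is a hypothetical helper, and the 500 in the usage comment mirrors the approximate compressed size quoted earlier, so leave extra headroom for decompression.

```shell
# has_free_gb DIR GB: succeed when DIR's filesystem has at least GB GiB free.
has_free_gb() {
  local avail_kb
  # df -P keeps each filesystem on one line; column 4 is available 1K blocks.
  avail_kb="$(df -Pk "$1" | awk 'NR == 2 {print $4}')"
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage:
#   has_free_gb /data/blast_dbs 500 || echo "not enough space for core_nt" >&2
```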
The taxdb files must be in the same directory as (or discoverable from) the core_nt index files, and the BLASTDB environment variable must point to that directory:
# In your shell profile (e.g. ~/.bashrc)
export BLASTDB=/data/blast_dbs/core_nt
// Or in conf/site.config or ~/.nextflow/config (Nextflow env scope)
env.BLASTDB = '/data/blast_dbs/core_nt'
Pass the database prefix to the pipeline:
--core_nt_db /data/blast_dbs/core_nt/core_nt
When --kraken2_db is set, ViroWatch runs a Kraken2 classification of the filtered reads immediately after NanoStat, before assembly. This is a quick "what else is in my read set?" QC pass — it flags contamination or co-infection, and because it runs ahead of assembly it still produces useful output even when Flye yields no assembly. The step is skipped entirely when no DB is given.
Any standard pre-built Kraken2 database works. A small Standard-8 (≈8 GB, capped) index is a good starting point; the full Standard index is larger. Pre-built indexes are published at https://benlangmead.github.io/aws-indexes/k2.
# Example: fetch and unpack a pre-built index
mkdir -p /data/kraken2_dbs/k2_standard_08gb
cd /data/kraken2_dbs/k2_standard_08gb
curl -sSL -O https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
tar -xzf k2_standard_08gb_20240605.tar.gz
The database directory must contain hash.k2d, opts.k2d, and taxo.k2d. Point --kraken2_db at the directory:
--kraken2_db /data/kraken2_dbs/k2_standard_08gb
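A quick way to confirm that the unpacked directory is complete is to test for the three index files named above. check_kraken2_db is an illustrative helper, not part of ViroWatch or Kraken2.

```shell
# check_kraken2_db DIR: succeed when DIR holds the three Kraken2 index files.
check_kraken2_db() {
  local dir="$1" f status=0
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${dir}/${f}" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_kraken2_db /data/kraken2_dbs/k2_standard_08gb && echo "database looks complete"
```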
The per-taxon results are folded into a single JSON property on the knowledge-graph TaxonomicClassification node (taxa_json), sorted by abundance. An adaptive filter collapses the long tail of trace taxa into a single "Other" bucket so the summary stays readable:
| Parameter | Default | Effect |
|---|---|---|
| --kraken2_confidence | 0.0 | Kraken2 --confidence threshold |
| --kraken2_z_min | -1.0 | Robust (median/MAD) log-abundance z-score cutoff; taxa below fold into "Other". Lower keeps more; higher groups more aggressively |
| --kraken2_min_taxa | 3 | Below this taxa count the adaptive filter is skipped and all taxa are kept |
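To make the adaptive filter concrete, here is a rough sketch of a median/MAD z-score filter over tab-separated "taxon, read count" lines. It is not ViroWatch's implementation: the function name robust_filter, the input layout, and the formula z = (x - median) / (1.4826 * MAD) on log10 counts are all assumptions, and the pipeline's exact scaling may differ. Input order is preserved rather than re-sorted by abundance.

```shell
# robust_filter [Z_MIN] [MIN_TAXA]: read "taxon<TAB>count" lines on stdin,
# keep taxa whose robust z-score is >= Z_MIN, sum the rest into "Other".
# Assumed formula: z = (x - median) / (1.4826 * MAD), with x = log10(count).
robust_filter() {
  local z_min="${1:--1.0}" min_taxa="${2:-3}"
  awk -F'\t' -v zmin="$z_min" -v mintaxa="$min_taxa" '
    function median(a, n,    i, j, t, s) {
      for (i = 1; i <= n; i++) s[i] = a[i]
      for (i = 2; i <= n; i++) {            # insertion sort on a copy
        t = s[i]
        for (j = i - 1; j >= 1 && s[j] > t; j--) s[j + 1] = s[j]
        s[j + 1] = t
      }
      return (n % 2) ? s[(n + 1) / 2] : (s[n / 2] + s[n / 2 + 1]) / 2
    }
    $2 > 0 { n++; name[n] = $1; cnt[n] = $2; x[n] = log($2) / log(10) }
    END {
      if (n < mintaxa) {                    # too few taxa: keep everything
        for (i = 1; i <= n; i++) print name[i] "\t" cnt[i]
        exit
      }
      med = median(x, n)
      for (i = 1; i <= n; i++) { d[i] = x[i] - med; if (d[i] < 0) d[i] = -d[i] }
      mad = median(d, n)
      other = 0
      for (i = 1; i <= n; i++) {
        z = (mad > 0) ? (x[i] - med) / (1.4826 * mad) : 0
        if (z >= zmin) print name[i] "\t" cnt[i]; else other += cnt[i]
      }
      if (other > 0) print "Other\t" other
    }'
}
```

With counts of 1000, 900, 800, 700 and 1, the single-read taxon scores far below -1.0 and is folded into "Other", while the other four are kept.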
nextflow run . -profile test,mamba \
--kraken2_db /data/kraken2_dbs/k2_standard_08gb
Output lands in <sample>/kraken2/ (the raw report + output) and <sample>/kg/taxonomic_classification.csv (the QC summary for knowledge-graph import).
ViroWatch ships with:
- assets/test_data/sample_test.fq.gz: ONT reads from a CRF01_AE HIV-1 isolate
- report/example_vl.csv: viral load time-series for test_01
- report/example_cd4.csv: CD4 count time-series for test_01
The bundled test profile wires these together automatically. Run:
nextflow run . -profile test,mamba
This single command will:
- Resolve and create the virowatch conda environment on first run (~5–10 min)
- Resolve and create the isolated medaka environment (~5 min)
- Process sample_test.fq.gz through all pipeline steps (Kraken2 QC is skipped unless --kraken2_db is given)
- Attach the viral load and CD4 time-series to the HTML report
Pass --blast_db on top of the test profile to also run HIV-1 subtyping:
nextflow run . -profile test,mamba \
--blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db
If the run is interrupted (power loss, timeout, Ctrl-C), resume from the last completed step:
nextflow run . -profile test,mamba -resume
By default ViroWatch caps itself at 16 CPUs and 64 GB RAM. Override in conf/site.config or on the command line:
nextflow run . -profile test,mamba --max_cpus 8 --max_memory 32.GB
After a successful run, results land in results_test/test_01/:
results_test/
└── test_01/
├── nanostat/ # Read QC statistics
├── kraken2/ # Kraken2 read-set QC (only if --kraken2_db)
├── aln.bam # Reference-mapped reads
├── qualimap/ # Mapping QC report
├── flye/ # De novo assembly
│ └── assembly_info.txt # Contig metadata (length, coverage, circular)
├── racon_iter_1.fa # Polishing intermediate
├── racon_iter_2.fa
├── racon_iter_3.fa
├── medaka_consensus/
│ └── consensus.fasta # Final polished consensus
├── quast/ # Assembly QC vs CRF01_AE reference
├── sierrapy.json # Stanford HIVDB drug resistance
├── blast/
│ ├── los_alamos.blast.json # LosAlamos BLAST result (if --blast_db)
│ └── core_nt.blast.json # core_nt BLAST result (if --core_nt_db)
├── multiqc/ # Aggregated MultiQC HTML report
├── test_01_report.html # Per-sample surveillance report ← open this
└── kg/ # Neo4j-compatible CSVs
├── sample.csv
├── assembly.csv
├── biodata_files.csv
├── contigs.csv
├── stanford_alignments.csv
├── stanford_predictions.csv
├── mutations.csv
├── blast_hits.csv # only if BLAST was enabled
└── taxonomic_classification.csv # only if --kraken2_db was enabled
Open results_test/test_01/test_01_report.html in a browser to see the full surveillance report including the viral load and CD4 trend charts.
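For a quick command-line look at the assembly, the contig metadata in flye/assembly_info.txt can be summarised with awk. This sketch assumes Flye's usual tab-separated layout (a header line starting with #, then name, length, coverage and circular flag in the first four columns). longest_contig is an illustrative helper name.

```shell
# longest_contig FILE: print name, length, coverage and circular flag
# (tab-separated) for the longest contig in a Flye assembly_info.txt.
longest_contig() {
  awk -F'\t' '
    /^#/ { next }                            # skip the header line
    $2 + 0 > best { best = $2 + 0; row = $1 "\t" $2 "\t" $3 "\t" $4 }
    END { if (row != "") print row }
  ' "$1"
}

# Usage:
#   longest_contig results_test/test_01/flye/assembly_info.txt
```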
Viral load CSV (--vl_csv). Required columns: sample_id, date, vl (copies/mL).
Optional column: log (log₁₀ VL; computed by the report renderer if absent).
sample_id,date,vl
sample_01,2024-03-01,245000
sample_01,2024-06-15,8700
sample_01,2024-09-20,320
sample_02,2024-04-10,150000
CD4 CSV (--cd4_csv). Required columns: sample_id, date, cd4_pct, cd4_count.
sample_id,date,cd4_pct,cd4_count
sample_01,2024-03-01,12,210
sample_01,2024-06-15,19,380
sample_01,2024-09-20,27,530
sample_02,2024-04-10,10,180
Samplesheet (--input). One row per sample:
sample_id,fastq
sample_01,/data/ont/sample_01.fq.gz
sample_02,/data/ont/sample_02.fq.gz
Run the pipeline with these files:
nextflow run . -profile mamba \
--input samplesheet.csv \
--outdir ./results \
--blast_db /data/blast_dbs/LosAlamos_db/LosAlamos_db \
--vl_csv vl.csv \
--cd4_csv cd4.csv
Copy the template and fill it in once; there is no need to pass the same flags on every run:
cp conf/site.config.template conf/site.config
// conf/site.config
params {
blast_db = '/data/blast_dbs/LosAlamos_db/LosAlamos_db'
vl_csv = '/data/clinical/vl.csv'
cd4_csv = '/data/clinical/cd4.csv'
outdir = '/data/results/virowatch'
}
workDir = '/scratch/nextflow_work'
Then run:
nextflow run . -profile mamba -c conf/site.config --input samplesheet.csv
Alternatively, add the same block to ~/.nextflow/config; Nextflow loads it automatically on every run.
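Before launching a run on your own data, the CSV headers described above can be sanity-checked from the shell. This is a minimal sketch: has_columns is a hypothetical helper, and it assumes plain comma-separated headers without quoting.

```shell
# has_columns FILE COL...: succeed if every COL appears in FILE's header row.
has_columns() {
  local file="$1"; shift
  local header col
  header="$(head -n 1 "$file" | tr -d '\r')"   # tolerate Windows line endings
  for col in "$@"; do
    case ",${header}," in
      *",${col},"*) ;;
      *) echo "missing column '${col}' in ${file}" >&2; return 1 ;;
    esac
  done
}

# Usage:
#   has_columns samplesheet.csv sample_id fastq
#   has_columns vl.csv sample_id date vl
#   has_columns cd4.csv sample_id date cd4_pct cd4_count
```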
Newer conda builds of Flye (≥ 2.9.6) and Racon (1.5.0) use AVX2 SIMD instructions and will crash with Illegal instruction (SIGILL) on CPUs that lack AVX2 (for example Intel Xeon E5 v1/v2).
The environment file pins flye=2.9.5 (the last pre-AVX2 build) automatically. For Racon, if you see SIGILL errors apply this workaround after the conda environment is created:
# Check whether your CPU has AVX2
grep -c avx2 /proc/cpuinfo # 0 = no AVX2; >0 = fine
# Replace racon binary with the non-AVX2 build (Linux x86_64 only)
VIROWATCH_ENV=$(micromamba env list | awk '/virowatch/{print $NF}')
wget -q https://conda.anaconda.org/bioconda/linux-64/racon-1.4.20-hd03093a_2.tar.bz2
# -C must come before the member names so they are extracted into the environment
tar xjf racon-1.4.20-hd03093a_2.tar.bz2 -C "${VIROWATCH_ENV}/" bin/racon bin/rampler
rm racon-1.4.20-hd03093a_2.tar.bz2
On Haswell (2013) and later no workaround is needed; you can also remove the flye=2.9.5 pin from envs/virowatch.yaml to get the latest Flye.
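The AVX2 check can be wrapped in a small function so that setup scripts apply the Racon workaround only where it is needed. has_avx2 is an illustrative name. The optional argument exists only so the check can be pointed at a saved copy of cpuinfo.

```shell
# has_avx2 [CPUINFO]: succeed when the CPU flags list contains avx2.
has_avx2() {
  grep -qw avx2 "${1:-/proc/cpuinfo}"
}

# Usage:
#   has_avx2 || echo "no AVX2: apply the Racon workaround" >&2
```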