ViroWatch is a Nextflow pipeline for HIV-1 genome surveillance from Oxford Nanopore reads. It takes per-sample FASTQ files through quality control, de novo assembly, consensus polishing, drug resistance analysis, and optional BLAST-based subtyping.
Designed for low-resource settings — portable, single conda environment, resume-capable.
Documentation:
flowchart LR
FASTQ([FASTQ]) --> DEDUP[seqkit rmdup]
DEDUP --> CHOP[chopper]
CHOP --> NS[NanoStat]
NS --> K2[Kraken2 read-set QC]:::opt
CHOP --> MM[minimap2] --> QM[qualimap]
CHOP --> FLYE[Flye]
FLYE --> RACON[Racon ×3]
RACON --> MEDAKA[Medaka]
MEDAKA --> QUAST[QUAST]
MEDAKA --> SP[SierraPy]
MEDAKA --> BLA[BLAST LosAlamos]:::opt
MEDAKA --> BNT[BLAST core_nt]:::opt
NS & QM & QUAST --> MQC[MultiQC]
MQC & SP & BLA & BNT --> RPT[/HTML report/]
MEDAKA & SP & BLA & BNT & K2 --> KG[(kg/ CSVs)]
classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5
- Nextflow ≥ 23.10.0
- Micromamba or Mamba (recommended; plain conda is slow on medaka's dependency tree)
Most tools are installed automatically into envs/virowatch.yaml on first run. Medaka runs in a separate isolated environment (envs/medaka.yaml) because its PyTorch/CUDA dependencies conflict with the main environment.
Both envs/virowatch.yaml and envs/medaka.yaml are fully pinned lockfiles (name=version=build), so they resolve deterministically regardless of upstream conda repodata drift. To change a tool, update the env and re-export with micromamba env export -n <env> > envs/<env>.yaml.
Newer builds of Flye (≥ 2.9.6) and Racon (1.5.0) are compiled with AVX2 instructions and will crash with Illegal instruction (SIGILL) on CPUs that pre-date Haswell (Intel Xeon E5 v1/v2, some Broadwell Xeons). The environment file already pins flye=2.9.5 to avoid this. Racon 1.5.0 has no non-AVX2 conda build; a workaround is to manually replace the binary after environment creation:
# Check whether your CPU supports AVX2
grep -c avx2 /proc/cpuinfo # 0 = no AVX2; >0 = fine, no workaround needed
# Workaround: replace racon binary with an AVX2-free build (Linux x86_64)
wget https://conda.anaconda.org/bioconda/linux-64/racon-1.4.20-hd03093a_2.tar.bz2
tar xjf racon-1.4.20-hd03093a_2.tar.bz2 -C $(conda env list | awk '/virowatch/{print $NF}') bin/racon bin/ramplerOn AVX2-capable hardware (Haswell and later) no action is required — remove the flye=2.9.5 pin in envs/virowatch.yaml to get the latest build.
# Test run with bundled data (CRF01_AE reference + test FASTQ)
nextflow run . -profile test
# Standard run
nextflow run . --input samplesheet.csv --outdir ./results
# With optional LosAlamos BLAST subtyping
nextflow run . --input samplesheet.csv --blast_db /path/to/LosAlamos_db
# With both BLAST DBs and clinical data
nextflow run . --input samplesheet.csv \
--blast_db /path/to/LosAlamos_db \
--core_nt_db /path/to/core_nt/core_nt \
--vl_csv vl.csv --cd4_csv cd4.csv
# Resume a failed run
nextflow run . --input samplesheet.csv -resumeProvide a CSV file with sample_id and fastq columns:
sample_id,fastq
sample_01,/path/to/sample_01.fq.gz
sample_02,/path/to/sample_02.fq.gz| Parameter | Default | Description |
|---|---|---|
--input |
(required) | Path to samplesheet CSV |
--outdir |
./results |
Output directory |
--ref_fa |
assets/refs/CRF01_AE.fa |
Reference FASTA for mapping QC |
--ref_gff |
assets/refs/CRF01_AE.gff |
GFF for qualimap/QUAST |
--medaka_model |
r1041_e82_400bps_sup_v5.2.0 |
Medaka model — must match basecalling model |
--blast_db |
null |
Path to pre-built LosAlamos BLAST DB (disabled if null) |
--core_nt_db |
null |
Path to NCBI core_nt DB; requires BLASTDB env var pointing to taxdb |
--kraken2_db |
null |
Path to a pre-built Kraken2 DB dir for read-set QC (disabled if null) |
--kraken2_confidence |
0.0 |
Kraken2 --confidence threshold |
--kraken2_z_min |
-1.0 |
Log-abundance z-score cutoff; taxa below fold into an "Other" bucket |
--kraken2_min_taxa |
3 |
Below this taxa count the adaptive z-score filter is skipped |
--vl_csv |
null |
Viral load history CSV (sample_id, date, vl columns) |
--cd4_csv |
null |
CD4 count history CSV (sample_id, date, cd4_pct, cd4_count columns) |
--chopper_q |
10 |
Minimum read quality score |
--chopper_minlen |
2000 |
Minimum read length (bp) |
--chopper_maxlen |
6000 |
Maximum read length (bp) |
Bundled references: CRF01_AE (default) and HXB2 — both in assets/refs/.
| File | Committed | Purpose |
|---|---|---|
nextflow.config |
yes | Pipeline defaults and profile stubs |
conf/base.config |
yes | Process resource caps and retry strategy |
conf/test.config |
yes | Test profile — bundled data, relaxed filters |
conf/site.config.template |
yes | Template for site-specific settings |
conf/site.config |
no (gitignored) | Filled-in site config — BLAST paths, workDir, etc. |
Machine-specific settings (BLAST DB paths, workDir, executor) go in conf/site.config, which is gitignored. Copy the template and fill it in:
cp conf/site.config.template conf/site.config
# edit conf/site.config, then:
nextflow run . --input samplesheet.csv -c conf/site.configAlternatively, add site params to ~/.nextflow/config — Nextflow auto-loads it on every run with no -c flag needed:
// ~/.nextflow/config
params {
blast_db = '/mnt/central/BLAST/LosAlamos_db'
core_nt_db = '/mnt/central/BLAST/core_nt/core_nt'
}
workDir = '/scratch/nextflow_work'Each sample produces results/<sample_id>/:
nanostat/ Read QC stats
kraken2/ Kraken2 read-set QC (report + output; if --kraken2_db)
aln.bam Reference-mapped reads
qualimap/ Mapping QC
flye/ De novo assembly
racon_iter_*.fa Polishing intermediates
medaka_consensus/ Final consensus FASTA
quast/ Assembly QC vs reference
sierrapy.json Stanford HIVDB drug resistance result
blast/los_alamos.blast.json LosAlamos BLAST results (if --blast_db)
blast/core_nt.blast.json core_nt BLAST results (if --core_nt_db)
multiqc/ Aggregated QC report
<sample_id>_report.html Per-sample HIV sequence analysis report
kg/ Neo4j-compatible CSVs for knowledge graph import (see below)
Each sample's kg/ directory contains flat CSVs ready for bulk import into a Neo4j knowledge graph:
| CSV file | Neo4j nodes created |
|---|---|
sample.csv |
Sample |
assembly.csv |
Assembly → linked to Sample |
biodata_files.csv |
BioDataFile (FASTQ input + consensus FASTA) → linked to Assembly |
contigs.csv |
Contig → linked to BioDataFile |
stanford_alignments.csv |
StanfordHIVDRAlignment → linked to Contig and Protein |
stanford_predictions.csv |
StanfordHIVDRPrediction, Drug, DrugClass → linked to Contig via Sample |
mutations.csv |
Mutation → linked to StanfordHIVDRAlignment |
blast_hits.csv |
ReferenceGenome, Organism (from BLAST hits) |
taxonomic_classification.csv |
ProcessRun:TaxonomicClassification (Kraken2 read-set QC; taxa in a taxa_json property) → linked to Sample and the FASTQ BioDataFile (if --kraken2_db) |
Contig IDs are namespaced {sample_id}:{flye_contig_name} (e.g. sample_01:contig_1) to remain globally unique across samples, since Flye always resets its numbering from contig_1.
Import into Neo4j using the bundled assets/virowatch_cypher_templates.csv (import via Neo4j Desktop → Saved Cypher → Import). Example query:
// example — load predictions for one sample
:auto LOAD CSV WITH HEADERS FROM 'file:///sample_01/kg/stanford_predictions.csv' AS rec
CALL { WITH rec
MERGE (dc:DrugClass {name: rec.drug_class})
MERGE (d:Drug {name: rec.drug_name})
MERGE (d)-[:IN_DRUG_CLASS]->(dc)
MERGE (pred:StanfordHIVDRPrediction {prediction_id: rec.prediction_id})
ON CREATE SET pred.contig_id = rec.contig_id, pred.sample_id = rec.sample_id,
pred.gene = rec.gene, pred.drug_name = rec.drug_name
SET pred.score = toInteger(rec.score), pred.level = toInteger(rec.level),
pred.interpretation = rec.interpretation
MERGE (pred)-[:PREDICTS_RESISTANCE_TO]->(d)
} IN TRANSACTIONS OF 500 ROWS;Two assets are bundled in assets/blast/:
Option A — pre-built DB (recommended): LosAlamos_db.tar.gz contains the BLAST index files and taxdb (66 MB). Extract to a persistent location and point --blast_db at it:
tar -xzf assets/blast/LosAlamos_db.tar.gz -C /path/to/blast_dbs/
nextflow run . --input samplesheet.csv --blast_db /path/to/blast_dbs/LosAlamos_dbOption B — rebuild from FASTA: LosAlamos_db.gz is the raw sequence file (20 MB compressed). Use this if you need to rebuild with custom taxid mappings:
gunzip -c assets/blast/LosAlamos_db.gz > LosAlamos_db.fa
makeblastdb -in LosAlamos_db.fa -dbtype nucl -out LosAlamos_db \
-taxid_map sequence_to_taxid.txt -parse_seqids15,471 HIV-1 sequences from the Los Alamos HIV Sequence Database, with taxids assigned per subtype:
| Count | TaxID | Subtype |
|---|---|---|
| 10,096 | 505185 | HIV-1 M:B |
| 2,402 | 505186 | HIV-1 M:C |
| 2,124 | 1345266 | HIV-1 M:CRF01_AE |
| 232 | 1287874 | HIV-1 M:CRF02_AG |
| 201 | 505226 | HIV-1 M:D |
| 183 | 1385609 | HIV-1 M:CRF07_BC |
| 101 | 505228 | HIV-1 M:G |
| 115 | 11676 | HIV-1 (root/unclassified) |
| 14 | 1392219 | HIV-1 M:F2 |
| 3 | 505184 | HIV-1 M:A |
CRF01_AE (13.7%) is well represented, relevant for Southeast Asian surveillance. Subtype B dominates (65.3%) reflecting the historical composition of the Los Alamos database.
| Step | Tool | Version |
|---|---|---|
| Deduplication | seqkit rmdup | 2.10.0 |
| Read filtering | chopper | 0.10.0 |
| Read QC | NanoStat | 1.6.0 |
| Read-set taxonomy QC | Kraken2 (optional) | 2.1.3 |
| Reference mapping | minimap2 | latest |
| Mapping QC | qualimap | 2.3 |
| De novo assembly | Flye (--meta) | 2.9.5 † |
| Polishing | Racon | 1.5.0 ‡ |
| Consensus | Medaka | 2.1.1 (separate env) |
| Assembly QC | QUAST | 5.3.0 |
| Drug resistance | SierraPy | 0.4.3 |
| Subtyping | BLAST+ | 2.16.0 |
| Aggregated QC | MultiQC | 1.28 |
| Report | Jinja2 / Python | 3.1.4 / 3.x |
† Pinned at 2.9.5 — the last build without AVX2; safe to upgrade on AVX2-capable hardware.
‡ bioconda 1.5.0 binary requires AVX2; see Hardware note for the workaround on older CPUs.
- Drug resistance: Stanford HIV Drug Resistance Database (via SierraPy)
- Subtyping: Los Alamos HIV Sequence Database
![]() |
|---|
| Figure 2: The structure of the Knowledge Graph in ViroWatch |
The Knowledge Graph spans three interconnected domains:
- Clinical terminology — standardized concepts using SNOMED CT (disorders, clinical findings, morphologic abnormalities).
- Patient and clinical metadata — patient records, specimens, and lab results (viral load, CD4+ counts).
- Microbiology and genomics — isolates, assemblies, and genetic variants linked to clinical data.
For installation, database setup, data loading, and surveillance query templates see the Knowledge graph Tutorial.
Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.
