ViroWatch

ViroWatch is a Nextflow pipeline for HIV-1 genome surveillance from Oxford Nanopore reads. It takes per-sample FASTQ files through quality control, de novo assembly, consensus polishing, drug resistance analysis, and optional BLAST-based subtyping.

Designed for low-resource settings — portable, single conda environment, resume-capable.

Documentation:

Pipeline overview

flowchart LR
    FASTQ([FASTQ]) --> DEDUP[seqkit rmdup]
    DEDUP --> CHOP[chopper]
    CHOP --> NS[NanoStat]
    NS --> K2[Kraken2 read-set QC]:::opt
    CHOP --> MM[minimap2] --> QM[qualimap]
    CHOP --> FLYE[Flye]
    FLYE --> RACON[Racon ×3]
    RACON --> MEDAKA[Medaka]
    MEDAKA --> QUAST[QUAST]
    MEDAKA --> SP[SierraPy]
    MEDAKA --> BLA[BLAST LosAlamos]:::opt
    MEDAKA --> BNT[BLAST core_nt]:::opt
    NS & QM & QUAST --> MQC[MultiQC]
    MQC & SP & BLA & BNT --> RPT[/HTML report/]
    MEDAKA & SP & BLA & BNT & K2 --> KG[(kg/ CSVs)]

    classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5

Requirements

Nextflow ≥ 23.10.0
Micromamba or Mamba (recommended; plain conda is slow on medaka's dependency tree)

Most tools are installed automatically into envs/virowatch.yaml on first run. Medaka runs in a separate isolated environment (envs/medaka.yaml) because its PyTorch/CUDA dependencies conflict with the main environment.

Both envs/virowatch.yaml and envs/medaka.yaml are fully pinned lockfiles (name=version=build), so they resolve deterministically regardless of upstream conda repodata drift. To change a tool, update the env and re-export with micromamba env export -n <env> > envs/<env>.yaml.

Hardware note — AVX2

Newer builds of Flye (≥ 2.9.6) and Racon (1.5.0) are compiled with AVX2 instructions and will crash with Illegal instruction (SIGILL) on CPUs that pre-date Haswell (Intel Xeon E5 v1/v2, some Broadwell Xeons). The environment file already pins flye=2.9.5 to avoid this. Racon 1.5.0 has no non-AVX2 conda build; a workaround is to manually replace the binary after environment creation:

# Check whether your CPU supports AVX2
grep -c avx2 /proc/cpuinfo     # 0 = no AVX2; >0 = fine, no workaround needed

# Workaround: replace racon binary with an AVX2-free build (Linux x86_64)
wget https://conda.anaconda.org/bioconda/linux-64/racon-1.4.20-hd03093a_2.tar.bz2
tar xjf racon-1.4.20-hd03093a_2.tar.bz2 -C $(conda env list | awk '/virowatch/{print $NF}') bin/racon bin/rampler

On AVX2-capable hardware (Haswell and later) no action is required — remove the flye=2.9.5 pin in envs/virowatch.yaml to get the latest build.

Quick start

# Test run with bundled data (CRF01_AE reference + test FASTQ)
nextflow run . -profile test

# Standard run
nextflow run . --input samplesheet.csv --outdir ./results

# With optional LosAlamos BLAST subtyping
nextflow run . --input samplesheet.csv --blast_db /path/to/LosAlamos_db

# With both BLAST DBs and clinical data
nextflow run . --input samplesheet.csv \
  --blast_db /path/to/LosAlamos_db \
  --core_nt_db /path/to/core_nt/core_nt \
  --vl_csv vl.csv --cd4_csv cd4.csv

# Resume a failed run
nextflow run . --input samplesheet.csv -resume

Samplesheet format

Provide a CSV file with sample_id and fastq columns:

sample_id,fastq
sample_01,/path/to/sample_01.fq.gz
sample_02,/path/to/sample_02.fq.gz

Parameters

Parameter	Default	Description
`--input`	(required)	Path to samplesheet CSV
`--outdir`	`./results`	Output directory
`--ref_fa`	`assets/refs/CRF01_AE.fa`	Reference FASTA for mapping QC
`--ref_gff`	`assets/refs/CRF01_AE.gff`	GFF for qualimap/QUAST
`--medaka_model`	`r1041_e82_400bps_sup_v5.2.0`	Medaka model — must match basecalling model
`--blast_db`	`null`	Path to pre-built LosAlamos BLAST DB (disabled if null)
`--core_nt_db`	`null`	Path to NCBI core_nt DB; requires `BLASTDB` env var pointing to taxdb
`--kraken2_db`	`null`	Path to a pre-built Kraken2 DB dir for read-set QC (disabled if null)
`--kraken2_confidence`	`0.0`	Kraken2 `--confidence` threshold
`--kraken2_z_min`	`-1.0`	Log-abundance z-score cutoff; taxa below fold into an "Other" bucket
`--kraken2_min_taxa`	`3`	Below this taxa count the adaptive z-score filter is skipped
`--vl_csv`	`null`	Viral load history CSV (`sample_id`, `date`, `vl` columns)
`--cd4_csv`	`null`	CD4 count history CSV (`sample_id`, `date`, `cd4_pct`, `cd4_count` columns)
`--chopper_q`	`10`	Minimum read quality score
`--chopper_minlen`	`2000`	Minimum read length (bp)
`--chopper_maxlen`	`6000`	Maximum read length (bp)

Bundled references: CRF01_AE (default) and HXB2 — both in assets/refs/.

Configuration layout

File	Committed	Purpose
`nextflow.config`	yes	Pipeline defaults and profile stubs
`conf/base.config`	yes	Process resource caps and retry strategy
`conf/test.config`	yes	Test profile — bundled data, relaxed filters
`conf/site.config.template`	yes	Template for site-specific settings
`conf/site.config`	no (gitignored)	Filled-in site config — BLAST paths, workDir, etc.

Machine-specific settings (BLAST DB paths, workDir, executor) go in conf/site.config, which is gitignored. Copy the template and fill it in:

cp conf/site.config.template conf/site.config
# edit conf/site.config, then:
nextflow run . --input samplesheet.csv -c conf/site.config

Alternatively, add site params to ~/.nextflow/config — Nextflow auto-loads it on every run with no -c flag needed:

// ~/.nextflow/config
params {
    blast_db   = '/mnt/central/BLAST/LosAlamos_db'
    core_nt_db = '/mnt/central/BLAST/core_nt/core_nt'
}
workDir = '/scratch/nextflow_work'

Output structure

Each sample produces results/<sample_id>/:

nanostat/                    Read QC stats
kraken2/                     Kraken2 read-set QC (report + output; if --kraken2_db)
aln.bam                      Reference-mapped reads
qualimap/                    Mapping QC
flye/                        De novo assembly
racon_iter_*.fa              Polishing intermediates
medaka_consensus/            Final consensus FASTA
quast/                       Assembly QC vs reference
sierrapy.json                Stanford HIVDB drug resistance result
blast/los_alamos.blast.json  LosAlamos BLAST results (if --blast_db)
blast/core_nt.blast.json     core_nt BLAST results (if --core_nt_db)
multiqc/                     Aggregated QC report
<sample_id>_report.html      Per-sample HIV sequence analysis report
kg/                          Neo4j-compatible CSVs for knowledge graph import (see below)

Knowledge graph export

Each sample's kg/ directory contains flat CSVs ready for bulk import into a Neo4j knowledge graph:

CSV file	Neo4j nodes created
`sample.csv`	`Sample`
`assembly.csv`	`Assembly` → linked to `Sample`
`biodata_files.csv`	`BioDataFile` (FASTQ input + consensus FASTA) → linked to `Assembly`
`contigs.csv`	`Contig` → linked to `BioDataFile`
`stanford_alignments.csv`	`StanfordHIVDRAlignment` → linked to `Contig` and `Protein`
`stanford_predictions.csv`	`StanfordHIVDRPrediction`, `Drug`, `DrugClass` → linked to `Contig` via `Sample`
`mutations.csv`	`Mutation` → linked to `StanfordHIVDRAlignment`
`blast_hits.csv`	`ReferenceGenome`, `Organism` (from BLAST hits)
`taxonomic_classification.csv`	`ProcessRun:TaxonomicClassification` (Kraken2 read-set QC; taxa in a `taxa_json` property) → linked to `Sample` and the FASTQ `BioDataFile` (if `--kraken2_db`)

Contig IDs are namespaced {sample_id}:{flye_contig_name} (e.g. sample_01:contig_1) to remain globally unique across samples, since Flye always resets its numbering from contig_1.

Import into Neo4j using the bundled assets/virowatch_cypher_templates.csv (import via Neo4j Desktop → Saved Cypher → Import). Example query:

// example — load predictions for one sample
:auto LOAD CSV WITH HEADERS FROM 'file:///sample_01/kg/stanford_predictions.csv' AS rec
CALL { WITH rec
    MERGE (dc:DrugClass {name: rec.drug_class})
    MERGE (d:Drug {name: rec.drug_name})
    MERGE (d)-[:IN_DRUG_CLASS]->(dc)
    MERGE (pred:StanfordHIVDRPrediction {prediction_id: rec.prediction_id})
      ON CREATE SET pred.contig_id = rec.contig_id, pred.sample_id = rec.sample_id,
                    pred.gene = rec.gene, pred.drug_name = rec.drug_name
    SET pred.score = toInteger(rec.score), pred.level = toInteger(rec.level),
        pred.interpretation = rec.interpretation
    MERGE (pred)-[:PREDICTS_RESISTANCE_TO]->(d)
} IN TRANSACTIONS OF 500 ROWS;

LosAlamos BLAST database setup

Two assets are bundled in assets/blast/:

Option A — pre-built DB (recommended): LosAlamos_db.tar.gz contains the BLAST index files and taxdb (66 MB). Extract to a persistent location and point --blast_db at it:

tar -xzf assets/blast/LosAlamos_db.tar.gz -C /path/to/blast_dbs/
nextflow run . --input samplesheet.csv --blast_db /path/to/blast_dbs/LosAlamos_db

Option B — rebuild from FASTA: LosAlamos_db.gz is the raw sequence file (20 MB compressed). Use this if you need to rebuild with custom taxid mappings:

gunzip -c assets/blast/LosAlamos_db.gz > LosAlamos_db.fa
makeblastdb -in LosAlamos_db.fa -dbtype nucl -out LosAlamos_db \
    -taxid_map sequence_to_taxid.txt -parse_seqids

Database contents

15,471 HIV-1 sequences from the Los Alamos HIV Sequence Database, with taxids assigned per subtype:

Count	TaxID	Subtype
10,096	505185	HIV-1 M:B
2,402	505186	HIV-1 M:C
2,124	1345266	HIV-1 M:CRF01_AE
232	1287874	HIV-1 M:CRF02_AG
201	505226	HIV-1 M:D
183	1385609	HIV-1 M:CRF07_BC
101	505228	HIV-1 M:G
115	11676	HIV-1 (root/unclassified)
14	1392219	HIV-1 M:F2
3	505184	HIV-1 M:A

CRF01_AE (13.7%) is well represented, relevant for Southeast Asian surveillance. Subtype B dominates (65.3%) reflecting the historical composition of the Los Alamos database.

Tools

Step	Tool	Version
Deduplication	seqkit rmdup	2.10.0
Read filtering	chopper	0.10.0
Read QC	NanoStat	1.6.0
Read-set taxonomy QC	Kraken2 (optional)	2.1.3
Reference mapping	minimap2	latest
Mapping QC	qualimap	2.3
De novo assembly	Flye (--meta)	2.9.5 †
Polishing	Racon	1.5.0 ‡
Consensus	Medaka	2.1.1 (separate env)
Assembly QC	QUAST	5.3.0
Drug resistance	SierraPy	0.4.3
Subtyping	BLAST+	2.16.0
Aggregated QC	MultiQC	1.28
Report	Jinja2 / Python	3.1.4 / 3.x

† Pinned at 2.9.5 — the last build without AVX2; safe to upgrade on AVX2-capable hardware.
‡ bioconda 1.5.0 binary requires AVX2; see Hardware note for the workaround on older CPUs.

Data sources

Drug resistance: Stanford HIV Drug Resistance Database (via SierraPy)
Subtyping: Los Alamos HIV Sequence Database

Knowledge graph (Neo4j)


Figure 2: The structure of the Knowledge Graph in ViroWatch

The Knowledge Graph spans three interconnected domains:

Clinical terminology — standardized concepts using SNOMED CT (disorders, clinical findings, morphologic abnormalities).
Patient and clinical metadata — patient records, specimens, and lab results (viral load, CD4+ counts).
Microbiology and genomics — isolates, assemblies, and genetic variants linked to clinical data.

For installation, database setup, data loading, and surveillance query templates see the Knowledge graph Tutorial.

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
README/Images		README/Images
assets		assets
bin		bin
conf		conf
envs		envs
report		report
wiki @ 56c94aa		wiki @ 56c94aa
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config
virowatch_assembly.sh		virowatch_assembly.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ViroWatch

Pipeline overview

Requirements

Hardware note — AVX2

Quick start

Samplesheet format

Parameters

Configuration layout

Output structure

Knowledge graph export

LosAlamos BLAST database setup

Database contents

Tools

Data sources

Knowledge graph (Neo4j)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

ViroWatch

Pipeline overview

Requirements

Hardware note — AVX2

Quick start

Samplesheet format

Parameters

Configuration layout

Output structure

Knowledge graph export

LosAlamos BLAST database setup

Database contents

Tools

Data sources

Knowledge graph (Neo4j)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages