Skip to content

STTLab/ViroWatch

Repository files navigation

ViroWatch

ViroWatch is a Nextflow pipeline for HIV-1 genome surveillance from Oxford Nanopore reads. It takes per-sample FASTQ files through quality control, de novo assembly, consensus polishing, drug resistance analysis, and optional BLAST-based subtyping.

Designed for low-resource settings — portable, single conda environment, resume-capable.

Documentation:

Pipeline overview

flowchart LR
    FASTQ([FASTQ]) --> DEDUP[seqkit rmdup]
    DEDUP --> CHOP[chopper]
    CHOP --> NS[NanoStat]
    NS --> K2[Kraken2 read-set QC]:::opt
    CHOP --> MM[minimap2] --> QM[qualimap]
    CHOP --> FLYE[Flye]
    FLYE --> RACON[Racon ×3]
    RACON --> MEDAKA[Medaka]
    MEDAKA --> QUAST[QUAST]
    MEDAKA --> SP[SierraPy]
    MEDAKA --> BLA[BLAST LosAlamos]:::opt
    MEDAKA --> BNT[BLAST core_nt]:::opt
    NS & QM & QUAST --> MQC[MultiQC]
    MQC & SP & BLA & BNT --> RPT[/HTML report/]
    MEDAKA & SP & BLA & BNT & K2 --> KG[(kg/ CSVs)]

    classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5
Loading

Requirements

Most tools are installed automatically into envs/virowatch.yaml on first run. Medaka runs in a separate isolated environment (envs/medaka.yaml) because its PyTorch/CUDA dependencies conflict with the main environment.

Both envs/virowatch.yaml and envs/medaka.yaml are fully pinned lockfiles (name=version=build), so they resolve deterministically regardless of upstream conda repodata drift. To change a tool, update the env and re-export with micromamba env export -n <env> > envs/<env>.yaml.

Hardware note — AVX2

Newer builds of Flye (≥ 2.9.6) and Racon (1.5.0) are compiled with AVX2 instructions and will crash with Illegal instruction (SIGILL) on CPUs that pre-date Haswell (Intel Xeon E5 v1/v2, some Broadwell Xeons). The environment file already pins flye=2.9.5 to avoid this. Racon 1.5.0 has no non-AVX2 conda build; a workaround is to manually replace the binary after environment creation:

# Check whether your CPU supports AVX2
grep -c avx2 /proc/cpuinfo     # 0 = no AVX2; >0 = fine, no workaround needed

# Workaround: replace racon binary with an AVX2-free build (Linux x86_64)
wget https://conda.anaconda.org/bioconda/linux-64/racon-1.4.20-hd03093a_2.tar.bz2
tar xjf racon-1.4.20-hd03093a_2.tar.bz2 -C $(conda env list | awk '/virowatch/{print $NF}') bin/racon bin/rampler

On AVX2-capable hardware (Haswell and later) no action is required — remove the flye=2.9.5 pin in envs/virowatch.yaml to get the latest build.

Quick start

# Test run with bundled data (CRF01_AE reference + test FASTQ)
nextflow run . -profile test

# Standard run
nextflow run . --input samplesheet.csv --outdir ./results

# With optional LosAlamos BLAST subtyping
nextflow run . --input samplesheet.csv --blast_db /path/to/LosAlamos_db

# With both BLAST DBs and clinical data
nextflow run . --input samplesheet.csv \
  --blast_db /path/to/LosAlamos_db \
  --core_nt_db /path/to/core_nt/core_nt \
  --vl_csv vl.csv --cd4_csv cd4.csv

# Resume a failed run
nextflow run . --input samplesheet.csv -resume

Samplesheet format

Provide a CSV file with sample_id and fastq columns:

sample_id,fastq
sample_01,/path/to/sample_01.fq.gz
sample_02,/path/to/sample_02.fq.gz

Parameters

Parameter Default Description
--input (required) Path to samplesheet CSV
--outdir ./results Output directory
--ref_fa assets/refs/CRF01_AE.fa Reference FASTA for mapping QC
--ref_gff assets/refs/CRF01_AE.gff GFF for qualimap/QUAST
--medaka_model r1041_e82_400bps_sup_v5.2.0 Medaka model — must match basecalling model
--blast_db null Path to pre-built LosAlamos BLAST DB (disabled if null)
--core_nt_db null Path to NCBI core_nt DB; requires BLASTDB env var pointing to taxdb
--kraken2_db null Path to a pre-built Kraken2 DB dir for read-set QC (disabled if null)
--kraken2_confidence 0.0 Kraken2 --confidence threshold
--kraken2_z_min -1.0 Log-abundance z-score cutoff; taxa below fold into an "Other" bucket
--kraken2_min_taxa 3 Below this taxa count the adaptive z-score filter is skipped
--vl_csv null Viral load history CSV (sample_id, date, vl columns)
--cd4_csv null CD4 count history CSV (sample_id, date, cd4_pct, cd4_count columns)
--chopper_q 10 Minimum read quality score
--chopper_minlen 2000 Minimum read length (bp)
--chopper_maxlen 6000 Maximum read length (bp)

Bundled references: CRF01_AE (default) and HXB2 — both in assets/refs/.

Configuration layout

File Committed Purpose
nextflow.config yes Pipeline defaults and profile stubs
conf/base.config yes Process resource caps and retry strategy
conf/test.config yes Test profile — bundled data, relaxed filters
conf/site.config.template yes Template for site-specific settings
conf/site.config no (gitignored) Filled-in site config — BLAST paths, workDir, etc.

Machine-specific settings (BLAST DB paths, workDir, executor) go in conf/site.config, which is gitignored. Copy the template and fill it in:

cp conf/site.config.template conf/site.config
# edit conf/site.config, then:
nextflow run . --input samplesheet.csv -c conf/site.config

Alternatively, add site params to ~/.nextflow/config — Nextflow auto-loads it on every run with no -c flag needed:

// ~/.nextflow/config
params {
    blast_db   = '/mnt/central/BLAST/LosAlamos_db'
    core_nt_db = '/mnt/central/BLAST/core_nt/core_nt'
}
workDir = '/scratch/nextflow_work'

Output structure

Each sample produces results/<sample_id>/:

nanostat/                    Read QC stats
kraken2/                     Kraken2 read-set QC (report + output; if --kraken2_db)
aln.bam                      Reference-mapped reads
qualimap/                    Mapping QC
flye/                        De novo assembly
racon_iter_*.fa              Polishing intermediates
medaka_consensus/            Final consensus FASTA
quast/                       Assembly QC vs reference
sierrapy.json                Stanford HIVDB drug resistance result
blast/los_alamos.blast.json  LosAlamos BLAST results (if --blast_db)
blast/core_nt.blast.json     core_nt BLAST results (if --core_nt_db)
multiqc/                     Aggregated QC report
<sample_id>_report.html      Per-sample HIV sequence analysis report
kg/                          Neo4j-compatible CSVs for knowledge graph import (see below)

Knowledge graph export

Each sample's kg/ directory contains flat CSVs ready for bulk import into a Neo4j knowledge graph:

CSV file Neo4j nodes created
sample.csv Sample
assembly.csv Assembly → linked to Sample
biodata_files.csv BioDataFile (FASTQ input + consensus FASTA) → linked to Assembly
contigs.csv Contig → linked to BioDataFile
stanford_alignments.csv StanfordHIVDRAlignment → linked to Contig and Protein
stanford_predictions.csv StanfordHIVDRPrediction, Drug, DrugClass → linked to Contig via Sample
mutations.csv Mutation → linked to StanfordHIVDRAlignment
blast_hits.csv ReferenceGenome, Organism (from BLAST hits)
taxonomic_classification.csv ProcessRun:TaxonomicClassification (Kraken2 read-set QC; taxa in a taxa_json property) → linked to Sample and the FASTQ BioDataFile (if --kraken2_db)

Contig IDs are namespaced {sample_id}:{flye_contig_name} (e.g. sample_01:contig_1) to remain globally unique across samples, since Flye always resets its numbering from contig_1.

Import into Neo4j using the bundled assets/virowatch_cypher_templates.csv (import via Neo4j Desktop → Saved Cypher → Import). Example query:

// example — load predictions for one sample
:auto LOAD CSV WITH HEADERS FROM 'file:///sample_01/kg/stanford_predictions.csv' AS rec
CALL { WITH rec
    MERGE (dc:DrugClass {name: rec.drug_class})
    MERGE (d:Drug {name: rec.drug_name})
    MERGE (d)-[:IN_DRUG_CLASS]->(dc)
    MERGE (pred:StanfordHIVDRPrediction {prediction_id: rec.prediction_id})
      ON CREATE SET pred.contig_id = rec.contig_id, pred.sample_id = rec.sample_id,
                    pred.gene = rec.gene, pred.drug_name = rec.drug_name
    SET pred.score = toInteger(rec.score), pred.level = toInteger(rec.level),
        pred.interpretation = rec.interpretation
    MERGE (pred)-[:PREDICTS_RESISTANCE_TO]->(d)
} IN TRANSACTIONS OF 500 ROWS;

LosAlamos BLAST database setup

Two assets are bundled in assets/blast/:

Option A — pre-built DB (recommended): LosAlamos_db.tar.gz contains the BLAST index files and taxdb (66 MB). Extract to a persistent location and point --blast_db at it:

tar -xzf assets/blast/LosAlamos_db.tar.gz -C /path/to/blast_dbs/
nextflow run . --input samplesheet.csv --blast_db /path/to/blast_dbs/LosAlamos_db

Option B — rebuild from FASTA: LosAlamos_db.gz is the raw sequence file (20 MB compressed). Use this if you need to rebuild with custom taxid mappings:

gunzip -c assets/blast/LosAlamos_db.gz > LosAlamos_db.fa
makeblastdb -in LosAlamos_db.fa -dbtype nucl -out LosAlamos_db \
    -taxid_map sequence_to_taxid.txt -parse_seqids

Database contents

15,471 HIV-1 sequences from the Los Alamos HIV Sequence Database, with taxids assigned per subtype:

Count TaxID Subtype
10,096 505185 HIV-1 M:B
2,402 505186 HIV-1 M:C
2,124 1345266 HIV-1 M:CRF01_AE
232 1287874 HIV-1 M:CRF02_AG
201 505226 HIV-1 M:D
183 1385609 HIV-1 M:CRF07_BC
101 505228 HIV-1 M:G
115 11676 HIV-1 (root/unclassified)
14 1392219 HIV-1 M:F2
3 505184 HIV-1 M:A

CRF01_AE (13.7%) is well represented, relevant for Southeast Asian surveillance. Subtype B dominates (65.3%) reflecting the historical composition of the Los Alamos database.

Tools

Step Tool Version
Deduplication seqkit rmdup 2.10.0
Read filtering chopper 0.10.0
Read QC NanoStat 1.6.0
Read-set taxonomy QC Kraken2 (optional) 2.1.3
Reference mapping minimap2 latest
Mapping QC qualimap 2.3
De novo assembly Flye (--meta) 2.9.5 †
Polishing Racon 1.5.0 ‡
Consensus Medaka 2.1.1 (separate env)
Assembly QC QUAST 5.3.0
Drug resistance SierraPy 0.4.3
Subtyping BLAST+ 2.16.0
Aggregated QC MultiQC 1.28
Report Jinja2 / Python 3.1.4 / 3.x

† Pinned at 2.9.5 — the last build without AVX2; safe to upgrade on AVX2-capable hardware.
‡ bioconda 1.5.0 binary requires AVX2; see Hardware note for the workaround on older CPUs.

Data sources


Knowledge graph (Neo4j)

Figure 2: An illustration of entities relationship pattern for managing bacterial whole genome sequencing data and all relevant information in ViroWatch.
Figure 2: The structure of the Knowledge Graph in ViroWatch

The Knowledge Graph spans three interconnected domains:

  1. Clinical terminology — standardized concepts using SNOMED CT (disorders, clinical findings, morphologic abnormalities).
  2. Patient and clinical metadata — patient records, specimens, and lab results (viral load, CD4+ counts).
  3. Microbiology and genomics — isolates, assemblies, and genetic variants linked to clinical data.

For installation, database setup, data loading, and surveillance query templates see the Knowledge graph Tutorial.

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.

About

ViroWatch: HIV-1 ONT genome surveillance pipeline

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors