Sara Wattanasombat edited this page Jun 28, 2026 · 1 revision

NosoGraph is a graph data model — plus a thin Nextflow assembly pipeline and a set of Neo4j Cypher templates — for integrating clinical, microbiological, and genomic data in a single queryable knowledge graph. It is aimed at hospital-associated (nosocomial) infection surveillance, antimicrobial resistance (AMR), and genomic epidemiology.

NosoGraph consists of two parts:

  1. The graph schema + Cypher templates — the core deliverable. A conceptual model (node labels, relationship types, three data domains) and an importable Neo4j Browser saved-queries file (assets/nosograph_cypher_templates.csv) that creates constraints, loads CSVs idempotently, and ships ready analytical queries.
  2. A bacterial-assembly pipeline — a Nextflow workflow (one sample per invocation) that takes long reads through assembly → polishing → QC and exports a per-sample kg/ CSV bundle ready for import into the graph.

NosoGraph is not a database management system. It defines how clinical, microbiology, and genomic data should be organised and linked, and gives you the templates to do it in Neo4j. Infrastructure (deployment, access control, app interfaces) is intentionally out of scope.
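The "create constraints, then load CSVs idempotently" pattern described above can be sketched as two small Cypher generators. This is a minimal illustration and not the content of assets/nosograph_cypher_templates.csv: the `sample_id` key, the constraint name, and the helper functions are assumptions, and the `REQUIRE` form assumes a recent Neo4j version.

```python
# Illustrative only: the label/key/file names passed in below are assumptions,
# not copied from assets/nosograph_cypher_templates.csv.


def constraint_cypher(label: str, key: str) -> str:
    """Return a uniqueness constraint statement that is safe to re-run."""
    name = f"{label.lower()}_{key}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{key} IS UNIQUE"
    )


def load_cypher(label: str, key: str, csv_file: str) -> str:
    """Return a LOAD CSV statement that MERGEs on the key.

    MERGE matches an existing node with the same key before creating one,
    so importing the same CSV twice updates nodes instead of duplicating them.
    """
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        f"SET n += row"
    )


if __name__ == "__main__":
    # Hypothetical key column; check the data dictionary for the real one.
    print(constraint_cypher("Sample", "sample_id"))
    print(load_cypher("Sample", "sample_id", "sample.csv"))
```

With Neo4j's default settings, a `file:///` URL resolves against the database's import directory, so the CSV would need to be placed there before the generated statement is run.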


Pipeline at a glance

The bacterial-assembly pipeline (default) for one sample:

flowchart TD
    LR([Long reads<br/>FASTQ]) --> ASM["Flye or Canu<br/>de novo assembly"]
    SR([Short reads<br/>R1/R2 — optional]) -.-> PILON

    ASM --> RACON["Racon ×N<br/>long-read polishing"]
    RACON --> PILON["Pilon ×N<br/>short-read polishing<br/>(hybrid only)"]
    PILON --> CHECKM2["CheckM2<br/>completeness / contamination"]

    PILON --> KG["kg_export.py<br/>Neo4j CSVs"]
    CHECKM2 --> KG
    ASM -. assembly_info.txt .-> KG

    KG --> CSVS[("kg/ CSVs:<br/>sample, assembly,<br/>biodata_files, contigs")]

    classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5
    class SR opt

An alternative consensus-assembly pipeline, Autocycler (--pipeline autocycler), is available for users who want multi-assembler consensus contigs.


The knowledge graph

NosoGraph organises everything into three interoperable domains, linked by relationships:

| Domain | What it holds | Example nodes |
| --- | --- | --- |
| Clinical terminology | Standardised concepts (SNOMED CT) for disorders, findings, morphology | terminology concepts |
| Patient & clinical metadata | Who the patient is, admissions, wards, specimens, lab results (MIC/AST) | Patient, Admission, Ward, Department, Specimen, LabResult, Antibiotic |
| Microbiology & genomics | Isolates and sequencing-derived entities | Sample, Assembly, BioDataFile, Contig, Organism, ReferenceGenome, Feature, Variant |

The three domains connect through the clinical–genomic spine:

Patient ← Specimen ← Sample → Assembly → BioDataFile → Contig

so you can ask, in a single query, "which patients carried isolates sharing this resistance variant?"
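To make the spine concrete, here is a minimal in-memory sketch that walks from a Contig back to its Patient. It only mirrors the node order shown above; the identifiers are invented, relationship types are left out because they are not named here, and a real query would be written in Cypher against Neo4j.

```python
from collections import defaultdict

# Directed edges following the spine:
#   Patient <- Specimen <- Sample -> Assembly -> BioDataFile -> Contig
# Every identifier below is invented for illustration.
EDGES = [
    ("Specimen:SP1", "Patient:P1"),
    ("Sample:S1", "Specimen:SP1"),
    ("Sample:S1", "Assembly:A1"),
    ("Assembly:A1", "BioDataFile:F1"),
    ("BioDataFile:F1", "Contig:C1"),
    ("Specimen:SP2", "Patient:P2"),
    ("Sample:S2", "Specimen:SP2"),
    ("Sample:S2", "Assembly:A2"),
    ("Assembly:A2", "BioDataFile:F2"),
    ("BioDataFile:F2", "Contig:C2"),
]


def neighbours(edges):
    """Build an undirected adjacency map so the spine can be walked both ways."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    return adj


def patients_for(node, edges=EDGES):
    """Collect every Patient node reachable from `node` along spine edges."""
    adj = neighbours(edges)
    seen, stack = {node}, [node]
    while stack:
        current = stack.pop()
        for nxt in adj[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(n for n in seen if n.startswith("Patient:"))


if __name__ == "__main__":
    print(patients_for("Contig:C1"))
```

The walk is deliberately limited to spine edges. Including shared nodes such as Ward or Organism would connect unrelated patients and make an undirected traversal like this one over-report.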


Key outputs per sample

The pipeline writes results to <outdir>/, with the knowledge-graph bundle scoped to <outdir>/<sample_id>/kg/:

| Path | Contents |
| --- | --- |
| 01_assembly/ | De novo assembly (assembly.contigs.fasta, and assembly_info.txt for Flye) |
| 02_polish/01_racon/ | Racon-polished intermediates |
| 02_polish/02_pilon/ | Pilon-polished assembly (hybrid runs) |
| 03_qc/ | CheckM2 completeness / contamination report |
| <sample_id>/kg/sample.csv | Sample node |
| <sample_id>/kg/assembly.csv | Assembly node (with CheckM2 metrics), linked to Sample |
| <sample_id>/kg/biodata_files.csv | BioDataFile nodes (input FASTQ + assembly FASTA) |
| <sample_id>/kg/contigs.csv | Contig nodes (FASTA joined with Flye assembly_info.txt) |

Variants and features (Variant, Feature) are loaded separately from a variant-calling CSV (e.g. Snippy SNPs.csv) — see the Knowledge graph Tutorial.
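Before importing a per-sample bundle, it can help to confirm that all four kg/ CSVs were written. The check below is a small sketch, not part of the pipeline: the file names and the <outdir>/<sample_id>/kg/ layout come from the table above, while the `missing_kg_files` helper is hypothetical.

```python
from pathlib import Path

# File names as listed in the outputs table for the per-sample kg/ bundle.
EXPECTED = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")


def missing_kg_files(outdir, sample_id):
    """Return the expected kg/ CSVs that are absent or empty for one sample."""
    kg_dir = Path(outdir) / sample_id / "kg"
    missing = []
    for name in EXPECTED:
        path = kg_dir / name
        # A zero-byte file is treated as missing, since it has no header row.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "results" and "sample01" are placeholder values.
    print(missing_kg_files("results", "sample01"))
```

An empty list means the bundle is complete enough to attempt the import. The check looks only at file presence and size, not at column contents.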


Bundled assets

| Path | Description |
| --- | --- |
| assets/nosograph_cypher_templates.csv | Neo4j Browser saved-queries file: SETUP / LOAD DATA / QUERIES / UTILITIES |
| example/csv/ | A complete synthetic clinical dataset (Departments, Wards, Patients, Admissions, Specimens, Samples, Organisms, Antibiotic, LabResults, ReferenceGenomes, SNPs) |
| example/csv/README.md | Data dictionary: per-table properties, types, required/optional, examples |
| conda/ | Environment files (blast, medaka, sierrapy, kg_export) |
| modules/vendor/bacterial-assembly/ | Assembly → polish → QC module (Flye, Canu, Racon, Pilon, CheckM2) |
| modules/vendor/autocycler/ | Alternative consensus-assembly pipeline |

Where to go next

  • Pipeline Tutorial — install prerequisites and run the assembly pipeline on a sample, from FASTQ to kg/ CSVs.
  • Knowledge graph Tutorial — install Neo4j Desktop, load the clinical CSVs and per-sample kg/ bundles, and run surveillance queries.

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.
