Home

NosoGraph is a graph data model — plus a thin Nextflow assembly pipeline and a set of Neo4j Cypher templates — for integrating clinical, microbiological, and genomic data in a single queryable knowledge graph. It is aimed at hospital-associated (nosocomial) infection surveillance, antimicrobial resistance (AMR), and genomic epidemiology.

NosoGraph consists of two parts:

The graph schema + Cypher templates — the core deliverable. A conceptual model (node labels, relationship types, three data domains) and an importable Neo4j Browser saved-queries file (assets/nosograph_cypher_templates.csv) that creates constraints, loads CSVs idempotently, and ships ready analytical queries.
A bacterial-assembly pipeline — a Nextflow workflow (one sample per invocation) that takes long reads through assembly → polishing → QC and exports a per-sample kg/ CSV bundle ready for import into the graph.
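To make "loads CSVs idempotently" concrete, here is a minimal plain-Python analogy: rows are upserted by a unique key, so running the same load twice leaves the store unchanged. In Neo4j this behaviour is commonly obtained with `MERGE` on a constrained key, but whether the bundled templates do exactly that is an assumption here. The `sample_id` property name and the row values are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable


def merge_rows(store: Dict[str, dict], rows: Iterable[dict], key: str) -> Dict[str, dict]:
    """Upsert rows into `store`, keyed by `key`.

    An existing node is updated in place and a missing one is created,
    so repeating the call with the same rows does not add duplicates.
    """
    for row in rows:
        node = store.setdefault(row[key], {})
        node.update(row)
    return store


# Hypothetical rows; the real column names are defined by the data dictionary.
rows = [
    {"sample_id": "S001", "organism": "Klebsiella pneumoniae"},
    {"sample_id": "S002", "organism": "Escherichia coli"},
]

graph: Dict[str, dict] = {}
merge_rows(graph, rows, key="sample_id")
first_pass = {k: dict(v) for k, v in graph.items()}

# Loading the same rows again must not change the result.
merge_rows(graph, rows, key="sample_id")
assert graph == first_pass
```

The point of keying on a unique identifier is that a load can be safely re-run after a partial failure or when a per-sample bundle is regenerated.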

NosoGraph is not a database management system. It defines how clinical, microbiology, and genomic data should be organised and linked, and gives you the templates to do it in Neo4j. Infrastructure (deployment, access control, app interfaces) is intentionally out of scope.

Pipeline at a glance

The bacterial-assembly pipeline (default) for one sample:

flowchart TD
    LR([Long reads<br/>FASTQ]) --> ASM["Flye or Canu<br/>de novo assembly"]
    SR([Short reads<br/>R1/R2 — optional]) -.-> PILON

    ASM --> RACON["Racon ×N<br/>long-read polishing"]
    RACON --> PILON["Pilon ×N<br/>short-read polishing<br/>(hybrid only)"]
    PILON --> CHECKM2["CheckM2<br/>completeness / contamination"]

    PILON --> KG["kg_export.py<br/>Neo4j CSVs"]
    CHECKM2 --> KG
    ASM -. assembly_info.txt .-> KG

    KG --> CSVS[("kg/ CSVs:<br/>sample, assembly,<br/>biodata_files, contigs")]

    classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5
    class SR opt

An alternative consensus-assembly pipeline, Autocycler (--pipeline autocycler), is available for users who want multi-assembler consensus contigs.
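The default flow in the diagram above can be summarised as an ordered list of stages. The sketch below is only a reading aid for the flowchart, not the pipeline's own code: the function name `planned_stages` and its parameters are hypothetical, and it assumes that Pilon is skipped when no short reads are supplied, as the "(hybrid only)" label suggests.

```python
from typing import List


def planned_stages(assembler: str = "flye", has_short_reads: bool = False) -> List[str]:
    """Return the stage order implied by the flowchart for one sample."""
    if assembler not in ("flye", "canu"):
        raise ValueError(f"unsupported assembler: {assembler!r}")

    stages = [assembler, "racon"]      # de novo assembly, then long-read polishing
    if has_short_reads:
        stages.append("pilon")         # short-read polishing, hybrid runs only
    stages += ["checkm2", "kg_export"] # QC, then Neo4j CSV export
    return stages


print(planned_stages())                            # long-read-only run
print(planned_stages("canu", has_short_reads=True))  # hybrid run
```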

The knowledge graph

NosoGraph organises everything into three interoperable domains, linked by relationships:

| Domain | What it holds | Example nodes |
| --- | --- | --- |
| Clinical terminology | Standardised concepts (SNOMED CT) for disorders, findings, morphology | terminology concepts |
| Patient & clinical metadata | Who the patient is, admissions, wards, specimens, lab results (MIC/AST) | `Patient`, `Admission`, `Ward`, `Department`, `Specimen`, `LabResult`, `Antibiotic` |
| Microbiology & genomics | Isolates and sequencing-derived entities | `Sample`, `Assembly`, `BioDataFile`, `Contig`, `Organism`, `ReferenceGenome`, `Feature`, `Variant` |

The three domains connect through the clinical–genomic spine:

Patient ← Specimen ← Sample → Assembly → BioDataFile → Contig

so you can ask, in a single query, "which patients carried isolates sharing this resistance variant?"
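As a rough illustration of why the spine matters, the sketch below walks it backwards in plain Python, from a contig to the patient it came from. It is an in-memory analogy, not a Cypher query: all IDs are hypothetical, each hop is modelled as a simple child-to-parent lookup, and the way `Variant` nodes attach to the spine is not covered on this page, so the walk starts at `Contig`.

```python
from typing import Optional

# One lookup per hop of the spine, read right to left:
# Contig -> BioDataFile -> Assembly -> Sample -> Specimen -> Patient
# All identifiers are hypothetical.
CONTIG_TO_FILE = {"contig_1": "file_A", "contig_2": "file_B"}
FILE_TO_ASSEMBLY = {"file_A": "asm_1", "file_B": "asm_2"}
ASSEMBLY_TO_SAMPLE = {"asm_1": "sample_1", "asm_2": "sample_2"}
SAMPLE_TO_SPECIMEN = {"sample_1": "spec_1", "sample_2": "spec_2"}
SPECIMEN_TO_PATIENT = {"spec_1": "patient_X", "spec_2": "patient_X"}

HOPS = [
    CONTIG_TO_FILE,
    FILE_TO_ASSEMBLY,
    ASSEMBLY_TO_SAMPLE,
    SAMPLE_TO_SPECIMEN,
    SPECIMEN_TO_PATIENT,
]


def patient_for_contig(contig_id: str) -> Optional[str]:
    """Follow the spine from a contig back to its patient, or None if a link is missing."""
    node = contig_id
    for hop in HOPS:
        node = hop.get(node)
        if node is None:
            return None
    return node


# Two contigs from different samples resolve to the same patient.
patients = {patient_for_contig(c) for c in ("contig_1", "contig_2")}
print(patients)
```

In the graph itself the same walk would be a single path pattern over the relationships in the schema, which is what makes cross-domain questions answerable in one query.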

Key outputs per sample

The pipeline writes results to <outdir>/, with the knowledge-graph bundle scoped to <outdir>/<sample_id>/kg/:

| Path | Contents |
| --- | --- |
| `01_assembly/` | De novo assembly (`assembly.contigs.fasta`, and `assembly_info.txt` for Flye) |
| `02_polish/01_racon/` | Racon-polished intermediates |
| `02_polish/02_pilon/` | Pilon-polished assembly (hybrid runs) |
| `03_qc/` | CheckM2 completeness / contamination report |
| `<sample_id>/kg/sample.csv` | `Sample` node |
| `<sample_id>/kg/assembly.csv` | `Assembly` node (with CheckM2 metrics), linked to `Sample` |
| `<sample_id>/kg/biodata_files.csv` | `BioDataFile` nodes (input FASTQ + assembly FASTA) |
| `<sample_id>/kg/contigs.csv` | `Contig` nodes (FASTA joined with Flye `assembly_info.txt`) |
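Before importing a bundle it can be useful to confirm that all four CSVs listed above are present. The helper below is a minimal sketch based only on the layout in the table; `missing_kg_files` is a hypothetical name and is not part of the pipeline, and the `results` and `S001` values in the usage line are placeholders.

```python
from pathlib import Path
from typing import List

# File names taken from the per-sample kg/ bundle described in the table above.
KG_FILES = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")


def missing_kg_files(outdir: str, sample_id: str) -> List[str]:
    """Return the expected kg/ CSVs that are absent under <outdir>/<sample_id>/kg/."""
    kg_dir = Path(outdir) / sample_id / "kg"
    return [name for name in KG_FILES if not (kg_dir / name).is_file()]


# Placeholder arguments; an empty list means the bundle is complete.
print(missing_kg_files("results", "S001"))
```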

Variants and features (Variant, Feature) are loaded separately from a variant-calling CSV (e.g. Snippy SNPs.csv) — see the Knowledge graph Tutorial.

Bundled assets

| Path | Description |
| --- | --- |
| `assets/nosograph_cypher_templates.csv` | Neo4j Browser saved-queries file: SETUP / LOAD DATA / QUERIES / UTILITIES |
| `example/csv/` | A complete synthetic clinical dataset (Departments, Wards, Patients, Admissions, Specimens, Samples, Organisms, Antibiotic, LabResults, ReferenceGenomes, SNPs) |
| `example/csv/README.md` | Data dictionary: per-table properties, types, required/optional, examples |
| `conda/` | Environment files (`blast`, `medaka`, `sierrapy`, `kg_export`) |
| `modules/vendor/bacterial-assembly/` | Assembly → polish → QC module (Flye, Canu, Racon, Pilon, CheckM2) |
| `modules/vendor/autocycler/` | Alternative consensus-assembly pipeline |

Where to go next

Pipeline Tutorial — install prerequisites and run the assembly pipeline on a sample, from FASTQ to kg/ CSVs.
Knowledge graph Tutorial — install Neo4j Desktop, load the clinical CSVs and per-sample kg/ bundles, and run surveillance queries.

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.
