Home
NosoGraph is a graph data model — plus a thin Nextflow assembly pipeline and a set of Neo4j Cypher templates — for integrating clinical, microbiological, and genomic data in a single queryable knowledge graph. It is aimed at hospital-associated (nosocomial) infection surveillance, antimicrobial resistance (AMR), and genomic epidemiology.
NosoGraph consists of two parts:
- The graph schema + Cypher templates: the core deliverable. A conceptual model (node labels, relationship types, three data domains) and an importable Neo4j Browser saved-queries file (`assets/nosograph_cypher_templates.csv`) that creates constraints, loads CSVs idempotently, and ships ready-to-run analytical queries.
- A bacterial-assembly pipeline: a Nextflow workflow (one sample per invocation) that takes long reads through assembly → polishing → QC and exports a per-sample `kg/` CSV bundle ready for import into the graph.
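
"Idempotent" loading means that re-running a load step leaves the graph unchanged instead of duplicating nodes. The templates file itself is not reproduced on this page, so the following is only a minimal Python sketch of the idea: it mimics a merge on a unique key using an in-memory dictionary. The `sample_id` key, the column names, and the two example rows are assumptions for illustration, not the project's actual schema.

```python
import csv
import io


def merge_rows(nodes, csv_text, key):
    """Upsert CSV rows into `nodes`, keyed on `key` (a merge on a unique property)."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Create the node if its key is new, otherwise update its properties in place.
        nodes.setdefault(row[key], {}).update(row)
    return nodes


# Hypothetical two-row sample table; real column names may differ.
SAMPLE_CSV = (
    "sample_id,organism\n"
    "S001,Klebsiella pneumoniae\n"
    "S002,Escherichia coli\n"
)

nodes = {}
merge_rows(nodes, SAMPLE_CSV, key="sample_id")
merge_rows(nodes, SAMPLE_CSV, key="sample_id")  # second load adds nothing new
print(len(nodes))  # 2
```

Because each row is matched on its key before anything is written, loading the same file twice yields the same two nodes.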
NosoGraph is not a database management system. It defines how clinical, microbiology, and genomic data should be organised and linked, and gives you the templates to do it in Neo4j. Infrastructure (deployment, access control, app interfaces) is intentionally out of scope.
The bacterial-assembly pipeline (default) for one sample:
```mermaid
flowchart TD
    LR([Long reads<br/>FASTQ]) --> ASM["Flye or Canu<br/>de novo assembly"]
    SR([Short reads<br/>R1/R2 — optional]) -.-> PILON
    ASM --> RACON["Racon ×N<br/>long-read polishing"]
    RACON --> PILON["Pilon ×N<br/>short-read polishing<br/>(hybrid only)"]
    PILON --> CHECKM2["CheckM2<br/>completeness / contamination"]
    PILON --> KG["kg_export.py<br/>Neo4j CSVs"]
    CHECKM2 --> KG
    ASM -. assembly_info.txt .-> KG
    KG --> CSVS[("kg/ CSVs:<br/>sample, assembly,<br/>biodata_files, contigs")]
    classDef opt fill:#f5f5f5,stroke:#aaa,stroke-dasharray:5 5
    class SR opt
```
An alternative consensus-assembly pipeline, Autocycler (`--pipeline autocycler`), is available for users who want multi-assembler consensus contigs.
NosoGraph organises everything into three interoperable domains, linked by relationships:
| Domain | What it holds | Example nodes |
|---|---|---|
| Clinical terminology | Standardised concepts (SNOMED CT) for disorders, findings, morphology | terminology concepts |
| Patient & clinical metadata | Who the patient is, admissions, wards, specimens, lab results (MIC/AST) | `Patient`, `Admission`, `Ward`, `Department`, `Specimen`, `LabResult`, `Antibiotic` |
| Microbiology & genomics | Isolates and sequencing-derived entities | `Sample`, `Assembly`, `BioDataFile`, `Contig`, `Organism`, `ReferenceGenome`, `Feature`, `Variant` |
The three domains connect through the clinical–genomic spine:
Patient ← Specimen ← Sample → Assembly → BioDataFile → Contig
so you can ask, in a single query, "which patients carried isolates sharing this resistance variant?"
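
To make that traversal concrete, here is a minimal in-memory Python sketch of the same question. It is not how NosoGraph runs queries (those are Cypher templates executed in Neo4j); the identifiers, the toy data, and the direct sample-to-variant mapping are assumptions chosen only to illustrate walking the spine from an isolate back to a patient.

```python
# Toy edges along the spine: Specimen -> Patient and Sample -> Specimen.
specimen_patient = {"SP1": "P1", "SP2": "P2", "SP3": "P3"}
sample_specimen = {"S1": "SP1", "S2": "SP2", "S3": "SP3"}

# Assumed shortcut: variants attached directly to samples. In the graph they
# would hang off sequencing-derived nodes further along the spine.
sample_variants = {
    "S1": {"gyrA:S83L"},
    "S2": {"gyrA:S83L", "parC:S80I"},
    "S3": set(),
}


def patients_with_variant(variant):
    """Walk Sample -> Specimen -> Patient for every sample carrying `variant`."""
    patients = set()
    for sample, variants in sample_variants.items():
        if variant in variants:
            patients.add(specimen_patient[sample_specimen[sample]])
    return sorted(patients)


print(patients_with_variant("gyrA:S83L"))  # ['P1', 'P2']
```

In the graph, the two dictionary lookups correspond to following relationships between nodes, which is what lets the whole question be expressed as a single pattern match.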
The pipeline writes results to `<outdir>/`, with the knowledge-graph bundle scoped to `<outdir>/<sample_id>/kg/`:
| Path | Contents |
|---|---|
| `01_assembly/` | De novo assembly (`assembly.contigs.fasta`, and `assembly_info.txt` for Flye) |
| `02_polish/01_racon/` | Racon-polished intermediates |
| `02_polish/02_pilon/` | Pilon-polished assembly (hybrid runs) |
| `03_qc/` | CheckM2 completeness / contamination report |
| `<sample_id>/kg/sample.csv` | `Sample` node |
| `<sample_id>/kg/assembly.csv` | `Assembly` node (with CheckM2 metrics), linked to `Sample` |
| `<sample_id>/kg/biodata_files.csv` | `BioDataFile` nodes (input FASTQ + assembly FASTA) |
| `<sample_id>/kg/contigs.csv` | `Contig` nodes (FASTA joined with Flye `assembly_info.txt`) |
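
The join behind `contigs.csv` can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `kg_export.py`: the output field names (`contig_id`, `coverage`, `circular`) are hypothetical, and the `#seq_name`, `cov.` and `circ.` headers follow Flye's `assembly_info.txt` layout as commonly documented, trimmed here to its first four columns. A left join is used so that contigs without an info row still produce a record.

```python
import csv
import io


def fasta_lengths(fasta_text):
    """Return {contig_name: sequence_length} from FASTA text."""
    lengths, name = {}, None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]  # first token of the header is the name
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line.strip())
    return lengths


def join_contigs(fasta_text, info_text):
    """Left-join FASTA contigs with assembly_info rows on contig name."""
    reader = csv.DictReader(io.StringIO(info_text), delimiter="\t")
    info = {row["#seq_name"]: row for row in reader}
    rows = []
    for name, length in fasta_lengths(fasta_text).items():
        extra = info.get(name, {})  # contigs with no info row keep None values
        rows.append({
            "contig_id": name,
            "length": length,
            "coverage": extra.get("cov."),
            "circular": extra.get("circ."),
        })
    return rows


# Toy inputs: two contigs, only the first has an assembly_info row.
FASTA = ">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n"
INFO = "#seq_name\tlength\tcov.\tcirc.\ncontig_1\t12\t85\tY\n"

for row in join_contigs(FASTA, INFO):
    print(row)
```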
Variants and features (Variant, Feature) are loaded separately from a variant-calling CSV (e.g. Snippy SNPs.csv) — see the Knowledge graph Tutorial.
| Path | Description |
|---|---|
| `assets/nosograph_cypher_templates.csv` | Neo4j Browser saved-queries file: SETUP / LOAD DATA / QUERIES / UTILITIES |
| `example/csv/` | A complete synthetic clinical dataset (Departments, Wards, Patients, Admissions, Specimens, Samples, Organisms, Antibiotic, LabResults, ReferenceGenomes, SNPs) |
| `example/csv/README.md` | Data dictionary: per-table properties, types, required/optional, examples |
| `conda/` | Environment files (blast, medaka, sierrapy, kg_export) |
| `modules/vendor/bacterial-assembly/` | Assembly → polish → QC module (Flye, Canu, Racon, Pilon, CheckM2) |
| `modules/vendor/autocycler/` | Alternative consensus-assembly pipeline |
- Pipeline Tutorial: install prerequisites and run the assembly pipeline on a sample, from FASTQ to `kg/` CSVs.
- Knowledge graph Tutorial: install Neo4j Desktop, load the clinical CSVs and per-sample `kg/` bundles, and run surveillance queries.
Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.