NosoGraph is a graph database schema designed for representing and integrating clinical, microbiological, and genomic data in a unified framework. Built on graph data modeling principles, it encodes entities such as patients, specimens, microorganisms, genes, and variants as nodes and their relationships as edges, enabling explicit and queryable connections across traditionally siloed datasets.
The schema is intended to support the construction of biomedical knowledge graphs, with a focus on infectious diseases and antimicrobial resistance (AMR). By structuring data around relationships rather than isolated records, NosoGraph enables intuitive exploration of complex questions—for example, linking patient context to microbial isolates, genomic variants, and resistance phenotypes within a single query.
This repository provides the core schema design, example data models, and practical resources for implementation using Neo4j, including installation guidance, sample CSV files for data import, and example Cypher queries. NosoGraph is designed to be extensible and adaptable to different research and hospital settings, supporting use cases such as outbreak investigation, genomic epidemiology, and integrated clinical-genomic analysis.
The graph is organized into three interoperable domains, each modeling a different layer of clinical–biological knowledge and linked through explicit relationships.
- Clinical terminology: This layer represents standardized clinical concepts using SNOMED CT, including disorders, clinical findings, situations, and morphologic abnormalities. SNOMED CT provides a controlled vocabulary for patient conditions, enabling consistent representation and disease grouping, and serves as a point of reference to external clinical data.
- Patient and clinical metadata: This layer stores patient metadata and care processes, including patient information, admissions, wards, specimens, and laboratory results (e.g., MICs, CBC). It captures who the patient is, when and why they were admitted, which specimens were collected, and which tests were performed, when, and with what results, forming the clinical context for downstream analyses.
- Microbiology and genomics layer: This layer represents the biological entities and analyses derived from patient specimens, including isolates, organisms, genome assemblies, genes, features, and variants identified through sequencing pipelines.
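To make the three-domain layout concrete, here is a minimal in-memory sketch in plain Python (no Neo4j involved) of nodes and typed edges spanning the layers. The labels Patient, Specimen, Sample, Assembly, and Contig follow the spine named later in this README, and Disorder stands in for a SNOMED CT concept. The relationship type names, identifiers, and properties are hypothetical placeholders, not the schema's actual names.

```python
from collections import defaultdict

# Nodes: id -> (label, properties). Ids and property values are made up for illustration.
nodes = {
    "pt1": ("Patient", {"sex": "F"}),
    "dx1": ("Disorder", {"term": "example disorder"}),  # clinical terminology layer
    "sp1": ("Specimen", {"type": "blood"}),             # patient and clinical metadata layer
    "sa1": ("Sample", {}),                              # microbiology and genomics layer
    "as1": ("Assembly", {"assembler": "flye"}),
    "ct1": ("Contig", {"length": 5_100_000}),
}

# Edges: (source, relationship type, target). The type names are hypothetical.
edges = [
    ("pt1", "DIAGNOSED_WITH", "dx1"),
    ("pt1", "HAS_SPECIMEN", "sp1"),
    ("sp1", "YIELDED", "sa1"),
    ("sa1", "ASSEMBLED_INTO", "as1"),
    ("as1", "HAS_CONTIG", "ct1"),
]

# Adjacency list of outgoing edges per node.
out = defaultdict(list)
for src, rel, dst in edges:
    out[src].append((rel, dst))


def reachable(start, label):
    """Return ids of nodes carrying `label` that are reachable from `start`."""
    seen, stack, hits = {start}, [start], []
    while stack:
        current = stack.pop()
        for _rel, nxt in out[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
                if nodes[nxt][0] == label:
                    hits.append(nxt)
    return hits
```

Calling `reachable("pt1", "Contig")` walks from the clinical layer down to the genomic layer in one traversal, which is the kind of cross-domain question the schema is meant to answer with a single query.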
First, clone the repository:

    git clone https://github.com/STTLab/NosoGraph.git
    cd NosoGraph

The NosoGraph sequencing pipeline supports hybrid and long-read bacterial genome assembly (Flye, Canu), iterative polishing (Racon, Pilon), and quality assessment (CheckM2). It is implemented in Nextflow DSL2, which manages Conda environments automatically and supports execution on local machines and HPC clusters (SLURM).
- Nextflow ≥ 23.04
- Conda or Micromamba
Install Nextflow:

    curl -s https://get.nextflow.io | bash
    mv nextflow ~/bin/

Conda environments are created automatically by Nextflow on first run; no manual setup is required. The assembly → polish → QC pipeline is vendored from bioinformatics-workflows under modules/vendor/bacterial-assembly/, which owns its own conda env; additional envs live in the top-level conda/ directory.
| Environment file | Purpose |
|---|---|
| modules/vendor/bacterial-assembly/conda/bacterial-assembly.yaml | Assembly, polishing, and QC tools (Flye, Canu, Racon, Pilon, BWA-mem2, SAMtools, CheckM2) |
| conda/blast.yaml | BLAST and sequence comparison tools |
| conda/medaka.yaml | Medaka neural-network polishing |
| conda/kg_export.yaml | Knowledge-graph CSV exporter (report/kg_export.py; Python + pandas) |
| conda/meta_kg_export.yaml | Metagenomics knowledge-graph CSV exporter (report/meta_kg_export.py; Python + pandas) |
| Parameter | Description | Default |
|---|---|---|
| --pipeline | Pipeline to run: bacterial-assembly, autocycler, or metagenomics | bacterial-assembly |
| --sample_id | Sample identifier; scopes outputs to <outdir>/<sample_id>/ and namespaces contig IDs | required |
| --long_reads | Long-read FASTQ (gzipped or uncompressed) | required |
| --kraken2_db | Kraken2 DB directory with hash.k2d/opts.k2d/taxo.k2d (required for --pipeline metagenomics) | — |
| --kraken2_mem | Memory request for Kraken2 (≈ DB size; raise for the full Standard DB) | 64 GB |
| --kraken2_z_min | taxa_json z-score cutoff; taxa below fold into an "Other" bucket | -1.0 |
| --kraken2_min_taxa | Keep all taxa (no bucketing) when fewer than this many | 3 |
| --read1 | Paired-end short reads R1 (required for Pilon) | — |
| --read2 | Paired-end short reads R2 (required for Pilon) | — |
| --assembler | Assembler: canu or flye | required |
| --tech | Sequencing technology: nanopore, nanopore-hq, or pacbio | required |
| --genome_size | Genome size (e.g. 5m, 2.6g); required for Canu | — |
| --outdir | Output directory | results |
| --threads | Threads per process | 1 |
| --racon_iter | Racon polishing iterations | 1 |
| --pilon_iter | Pilon polishing iterations | 1 |
| --checkm2_db | Path to CheckM2 database (uniref100.KO.1.dmnd) | — |
| --queue | SLURM partition name (-profile slurm only) | — |
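The conditional requirements in the table (Canu needs a genome size, Pilon needs both short-read mates, the metagenomics pipeline needs a Kraken2 database) can be checked before launching a run. The following is a hedged sketch of such a pre-flight check; `check_params` is a hypothetical helper that is not part of the repository, and it assumes that --assembler and --tech only apply to the bacterial-assembly pipeline, which the table does not state explicitly.

```python
ASSEMBLERS = {"canu", "flye"}
TECHS = {"nanopore", "nanopore-hq", "pacbio"}


def check_params(params):
    """Return a list of problems for a dict of pipeline parameters (empty list means OK)."""
    problems = []
    pipeline = params.get("pipeline", "bacterial-assembly")  # default from the table

    # Required for every pipeline according to the table.
    for key in ("sample_id", "long_reads"):
        if not params.get(key):
            problems.append(f"--{key} is required")

    if pipeline == "metagenomics" and not params.get("kraken2_db"):
        problems.append("--kraken2_db is required for --pipeline metagenomics")

    # Assumption: assembler-related checks only matter for bacterial-assembly.
    if pipeline == "bacterial-assembly":
        if params.get("assembler") not in ASSEMBLERS:
            problems.append("--assembler must be canu or flye")
        if params.get("tech") not in TECHS:
            problems.append("--tech must be nanopore, nanopore-hq, or pacbio")
        if params.get("assembler") == "canu" and not params.get("genome_size"):
            problems.append("--genome_size is required for Canu")
        # Pilon polishing needs both mates of the paired-end reads.
        if bool(params.get("read1")) != bool(params.get("read2")):
            problems.append("--read1 and --read2 must be given together for Pilon")
    return problems
```

For example, a Canu run without a genome size is reported as a single problem, while a complete Flye hybrid run yields an empty list.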
Hybrid assembly with Flye:

    nextflow run main.nf \
      --sample_id sample_01 \
      --long_reads reads.fastq.gz \
      --read1 sample_R1.fastq.gz \
      --read2 sample_R2.fastq.gz \
      --assembler flye \
      --tech nanopore \
      --outdir results \
      --threads 16 \
      --racon_iter 2 \
      --pilon_iter 2 \
      --checkm2_db /path/to/uniref100.KO.1.dmnd

This writes the assembly under results/01_assembly/ and a knowledge-graph CSV bundle to
results/sample_01/kg/ (sample.csv, assembly.csv, biodata_files.csv, contigs.csv) ready for
import into Neo4j (see NosoGraph knowledge graph).
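After a run finishes, it can be useful to confirm that the knowledge-graph bundle is complete before moving on to the Neo4j import. This is a small sketch under the assumption that the four file names listed above are the full bundle; `missing_kg_files` is a hypothetical helper, not something the pipeline ships.

```python
from pathlib import Path

# File names of the bacterial-assembly bundle as listed in this README.
EXPECTED_KG_FILES = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")


def missing_kg_files(outdir, sample_id):
    """Return the expected bundle files that are absent from <outdir>/<sample_id>/kg/."""
    kg_dir = Path(outdir) / sample_id / "kg"
    return [name for name in EXPECTED_KG_FILES if not (kg_dir / name).is_file()]
```

An empty list from `missing_kg_files("results", "sample_01")` means all four CSVs are in place.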
Long-read-only assembly with Canu:

    nextflow run main.nf \
      --sample_id sample_02 \
      --long_reads pacbio_reads.fastq.gz \
      --assembler canu \
      --tech pacbio \
      --genome_size 5m \
      --outdir results \
      --threads 32

The metagenomics pipeline classifies long reads against a pre-built Kraken2 database and
turns the result into a high-level, pathogen-ID knowledge graph in a single Nextflow
run. The vendored kraken2-classify module produces the
Kraken2 report, which is exported to the kg/ CSVs.

    nextflow run main.nf \
      --pipeline metagenomics \
      --sample_id sample_meta \
      --long_reads reads.fastq.gz \
      --kraken2_db /path/to/k2_standard \
      --outdir results

The export step reads the per-sample Kraken2 report (kraken2/<sample_id>.kraken2.report.txt, filtered to
species + genus) and writes a knowledge-graph CSV bundle to
results/sample_meta/kg/ (taxonomic_classification.csv, meta_reads.csv). The input
FASTQ given to --long_reads is also recorded as a BioDataFile node. The resulting subgraph
(a public NosoGraph extension built on the generic ProcessRun pattern) is:

    graph LR
      S[Sample] -->|CLASSIFIED_IN| TC["ProcessRun:TaxonomicClassification<br/>(taxa_json)"]
      TC -->|CLASSIFIED_FROM| F["BioDataFile {FASTQ}"]
The identified taxa are not modelled as Organism nodes: Kraken2 output is an untrusted,
per-run classification (produced before the curated DB is built), and it isn't meant to be
traversed in the graph. Instead they ride along as a single JSON-string property taxa_json
on the TaxonomicClassification node — a read-set QC glance, sorted by abundance and
pre-filtered with an adaptive z-score bucket (taxa below --kraken2_z_min fold into an
"Other" row; filtering is skipped when there are fewer than --kraken2_min_taxa taxa).
Recover rows in Neo4j Browser with apoc.convert.fromJsonList.
Import these CSVs with LOAD DATA steps 17–18, then run the QUERIES → Pathogens detected per
sample template (unpacks taxa_json; see NosoGraph knowledge graph).
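The bucketing behaviour described above can be illustrated with a short sketch. It makes several assumptions that this README does not confirm: z-scores are computed over raw read counts using the population standard deviation, and the JSON rows use the hypothetical keys `taxon` and `reads`. The actual exporter (report/meta_kg_export.py) may define the score and the row layout differently.

```python
import json
from statistics import mean, pstdev


def bucket_taxa(counts, z_min=-1.0, min_taxa=3):
    """Build a taxa_json-like string from a dict of taxon name -> read count.

    Rows are sorted by abundance. Taxa whose z-score falls below `z_min` are
    folded into one "Other" row, unless there are fewer than `min_taxa` taxa.
    """
    rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if len(rows) >= min_taxa:
        mu = mean(counts.values())
        sigma = pstdev(counts.values())
        kept, other = [], 0
        for name, reads in rows:
            # With zero spread every taxon is treated as average (z = 0).
            z = (reads - mu) / sigma if sigma > 0 else 0.0
            if z < z_min:
                other += reads
            else:
                kept.append((name, reads))
        rows = kept + ([("Other", other)] if other else [])
    return json.dumps([{"taxon": name, "reads": reads} for name, reads in rows])
```

Outside Neo4j, `json.loads` turns the string back into rows, which plays the same role as `apoc.convert.fromJsonList` in Neo4j Browser.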
Add -profile slurm to submit each process as an independent SLURM job. Specify your partition with --queue:

    nextflow run main.nf \
      -profile slurm \
      --queue normal \
      --sample_id sample_01 \
      --long_reads reads.fastq.gz \
      --assembler flye \
      --tech nanopore \
      --outdir results \
      --threads 16 \
      --racon_iter 2 \
      --pilon_iter 1 \
      --checkm2_db /path/to/uniref100.KO.1.dmnd

Default resource allocations per process label (adjustable in modules/vendor/bacterial-assembly/nextflow.config):
| Process | CPUs | Memory | Time |
|---|---|---|---|
| Assembly (Flye) | --threads | 32 GB | 24 h |
| Assembly (Canu) | --threads | 32 GB | 24 h |
| Racon iteration | --threads | 32 GB | 24 h |
| Pilon iteration | --threads | 28 GB | 12 h |
| CheckM2 | --threads | 32 GB | 12 h |
To resume a run after a failure:

    nextflow run main.nf -profile slurm -resume ...

Use -stub-run with -profile test to verify the full DAG compiles and all process connections are correct without needing real input files or conda environments:

    nextflow run main.nf -stub-run -profile test \
      --sample_id sample_01 \
      --assembler canu \
      --tech pacbio \
      --genome_size 5m \
      --long_reads dummy.fastq.gz \
      --read1 dummy_R1.fastq.gz \
      --read2 dummy_R2.fastq.gz \
      --racon_iter 2 \
      --pilon_iter 2 \
      --checkm2_db dummy.dmnd \
      --outdir /tmp/nf_test

Expected output:

    [PROCESS] BACTERIAL_ASSEMBLY:ASSEMBLY_CANU (1)
    [PROCESS] BACTERIAL_ASSEMBLY:RACON_POLISH (1)
    [PROCESS] BACTERIAL_ASSEMBLY:PILON_POLISH (1)
    [PROCESS] BACTERIAL_ASSEMBLY:CHECKM2 (1)
    [PROCESS] KG_EXPORT (sample_01)
    [SUCCESS] completed=5 failed=0 cached=0
The -profile test flag disables conda so the stub runs locally without any tools installed. Input file paths are not checked for existence in stub mode — any placeholder string works.
This repository provides:
- A conceptual schema defining node labels, relationship types, and data domains
- Example CSV files for data import (example/csv/)
- A single importable loader artefact, assets/nosograph_cypher_templates.csv: a Neo4j Browser saved-queries file with the constraints, idempotent LOAD CSV import queries, and example analytical queries
- Per-sample knowledge-graph exporters that the pipelines run to write kg/ CSVs to <outdir>/<sample_id>/kg/: report/kg_export.py (bacterial assembly) and report/meta_kg_export.py (metagenomics pathogen ID)
- Guidance for setting up Neo4j as a working environment
Users can adopt the schema as a starting point, extend it to fit their specific use cases, and integrate it with custom pipelines or applications as needed.
We recommend checking out the example directory to get started.
Note that NosoGraph is not a database management system (DBMS) and does not provide a complete software platform for data ingestion, storage, or analysis. Instead, it defines a blueprint: a structured conceptual model that guides how clinical, microbiological, and genomic data should be organized and linked within a graph database. The implementation of the underlying infrastructure (e.g., data pipelines, deployment environment, access control, and application interfaces) is intentionally outside the scope of this repository. Users are expected to adapt the schema to their own systems and integrate it with existing workflows or tools.
We recommend Neo4j because it offers an intuitive desktop interface that is easy for general users to pick up, along with a mature ecosystem for graph-based development.
[info] Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. “Neo4j” and related trademarks are the property of Neo4j, Inc. All references to Neo4j within this repository are for informational and implementation purposes only.
Download and install Neo4j Desktop from:
Follow instructions to download, install, and launch the application.
- Choose "Local instances" on the sidebar menu
- Click "Create instance"
- Fill instance details according to instructions.
- Set a database name (e.g., nosograph-db)
- Set a password and store it securely
- Click “Create”.
- Connect to the instance through "Query" or "Explore" menu
CSV files are loaded from the instance's import directory, found in the instances list on the
connection screen, e.g. Path: C:\Users\<username>\.Neo4jDesktop2\Data\dbmss\dbms-<instance-id>\import.
Copy both sources into that import directory:
- The hand-authored clinical CSVs from example/csv/ (Departments.csv, Wards.csv, Patients.csv, Admissions.csv, Antibiotic.csv, Specimens.csv, Samples.csv, Organisms.csv, ReferenceGenomes.csv, LabResults.csv, SNPs.csv): copy them to the import-directory root.
- For each sequenced sample, the pipeline-produced kg/ directory from <outdir>/<sample_id>/kg/: copy it so it sits at <import>/<sample_id>/kg/.
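The two copy steps above can be scripted. The sketch below is one possible way to do it with the Python standard library; `stage_for_import` is a hypothetical helper, and the paths passed to it are whatever your own instance and pipeline output directory happen to be.

```python
import shutil
from pathlib import Path


def stage_for_import(import_dir, clinical_csv_dir, outdir, sample_ids):
    """Copy clinical CSVs to the import root and each sample's kg/ to <import>/<sample_id>/kg/."""
    import_dir = Path(import_dir)

    # Hand-authored clinical CSVs go directly into the import-directory root.
    for csv_file in sorted(Path(clinical_csv_dir).glob("*.csv")):
        shutil.copy2(csv_file, import_dir / csv_file.name)

    # Pipeline-produced bundles keep their per-sample layout.
    for sample_id in sample_ids:
        source = Path(outdir) / sample_id / "kg"
        target = import_dir / sample_id / "kg"
        # dirs_exist_ok (Python 3.8+) lets the copy be repeated after a re-run.
        shutil.copytree(source, target, dirs_exist_ok=True)
```

A call such as `stage_for_import(import_path, "example/csv", "results", ["sample_01"])` produces the layout that the file:///<sample_id>/kg/... paths in the load queries expect.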
This repository ships no programmatic loader — the only loader artefact is the saved-queries file
assets/nosograph_cypher_templates.csv.
- In Neo4j Browser, open the Favorites sidebar → Import Cypher queries and select assets/nosograph_cypher_templates.csv. The queries appear under a NosoGraph folder (SETUP / LOAD DATA / QUERIES / UTILITIES).
- Run SETUP → Create Constraints once.
- Run the LOAD DATA queries in order. For the per-sample genomic loads (10–15, and 17–19 for metagenomics samples), replace <sample_id> in the file:///<sample_id>/kg/... paths with your actual sample id.
- Each load is idempotent (MERGE, IN TRANSACTIONS OF 500 ROWS), so re-running is safe.
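When many samples need loading, the <sample_id> substitution can be done programmatically before pasting a query into Neo4j Browser. The sketch below is illustrative only: `EXAMPLE_QUERY` is a made-up query in the MERGE plus IN TRANSACTIONS style described above (the `Contig` property name is hypothetical), not a copy of any template in assets/nosograph_cypher_templates.csv, and the restriction on characters in a sample id is an assumption added to keep the file path safe.

```python
import re

# Hypothetical query for illustration; the real templates may differ.
EXAMPLE_QUERY = (
    "LOAD CSV WITH HEADERS FROM 'file:///<sample_id>/kg/contigs.csv' AS row\n"
    "CALL { WITH row MERGE (c:Contig {contig_id: row.contig_id}) }\n"
    "IN TRANSACTIONS OF 500 ROWS"
)


def fill_sample_id(query, sample_id):
    """Replace every <sample_id> placeholder in a query with a concrete id."""
    # Assumption: ids contain only letters, digits, underscore, dot, and hyphen.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", sample_id):
        raise ValueError(f"unsafe sample id: {sample_id!r}")
    return query.replace("<sample_id>", sample_id)
```

Because MERGE matches existing nodes instead of creating duplicates, running the filled-in query a second time for the same sample leaves the graph unchanged.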
The node labels, properties, and relationships these templates create match the canonical NosoGraph graph schema, so a manual load yields the same structure as the interface library's ingest.
From the QUERIES folder (or the Query editor) you can:
- Run the example analytical queries (node counts, the Patient→Specimen→Sample→Assembly→Contig spine, AMR susceptibility summary, shared-contig clonality clusters, variants per gene)
- Visualize relationships interactively and expand nodes (double-click)
This work was supported by the following funding bodies:
- The Fundamental Fund 2025, Chiang Mai University, Chiang Mai, Thailand (Grant number: 214458).
- The Faculty of Medicine Research Fund, Chiang Mai University (Grant No. 099-2563)
- Support the Children Foundation, Chiang Mai, Thailand.
