Knowledge graph Tutorial

minaminii edited this page Jul 1, 2026 · 3 revisions

This page covers installing Neo4j Desktop, creating a NosoGraph database, loading both the hand-authored clinical CSVs and the per-sample kg/ CSVs produced by the pipeline, and running surveillance queries that span clinical, microbiology, and genomic data.

NosoGraph ships no programmatic loader — by design, import is a transparent, auditable Neo4j Browser workflow. The only loader artefact is the saved-queries file assets/nosograph_cypher_templates.csv.


1. Install Neo4j Desktop

Download the installer for your OS from the Neo4j Download Center.

| OS | Package |
|---|---|
| macOS | .dmg (drag Neo4j Desktop to /Applications) |
| Windows | .exe installer |
| Linux | .AppImage or .deb / .rpm |

Launch Neo4j Desktop and complete first-run activation (a free account or the activation key on the download page).


2. Create a database

  1. In the sidebar, choose Local instances → Create instance.
  2. Name it, e.g. nosograph-db.
  3. Set and securely store a password.
  4. Leave the Neo4j version at the default (5.x recommended).
  5. Click Create, then Start. The status turns green when ready.
  6. Open Query (Neo4j Browser) to run Cypher.

The default Bolt URI for a local instance is bolt://localhost:7687.


3. Import the Cypher templates

NosoGraph ships assets/nosograph_cypher_templates.csv — a saved-queries file with every constraint, load, and analytical query pre-written, grouped into four folders.

  1. In Neo4j Browser, open the Favorites / Saved Cypher sidebar (bookmark icon).
  2. Click Import and select assets/nosograph_cypher_templates.csv.
  3. Four folders appear under NosoGraph: SETUP, LOAD DATA, QUERIES, UTILITIES.

4. Create constraints (run once)

From the SETUP folder, run Create Constraints before loading any data. It enforces uniqueness and makes the MERGE-based loads fast and idempotent. It creates uniqueness/node-key constraints for: Ward, Patient, Admission, Specimen, Sample, LabResult, Antibiotic, Organism, ReferenceGenome, Assembly, BioDataFile, Contig, Feature, SequencingRun, VariantCallingRun, Variant, and TaxonomicClassification (the last is the metagenomics public extension, keyed on process_run_id).

Verify:

SHOW CONSTRAINTS;

5. Stage your data in the import directory

With file:/// URLs, LOAD CSV reads only from the instance's import directory (the Neo4j default). Find it on the instance's connection screen (e.g. …\.Neo4jDesktop2\Data\dbmss\dbms-<id>\import), or via ⋮ → Open folder → Import.

Copy two sources into it:

(a) The hand-authored clinical CSVs → import-directory root. Use the bundled example/csv/ files as templates (or your own, in the same shape):

<import>/
├── Departments.csv
├── Wards.csv
├── Patients.csv
├── Admissions.csv
├── Antibiotic.csv
├── Specimens.csv
├── Samples.csv
├── Organisms.csv
├── ReferenceGenomes.csv
├── LabResults.csv
└── SNPs.csv          # variant calls (e.g. Snippy output) for step 16

(b) Each sequenced sample's kg/ directory → under a matching <sample_id>/ subfolder:

mkdir -p <import>/BAC_S001    # create the sample subfolder first so kg/ is copied into it
cp -r results/BAC_S001/kg <import>/BAC_S001/

so it sits at <import>/BAC_S001/kg/{sample,assembly,biodata_files,contigs}.csv.
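
When several samples need staging, the copy step above can be scripted. The following is a minimal sketch in Python (3.9+), not part of NosoGraph. It assumes the results/<sample_id>/kg layout shown above, and the function name stage_kg_dirs is hypothetical. It only copies files; loading still happens in Neo4j Browser.

```python
import shutil
from pathlib import Path


def stage_kg_dirs(results_dir: Path, import_dir: Path) -> list[str]:
    """Copy every <results>/<sample_id>/kg directory to <import>/<sample_id>/kg.

    Returns the sample ids that were staged, in sorted order.
    """
    staged = []
    for kg_dir in sorted(results_dir.glob("*/kg")):
        if not kg_dir.is_dir():
            continue
        sample_id = kg_dir.parent.name
        target = import_dir / sample_id / "kg"
        # copytree creates missing parent folders; dirs_exist_ok allows re-runs.
        shutil.copytree(kg_dir, target, dirs_exist_ok=True)
        staged.append(sample_id)
    return staged
```

Called as stage_kg_dirs(Path("results"), Path("/path/to/import")), where both paths are placeholders for your own locations, it returns the list of staged sample ids.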

Field conventions (STTLab): booleans are lowercase true/false; empty values are "" (templates guard optional fields with IS NOT NULL AND <> ''); file identity uses SHA-256; contig IDs are namespaced {sample_id}:{contig_name}.
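
These conventions can be checked before loading. The sketch below is illustrative only: it assumes the CSV columns are named sha256 and contig_id (taken from the node property names, which may differ from the real headers), and check_conventions is a hypothetical helper, not a NosoGraph function.

```python
import re

# A SHA-256 digest is 64 hexadecimal characters.
SHA256_RE = re.compile(r"^[0-9a-fA-F]{64}$")


def check_conventions(row: dict, sample_id: str) -> list[str]:
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    for column, value in row.items():
        text = value or ""
        # Booleans must be lowercase true/false; flag True, FALSE, etc.
        if text.lower() in ("true", "false") and text not in ("true", "false"):
            problems.append(f"{column}: boolean {text!r} is not lowercase")
    sha = row.get("sha256") or ""  # assumed column name
    if sha and not SHA256_RE.match(sha):
        problems.append("sha256: not a 64-character hex digest")
    contig_id = row.get("contig_id") or ""  # assumed column name
    if contig_id:
        prefix, sep, contig_name = contig_id.partition(":")
        # Contig ids are namespaced as {sample_id}:{contig_name}.
        if prefix != sample_id or not sep or not contig_name:
            problems.append(f"contig_id: {contig_id!r} is not namespaced with {sample_id!r}")
    return problems
```

Rows read with csv.DictReader can be passed in directly. Empty strings are skipped, in line with the "" convention for missing values.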

See example/csv/README.md for the full data dictionary — every column's type, whether it is required, and an example value.


6. Load data (run in order)

Run the LOAD DATA queries in sequence. Steps 01–09 load the clinical/reference backbone from the root CSVs; steps 10–16 load the genomic data. For the per-sample steps (10–15), replace <sample_id> in the file:///<sample_id>/kg/... path with your actual sample id (e.g. BAC_S001). Every load uses MERGE, so re-running is safe, and commits in batches with IN TRANSACTIONS OF 500 ROWS to keep memory use bounded on large files.

| Step | Query | Creates | Key relationship |
|---|---|---|---|
| 01 | Departments | Department | |
| 02 | Wards | Ward | (Ward)-[:IN_DEPARTMENT]->(Department) |
| 03 | Patients | Patient | |
| 04 | Admissions | Admission | (Patient)-[:HAS_ADMISSION]->(Admission) |
| 05 | Antibiotics | Antibiotic | |
| 06 | Specimens | Specimen | (Specimen)-[:COLLECTED_FROM]->(Patient) |
| 07 | Organisms | Organism | |
| 08 | Reference Genomes | ReferenceGenome | (ReferenceGenome)-[:REFERENCE_GENOME_OF]->(Organism) |
| 09 | Antibiotic Susceptibility (MIC/AST) | LabResult | (Specimen)-[:TESTED_FOR]->(LabResult)-[:AGAINST]->(Antibiotic) |
| 10 | Sample (pipeline) | Sample | |
| 11 | Sample → Specimen link | | (Sample)-[:DERIVED_FROM]->(Specimen) |
| 12 | Assembly (pipeline) | Assembly | (Sample)-[:HAS_ASSEMBLY]->(Assembly) |
| 13 | BioDataFile FASTA | BioDataFile | (Assembly)-[:PRODUCE]->(BioDataFile) |
| 14 | BioDataFile FASTQ | BioDataFile | (Assembly)-[:ASSEMBLED_FROM]->(BioDataFile) |
| 15 | Contigs (pipeline) | Contig | (BioDataFile)-[:HAS_CONTIG]->(Contig) |
| 16 | Variants & Features (Snippy) | Variant, Feature, VariantCallingRun | (VariantCallingRun)-[:CALLED]->(Variant)-[:AFFECTS]->(Feature); (ReferenceGenome)-[:HAS_FEATURE]->(Feature) |
| 17 | Taxonomic Classification (metagenomics) | TaxonomicClassification | (Sample)-[:CLASSIFIED_IN]->(:ProcessRun:TaxonomicClassification) (taxa in tc.taxa_json) |
| 18 | BioDataFile FASTQ (metagenomics) | BioDataFile | (TaxonomicClassification)-[:CLASSIFIED_FROM]->(BioDataFile) |
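
To avoid typos when filling in the <sample_id> placeholder for the per-sample steps, the URLs can be generated and pasted into the saved queries. This is a small sketch under stated assumptions: the file names come from the kg/ layout in step 5, and kg_file_urls is a hypothetical helper rather than anything NosoGraph provides.

```python
def kg_file_urls(sample_id: str) -> dict[str, str]:
    """Build the file:/// URLs of one sample's kg/ CSVs for the load queries."""
    template = "file:///<sample_id>/kg/{name}.csv"
    names = ("sample", "assembly", "biodata_files", "contigs")
    return {
        # Fill in the file name first, then the sample id placeholder.
        name: template.format(name=name).replace("<sample_id>", sample_id)
        for name in names
    }
```

For example, kg_file_urls("BAC_S001")["contigs"] gives file:///BAC_S001/kg/contigs.csv, which resolves relative to the import directory.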

Step 11 stitches the genomic data onto the clinical backbone: it connects the pipeline's Sample to the clinical Specimen, so each Sample.sample_id must have a matching specimen_id mapping in your Samples.csv. In the bundled example, BAC_S001 maps to SP003.
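
A missing mapping leaves a Sample disconnected from the clinical data, so it can be worth checking before running step 11. The sketch below assumes Samples.csv has columns named sample_id and specimen_id (an assumption based on the property names; see example/csv/README.md for the real headers). unmapped_samples is a hypothetical helper.

```python
import csv
import io


def unmapped_samples(samples_csv_text: str, staged_sample_ids: list[str]) -> list[str]:
    """Return the staged sample ids with no specimen_id mapping in Samples.csv."""
    reader = csv.DictReader(io.StringIO(samples_csv_text))
    mapped = {
        row.get("sample_id")
        for row in reader
        # Empty values are "", so a blank specimen_id counts as unmapped.
        if (row.get("specimen_id") or "").strip()
    }
    return [sample_id for sample_id in staged_sample_ids if sample_id not in mapped]
```

Pass it the text of your Samples.csv and the ids of the samples whose kg/ folders you staged; an empty list means every staged sample can be linked.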

Metagenomics (steps 17–18) applies only to samples run through the metagenomics pipeline. These steps load the per-sample kg/ CSVs (taxonomic_classification.csv, meta_reads.csv); replace <sample_id> in the file:/// paths as usual. The :TaxonomicClassification ProcessRun subtype is a public NosoGraph extension (not in the core lib schema), built on the generic ProcessRun pattern and keyed on process_run_id.

The identified taxa are not modelled as Organism nodes. Kraken2 output is an untrusted, per-run classification (produced before the curated DB is built) and isn't a graph query target, so modelling it as nodes would pollute the curated Organism vocabulary. Instead, the taxa are stored as a single JSON-string property, taxa_json, on the TaxonomicClassification node (sorted by abundance, with an adaptive z-score "Other" bucket); unpack it with apoc.convert.fromJsonList. The CSV read by step 18 contains only a header row when the run had no --long_reads.


7. Template queries

All of these are saved in the QUERIES folder of the imported templates file.

7.1 Node & relationship counts

MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC;

7.2 The clinical–genomic spine (Patient → Specimen → Sample → Assembly → Contig)

MATCH (p:Patient)<-[:COLLECTED_FROM]-(sp:Specimen)<-[:DERIVED_FROM]-(s:Sample)
      -[:HAS_ASSEMBLY]->(a:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
RETURN p.patient_id AS patient, sp.specimen_id AS specimen, s.sample_id AS sample,
       a.assembler AS assembler, a.completeness AS completeness, count(c) AS contigs
ORDER BY patient;

7.3 AMR susceptibility summary (S/I/R)

MATCH (sp:Specimen)-[:TESTED_FOR]->(lr:LabResult:BacterialCulture)-[:AGAINST]->(ab:Antibiotic)
RETURN sp.specimen_id AS specimen, ab.class AS drug_class, ab.name AS antibiotic,
       lr.value AS mic, lr.unit AS unit, lr.interpretation AS sir
ORDER BY specimen, drug_class, antibiotic;

7.4 Shared-contig clusters (clonality)

Samples that share an identical contig sequence (by hash) — a quick clonality / transmission signal:

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
WITH c.sequence_hash AS sequence_hash, collect(DISTINCT s.sample_id) AS samples
WHERE size(samples) > 1
RETURN sequence_hash, samples, size(samples) AS n_samples
ORDER BY n_samples DESC;
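
The grouping logic of this query can be illustrated outside the graph. The sketch below is not how NosoGraph computes anything; it mirrors the query's three steps (collect distinct samples per hash, keep hashes seen in more than one sample, sort by sample count) on plain (sample_id, sequence_hash) pairs, with a hypothetical function name.

```python
from collections import defaultdict


def shared_contig_clusters(pairs):
    """Group (sample_id, sequence_hash) pairs into hashes shared by several samples."""
    samples_by_hash = defaultdict(set)
    for sample_id, sequence_hash in pairs:
        # A set gives the same effect as collect(DISTINCT s.sample_id).
        samples_by_hash[sequence_hash].add(sample_id)
    clusters = [
        (sequence_hash, sorted(samples))
        for sequence_hash, samples in samples_by_hash.items()
        if len(samples) > 1  # WHERE size(samples) > 1
    ]
    # ORDER BY n_samples DESC
    clusters.sort(key=lambda item: len(item[1]), reverse=True)
    return clusters
```

A hash that appears twice in the same sample does not form a cluster, because only distinct sample ids are counted.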

7.5 Variants per gene

MATCH (:VariantCallingRun)-[:CALLED]->(v:Variant)-[:AFFECTS]->(f:Feature)
RETURN f.locus_tag AS gene, count(v) AS variants,
       collect(DISTINCT v.IMPACT) AS impacts
ORDER BY variants DESC;

7.6 Pathogens detected per sample (metagenomics)

A read-set QC glance: unpack each sample's taxa_json blob (species S + genus G, plus an "Other" bucket) into rows, ranked by abundance. This needs the APOC plugin (bundled with Neo4j Desktop):

MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
WHERE tc.taxa_json IS NOT NULL AND tc.taxa_json <> ''
UNWIND apoc.convert.fromJsonList(tc.taxa_json) AS taxon
RETURN s.sample_id AS sample, taxon.sciname AS organism, taxon.taxid AS taxid,
       taxon.rank AS rank, taxon.read_count AS reads, taxon.abundance AS abundance
ORDER BY sample, abundance DESC;
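
If a taxa_json value is exported from the graph, the same unpacking can be done without APOC. This is a minimal sketch that assumes the blob is a JSON list of objects carrying the fields used in the query above (sciname, taxid, rank, read_count, abundance); unpack_taxa is a hypothetical name.

```python
import json


def unpack_taxa(taxa_json: str) -> list[dict]:
    """Parse a taxa_json string into rows ordered by abundance, highest first."""
    if not taxa_json:  # same guard as the query: skip NULL or empty strings
        return []
    taxa = json.loads(taxa_json)
    return sorted(taxa, key=lambda taxon: taxon["abundance"], reverse=True)
```

Each returned row is a plain dict, so row["sciname"] and row["read_count"] correspond to the organism and reads columns of the Cypher result.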

7.7 Schema visualization

CALL db.schema.visualization();

Run this in the Graph view to see the live node/relationship structure of your loaded database.


8. Reset (UTILITIES)

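To clear the database during development, run the query below. It removes all nodes and relationships but leaves the constraints in place. In Neo4j Browser, CALL { ... } IN TRANSACTIONS must run in an implicit transaction, so prefix the query with :auto if it is rejected: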

MATCH (n)
CALL { WITH n DETACH DELETE n }
IN TRANSACTIONS OF 1000 ROWS;

9. Graph schema reference

flowchart TD
    Department
    Ward -->|IN_DEPARTMENT| Department
    Patient -->|HAS_ADMISSION| Admission
    Specimen -->|COLLECTED_FROM| Patient
    Specimen -->|TESTED_FOR| LabResult
    LabResult -->|AGAINST| Antibiotic
    Sample -->|DERIVED_FROM| Specimen
    Sample -->|HAS_ASSEMBLY| Assembly
    Assembly -->|PRODUCE| BioDataFile
    Assembly -->|ASSEMBLED_FROM| BioDataFile
    BioDataFile -->|HAS_CONTIG| Contig
    ReferenceGenome -->|REFERENCE_GENOME_OF| Organism
    ReferenceGenome -->|HAS_FEATURE| Feature
    VariantCallingRun -->|CALLED| Variant
    Variant -->|AFFECTS| Feature
    Variant -->|AGAINST| ReferenceGenome
    Sample -->|CLASSIFIED_IN| TaxonomicClassification
    TaxonomicClassification -->|CLASSIFIED_FROM| BioDataFile

TaxonomicClassification (a :ProcessRun subtype) and its CLASSIFIED_IN / CLASSIFIED_FROM edges are the metagenomics public extension — present only when the metagenomics pipeline has been loaded. The identified taxa are carried as a taxa_json property on the node (not as Organism nodes), so there is no IDENTIFIED edge.

Node properties reference

| Label | Key property | Notable properties |
|---|---|---|
| Department | department_id | name, description |
| Ward | ward_id | name, ward_type, department_id |
| Patient | patient_id | firstname, lastname, sex, date_of_birth |
| Admission | admission_id | ward_id, date_of_admission, date_of_discharge, length_of_stay |
| Specimen | specimen_id | specimen_type, specimen_class, category, collection_date |
| Antibiotic | antibiotic_id | name, abbreviation, class |
| LabResult | lab_id | result_type, value, unit, interpretation (S/I/R), test_date |
| Organism | taxid | sciname, strain |
| ReferenceGenome | accession_no | name, molecular_type, strain, taxid |
| Sample | sample_id | |
| Assembly | assembly_id | assembler, completeness, contamination |
| BioDataFile | uri | file_type, compressed, sha256 |
| Contig | contig_id | length, coverage, sequence_hash |
| Feature | locus_tag | feature_type, biotype |
| Variant | (REF_ACC, hgvs_p) | POS, REF, ALT, EFFECT, IMPACT |
| TaxonomicClassification | process_run_id | process, tool, classified_reads, unclassified_reads, taxa_json (metagenomics extension) |

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.