Knowledge graph Tutorial

This page covers installing Neo4j Desktop, creating a NosoGraph database, loading both the hand-authored clinical CSVs and the per-sample kg/ CSVs produced by the pipeline, and running surveillance queries that span clinical, microbiology, and genomic data.

NosoGraph ships no programmatic loader — by design, import is a transparent, auditable Neo4j Browser workflow. The only loader artefact is the saved-queries file assets/nosograph_cypher_templates.csv.

1. Install Neo4j Desktop

Download the installer for your OS from the Neo4j Download Center.

OS	Package
macOS	`.dmg` — drag Neo4j Desktop to `/Applications`
Windows	`.exe` installer
Linux	`.AppImage` or `.deb` / `.rpm`

Launch Neo4j Desktop and complete first-run activation (a free account or the activation key on the download page).

2. Create a database

In the sidebar choose Local instances → Create instance.
Name it, e.g. nosograph-db.
Set and securely store a password.
Leave the Neo4j version at the default (5.x recommended).
Click Create, then Start. The status turns green when ready.
Open Query (Neo4j Browser) to run Cypher.

The default Bolt URI for a local instance is bolt://localhost:7687.

3. Import the Cypher templates

NosoGraph ships assets/nosograph_cypher_templates.csv — a saved-queries file with every constraint, load, and analytical query pre-written, grouped into four folders.

In Neo4j Browser, open the Favorites / Saved Cypher sidebar (bookmark icon).
Click Import and select assets/nosograph_cypher_templates.csv.
Four folders appear under NosoGraph: SETUP, LOAD DATA, QUERIES, UTILITIES.

4. Create constraints (run once)

From the SETUP folder, run Create Constraints before loading any data. It enforces uniqueness and makes the MERGE-based loads fast and idempotent. It creates uniqueness/node-key constraints for: Ward, Patient, Admission, Specimen, Sample, LabResult, Antibiotic, Organism, ReferenceGenome, Assembly, BioDataFile, Contig, Feature, SequencingRun, VariantCallingRun, Variant, and TaxonomicClassification (the last is the metagenomics public extension, keyed on process_run_id).

Verify:

SHOW CONSTRAINTS;
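
If the loads MERGE on the key property, as the constraint list suggests, two CSV rows that share a key land on one node instead of raising an error, which can hide a data-entry mistake. The following is a minimal pre-flight sketch in Python and is not part of NosoGraph (which ships no loader). The helper name `find_duplicate_keys` is hypothetical, and it assumes the CSV column carries the same name as the key property (e.g. `patient_id` in Patients.csv).

```python
import csv
from collections import Counter
from pathlib import Path


def find_duplicate_keys(csv_path, key_column):
    """Return {key: count} for key values that occur more than once."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if key_column not in (reader.fieldnames or []):
            raise KeyError(f"{key_column!r} is not a column of {csv_path}")
        counts = Counter(row[key_column] for row in reader)
    # Keep only the keys that appear on more than one row.
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means every key value in that file is unique.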

5. Stage your data in the import directory

LOAD CSV can only read files inside the instance's import directory. Find it on the instance's connection screen (e.g. …\.Neo4jDesktop2\Data\dbmss\dbms-<id>\import), or via ⋮ → Open folder → Import.

Copy two sources into it:

(a) The hand-authored clinical CSVs → import-directory root. Use the bundled example/csv/ files as templates (or your own, in the same shape):

<import>/
├── Departments.csv
├── Wards.csv
├── Patients.csv
├── Admissions.csv
├── Antibiotic.csv
├── Specimens.csv
├── Samples.csv
├── Organisms.csv
├── ReferenceGenomes.csv
├── LabResults.csv
└── SNPs.csv          # variant calls (e.g. Snippy output) for step 16

(b) Each sequenced sample's kg/ directory → under a matching <sample_id>/ subfolder:

mkdir -p <import>/BAC_S001
cp -r results/BAC_S001/kg <import>/BAC_S001/

so it sits at <import>/BAC_S001/kg/{sample,assembly,biodata_files,contigs}.csv.
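
For more than a handful of samples, the copy step can be scripted. This is a sketch only, assuming the `results/<sample_id>/kg` layout shown above; `stage_sample` and `EXPECTED_KG_FILES` are hypothetical names, not NosoGraph functions.

```python
import shutil
from pathlib import Path

# The four per-sample files named in this tutorial.
EXPECTED_KG_FILES = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")


def stage_sample(results_dir, import_dir, sample_id):
    """Copy results/<sample_id>/kg to <import>/<sample_id>/kg.

    Returns the names of expected CSVs that are missing after the copy.
    """
    source = Path(results_dir) / sample_id / "kg"
    if not source.is_dir():
        raise FileNotFoundError(f"no kg/ directory for {sample_id}: {source}")
    target = Path(import_dir) / sample_id / "kg"
    target.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets a re-run overwrite an earlier copy (Python 3.8+).
    shutil.copytree(source, target, dirs_exist_ok=True)
    return [name for name in EXPECTED_KG_FILES if not (target / name).is_file()]
```

Calling `stage_sample("results", "<import>", "BAC_S001")` with your real import path returns an empty list when all four files are in place.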

Field conventions (STTLab): booleans are lowercase true/false; empty values are "" (templates guard optional fields with IS NOT NULL AND <> ''); file identity uses SHA-256; contig IDs are namespaced {sample_id}:{contig_name}.
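
If you generate your own CSVs, the conventions above can be captured in a few small helpers. This is an illustrative sketch: the function names are hypothetical, and treating "file identity" as the hex SHA-256 digest of the file's bytes is an assumption.

```python
import hashlib


def csv_bool(flag):
    """Booleans are written as lowercase true/false."""
    return "true" if flag else "false"


def csv_optional(value):
    """Missing values are written as an empty string."""
    return "" if value is None else str(value)


def contig_id(sample_id, contig_name):
    """Contig IDs are namespaced as {sample_id}:{contig_name}."""
    return f"{sample_id}:{contig_name}"


def file_sha256(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```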

See example/csv/README.md for the full data dictionary — every column's type, whether it is required, and an example value.

6. Load data (run in order)

Run the LOAD DATA queries in sequence. Steps 01–09 load the clinical/reference backbone from the root CSVs; steps 10–16 load the genomic data; steps 17–18 load the optional metagenomics extension. For the per-sample steps (10–15), replace <sample_id> in the file:///<sample_id>/kg/... path with your actual sample id (e.g. BAC_S001). Every load uses MERGE, so re-running is safe, and commits in batches (IN TRANSACTIONS OF 500 ROWS).
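
When many samples need the same per-sample query, the placeholder substitution can be done on the query text before pasting it into Neo4j Browser. This is a sketch only: `fill_sample_id` is a hypothetical helper, and the allowed-character rule for sample ids is an assumption.

```python
import re

# Assumption: sample ids contain only letters, digits, '_', '-' and '.'.
_SAFE_ID = re.compile(r"[A-Za-z0-9_.-]+")


def fill_sample_id(cypher, sample_id):
    """Replace every <sample_id> placeholder in a saved query with a real id."""
    if not _SAFE_ID.fullmatch(sample_id):
        raise ValueError(f"unexpected characters in sample id: {sample_id!r}")
    if "<sample_id>" not in cypher:
        raise ValueError("query has no <sample_id> placeholder")
    return cypher.replace("<sample_id>", sample_id)
```

For example, `fill_sample_id("file:///<sample_id>/kg/contigs.csv", "BAC_S001")` yields `file:///BAC_S001/kg/contigs.csv`. Rejecting unexpected characters keeps a stray quote or slash from altering the path.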

Step	Query	Creates	Key relationship
01	Departments	`Department`	—
02	Wards	`Ward`	`(Ward)-[:IN_DEPARTMENT]->(Department)`
03	Patients	`Patient`	—
04	Admissions	`Admission`	`(Patient)-[:HAS_ADMISSION]->(Admission)`
05	Antibiotics	`Antibiotic`	—
06	Specimens	`Specimen`	`(Specimen)-[:COLLECTED_FROM]->(Patient)`
07	Organisms	`Organism`	—
08	Reference Genomes	`ReferenceGenome`	`(ReferenceGenome)-[:REFERENCE_GENOME_OF]->(Organism)`
09	Antibiotic Susceptibility (MIC/AST)	`LabResult`	`(Specimen)-[:TESTED_FOR]->(LabResult)-[:AGAINST]->(Antibiotic)`
10	Sample (pipeline)	`Sample`	—
11	Sample → Specimen link	—	`(Sample)-[:DERIVED_FROM]->(Specimen)`
12	Assembly (pipeline)	`Assembly`	`(Sample)-[:HAS_ASSEMBLY]->(Assembly)`
13	BioDataFile FASTA	`BioDataFile`	`(Assembly)-[:PRODUCE]->(BioDataFile)`
14	BioDataFile FASTQ	`BioDataFile`	`(Assembly)-[:ASSEMBLED_FROM]->(BioDataFile)`
15	Contigs (pipeline)	`Contig`	`(BioDataFile)-[:HAS_CONTIG]->(Contig)`
16	Variants & Features (Snippy)	`Variant`, `Feature`, `VariantCallingRun`	`(VariantCallingRun)-[:CALLED]->(Variant)-[:AFFECTS]->(Feature)`; `(ReferenceGenome)-[:HAS_FEATURE]->(Feature)`
17	Taxonomic Classification (metagenomics)	`TaxonomicClassification`	`(Sample)-[:CLASSIFIED_IN]->(:ProcessRun:TaxonomicClassification)` (taxa in `tc.taxa_json`)
18	BioDataFile FASTQ (metagenomics)	`BioDataFile`	`(TaxonomicClassification)-[:CLASSIFIED_FROM]->(BioDataFile)`

Step 11 is what stitches the genomic data onto the clinical backbone: it connects the pipeline's Sample to the clinical Specimen, so every Sample.sample_id must have a matching sample-to-specimen mapping in your Samples.csv. In the bundled example, BAC_S001 maps to SP003.
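
A staged sample with no mapping stays disconnected from the clinical backbone, so it can be worth checking before loading. The sketch below assumes Samples.csv has columns named `sample_id` and `specimen_id` (check example/csv/README.md for the real names); `unmapped_samples` is a hypothetical helper.

```python
import csv
from pathlib import Path


def unmapped_samples(import_dir, samples_csv="Samples.csv"):
    """Return staged sample ids that have no specimen_id mapping in Samples.csv."""
    import_dir = Path(import_dir)
    with (import_dir / samples_csv).open(newline="", encoding="utf-8") as handle:
        mapped = {
            row["sample_id"]
            for row in csv.DictReader(handle)
            if (row.get("specimen_id") or "").strip()
        }
    # A staged sample is any subfolder of the import directory with a kg/ directory.
    staged = sorted(p.parent.name for p in import_dir.glob("*/kg") if p.is_dir())
    return [sample_id for sample_id in staged if sample_id not in mapped]
```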

Metagenomics (steps 17–18) applies only to samples run through the metagenomics pipeline. These steps load the per-sample kg/ CSVs (taxonomic_classification.csv, meta_reads.csv); replace <sample_id> in the file:/// paths as usual. The :TaxonomicClassification ProcessRun subtype is a public NosoGraph extension (not in the core lib schema), built on the generic ProcessRun pattern and keyed on process_run_id.

The identified taxa are not modelled as Organism nodes. Kraken2 output is an untrusted, per-run classification (produced before the curated DB is built) and isn't a graph query target, so modelling it as nodes would pollute the curated Organism vocabulary. Instead, the taxa are stored as a single JSON-string property, taxa_json, on the TaxonomicClassification node (sorted by abundance, with an adaptive z-score "Other" bucket); unpack it with apoc.convert.fromJsonList.

Step 18's CSV is header-only when the run had no --long_reads.
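
To inspect a taxa_json value outside the database, the same unpacking can be sketched in Python. The `abundance` key is taken from the query in section 7.6; `unpack_taxa` is a hypothetical helper, and the assumption is that the string decodes to a JSON list of objects.

```python
import json


def unpack_taxa(taxa_json):
    """Return the taxon dicts in a taxa_json string, most abundant first."""
    if not taxa_json:  # mirrors the IS NOT NULL AND <> '' guard
        return []
    taxa = json.loads(taxa_json)
    if not isinstance(taxa, list):
        raise ValueError("taxa_json should decode to a JSON list")
    # Re-sort defensively; entries without an abundance sort last.
    return sorted(taxa, key=lambda taxon: taxon.get("abundance", 0), reverse=True)
```

Each returned dict corresponds to one `taxon` row produced by UNWIND in the Cypher version.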

7. Template queries

All of these are saved in the QUERIES folder of the imported templates file.

7.1 Node & relationship counts

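// Counts nodes per label (first label only for multi-label nodes); relationships are not counted here.
MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC;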

7.2 The clinical–genomic spine (Patient → Specimen → Sample → Assembly → Contig)

MATCH (p:Patient)<-[:COLLECTED_FROM]-(sp:Specimen)<-[:DERIVED_FROM]-(s:Sample)
      -[:HAS_ASSEMBLY]->(a:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
RETURN p.patient_id AS patient, sp.specimen_id AS specimen, s.sample_id AS sample,
       a.assembler AS assembler, a.completeness AS completeness, count(c) AS contigs
ORDER BY patient;

7.3 AMR susceptibility summary (S/I/R)

MATCH (sp:Specimen)-[:TESTED_FOR]->(lr:LabResult:BacterialCulture)-[:AGAINST]->(ab:Antibiotic)
RETURN sp.specimen_id AS specimen, ab.class AS drug_class, ab.name AS antibiotic,
       lr.value AS mic, lr.unit AS unit, lr.interpretation AS sir
ORDER BY specimen, drug_class, antibiotic;

7.4 Shared-contig clusters (clonality)

Samples that share an identical contig sequence (by hash) — a quick clonality / transmission signal:

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
WITH c.sequence_hash AS sequence_hash, collect(DISTINCT s.sample_id) AS samples
WHERE size(samples) > 1
RETURN sequence_hash, samples, size(samples) AS n_samples
ORDER BY n_samples DESC;
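
The same grouping can be sketched offline over rows read from the per-sample contigs.csv files. This assumes the CSV columns are named `contig_id` and `sequence_hash` like the node properties, and that sample ids contain no colon, so the sample id is the part of `contig_id` before the first `:`. `shared_contig_clusters` is a hypothetical helper.

```python
from collections import defaultdict


def shared_contig_clusters(contig_rows):
    """Group sample ids by sequence_hash; keep hashes seen in two or more samples."""
    samples_by_hash = defaultdict(set)
    for row in contig_rows:
        # contig_id is namespaced as {sample_id}:{contig_name}.
        sample_id, _, _ = row["contig_id"].partition(":")
        samples_by_hash[row["sequence_hash"]].add(sample_id)
    clusters = [
        (seq_hash, sorted(samples))
        for seq_hash, samples in samples_by_hash.items()
        if len(samples) > 1
    ]
    # Largest clusters first, as in ORDER BY n_samples DESC.
    return sorted(clusters, key=lambda item: len(item[1]), reverse=True)
```

Using a set per hash plays the role of collect(DISTINCT ...), so a sample with the same contig twice is counted once.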

7.5 Variants per gene

MATCH (:VariantCallingRun)-[:CALLED]->(v:Variant)-[:AFFECTS]->(f:Feature)
RETURN f.locus_tag AS gene, count(v) AS variants,
       collect(DISTINCT v.IMPACT) AS impacts
ORDER BY variants DESC;

7.6 Pathogens detected per sample (metagenomics)

A read-set QC glance: unpack each sample's taxa_json blob (species S + genus G, plus an "Other" bucket) into rows, ranked by abundance. This needs the APOC plugin (bundled with Neo4j Desktop):

MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
WHERE tc.taxa_json IS NOT NULL AND tc.taxa_json <> ''
UNWIND apoc.convert.fromJsonList(tc.taxa_json) AS taxon
RETURN s.sample_id AS sample, taxon.sciname AS organism, taxon.taxid AS taxid,
       taxon.rank AS rank, taxon.read_count AS reads, taxon.abundance AS abundance
ORDER BY sample, abundance DESC;

7.7 Schema visualization

CALL db.schema.visualization();

Run this in the Graph view to see the live node/relationship structure of your loaded database.

8. Reset (UTILITIES)

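To clear the database during development, run the following, which removes all nodes and relationships. As with the batched loads, if Neo4j Browser reports that CALL { ... } IN TRANSACTIONS requires an implicit transaction, prefix the statement with :auto.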

MATCH (n)
CALL { WITH n DETACH DELETE n }
IN TRANSACTIONS OF 1000 ROWS;

9. Graph schema reference

flowchart TD
    Department
    Ward -->|IN_DEPARTMENT| Department
    Patient -->|HAS_ADMISSION| Admission
    Specimen -->|COLLECTED_FROM| Patient
    Specimen -->|TESTED_FOR| LabResult
    LabResult -->|AGAINST| Antibiotic
    Sample -->|DERIVED_FROM| Specimen
    Sample -->|HAS_ASSEMBLY| Assembly
    Assembly -->|PRODUCE| BioDataFile
    Assembly -->|ASSEMBLED_FROM| BioDataFile
    BioDataFile -->|HAS_CONTIG| Contig
    ReferenceGenome -->|REFERENCE_GENOME_OF| Organism
    ReferenceGenome -->|HAS_FEATURE| Feature
    VariantCallingRun -->|CALLED| Variant
    Variant -->|AFFECTS| Feature
    Variant -->|AGAINST| ReferenceGenome
    Sample -->|CLASSIFIED_IN| TaxonomicClassification
    TaxonomicClassification -->|CLASSIFIED_FROM| BioDataFile

TaxonomicClassification (a :ProcessRun subtype) and its CLASSIFIED_IN / CLASSIFIED_FROM edges are the metagenomics public extension — present only when the metagenomics pipeline has been loaded. The identified taxa are carried as a taxa_json property on the node (not as Organism nodes), so there is no IDENTIFIED edge.

Node properties reference

Label	Key property	Notable properties
`Department`	`department_id`	`name`, `description`
`Ward`	`ward_id`	`name`, `ward_type`, `department_id`
`Patient`	`patient_id`	`firstname`, `lastname`, `sex`, `date_of_birth`
`Admission`	`admission_id`	`ward_id`, `date_of_admission`, `date_of_discharge`, `length_of_stay`
`Specimen`	`specimen_id`	`specimen_type`, `specimen_class`, `category`, `collection_date`
`Antibiotic`	`antibiotic_id`	`name`, `abbreviation`, `class`
`LabResult`	`lab_id`	`result_type`, `value`, `unit`, `interpretation` (S/I/R), `test_date`
`Organism`	`taxid`	`sciname`, `strain`
`ReferenceGenome`	`accession_no`	`name`, `molecular_type`, `strain`, `taxid`
`Sample`	`sample_id`	—
`Assembly`	`assembly_id`	`assembler`, `completeness`, `contamination`
`BioDataFile`	`uri`	`file_type`, `compressed`, `sha256`
`Contig`	`contig_id`	`length`, `coverage`, `sequence_hash`
`Feature`	`locus_tag`	`feature_type`, `biotype`
`Variant`	`(REF_ACC, hgvs_p)`	`POS`, `REF`, `ALT`, `EFFECT`, `IMPACT`
`TaxonomicClassification`	`process_run_id`	`process`, `tool`, `classified_reads`, `unclassified_reads`, `taxa_json` (metagenomics extension)

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.
