Knowledge graph Tutorial
This page covers installing Neo4j Desktop, creating a NosoGraph database, loading both the hand-authored clinical CSVs and the per-sample kg/ CSVs produced by the pipeline, and running surveillance queries that span clinical, microbiology, and genomic data.
NosoGraph ships no programmatic loader — by design, import is a transparent, auditable Neo4j Browser workflow. The only loader artefact is the saved-queries file assets/nosograph_cypher_templates.csv.
Download the installer for your OS from the Neo4j Download Center.
| OS | Package |
|---|---|
| macOS | .dmg (drag Neo4j Desktop to /Applications) |
| Windows | .exe installer |
| Linux | .AppImage or .deb / .rpm |
Launch Neo4j Desktop and complete first-run activation (a free account or the activation key on the download page).
- In the sidebar choose Local instances → Create instance.
- Name it, e.g. nosograph-db.
- Set and securely store a password.
- Leave the Neo4j version at the default (5.x recommended).
- Click Create, then Start. The status turns green when ready.
- Open Query (Neo4j Browser) to run Cypher.
The default Bolt URI for a local instance is bolt://localhost:7687.
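Before opening a client, a plain TCP probe can confirm that something is listening on that address. The sketch below is not part of NosoGraph; `parse_bolt_uri` and `bolt_port_open` are hypothetical helper names. A successful connection only shows that the port is open, not that the server is Neo4j or that your credentials are valid.

```python
import socket
from urllib.parse import urlsplit


def parse_bolt_uri(uri: str) -> tuple[str, int]:
    """Split a Bolt URI into (host, port); 7687 is used when no port is given."""
    parts = urlsplit(uri)
    if parts.scheme != "bolt":
        raise ValueError(f"expected a bolt:// URI, got {uri!r}")
    return parts.hostname or "localhost", parts.port or 7687


def bolt_port_open(uri: str = "bolt://localhost:7687", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the Bolt host and port succeeds."""
    host, port = parse_bolt_uri(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if __name__ == "__main__":
    print("Bolt port reachable:", bolt_port_open())
```

If the probe returns False while the instance shows as started, the instance may be using a non-default port; the connection screen shows the actual URI.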
NosoGraph ships assets/nosograph_cypher_templates.csv — a saved-queries file with every constraint, load, and analytical query pre-written, grouped into four folders.
- In Neo4j Browser, open the Favorites / Saved Cypher sidebar (bookmark icon).
- Click Import and select assets/nosograph_cypher_templates.csv.
- Four folders appear under NosoGraph: SETUP, LOAD DATA, QUERIES, UTILITIES.
From the SETUP folder, run Create Constraints before loading any data. It enforces uniqueness and makes the MERGE-based loads fast and idempotent. It creates uniqueness/node-key constraints for: Ward, Patient, Admission, Specimen, Sample, LabResult, Antibiotic, Organism, ReferenceGenome, Assembly, BioDataFile, Contig, Feature, SequencingRun, VariantCallingRun, Variant, and TaxonomicClassification (the last is the metagenomics public extension, keyed on process_run_id).
Verify:
SHOW CONSTRAINTS;

LOAD CSV can only read files inside the instance's import directory. Find it on the instance's connection screen (e.g. …\.Neo4jDesktop2\Data\dbmss\dbms-<id>\import), or via ⋮ → Open folder → Import.
Copy two sources into it:
(a) The hand-authored clinical CSVs → import-directory root. Use the bundled example/csv/ files as templates (or your own, in the same shape):
<import>/
├── Departments.csv
├── Wards.csv
├── Patients.csv
├── Admissions.csv
├── Antibiotic.csv
├── Specimens.csv
├── Samples.csv
├── Organisms.csv
├── ReferenceGenomes.csv
├── LabResults.csv
└── SNPs.csv # variant calls (e.g. Snippy output) for step 16
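(b) Each sequenced sample's kg/ directory → under a matching <sample_id>/ subfolder (create the subfolder first, since cp does not create the destination's parent):
mkdir -p <import>/BAC_S001
cp -r results/BAC_S001/kg <import>/BAC_S001/
so that it sits at <import>/BAC_S001/kg/{sample,assembly,biodata_files,contigs}.csv.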
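With many samples, the same copy can be scripted and checked. The following is a minimal sketch, not a NosoGraph tool: `stage_kg` is a hypothetical helper, and the expected file names are taken from the layout described above. It only copies files into the import directory; the load itself still happens through the saved queries.

```python
import shutil
from pathlib import Path

# File names taken from the kg/ layout described above.
EXPECTED_KG_FILES = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")


def stage_kg(results_dir: Path, import_dir: Path, sample_id: str) -> list[str]:
    """Copy results/<sample_id>/kg to <import>/<sample_id>/kg.

    Returns the expected CSV names that are still missing after the copy.
    """
    src = results_dir / sample_id / "kg"
    dst = import_dir / sample_id / "kg"
    dst.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) lets the copy be repeated after a pipeline re-run.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return [name for name in EXPECTED_KG_FILES if not (dst / name).is_file()]
```

An empty return value means all four files are in place for that sample; any names returned point at files the pipeline run did not produce.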
Field conventions (STTLab): booleans are lowercase true/false; empty values are "" (templates guard optional fields with IS NOT NULL AND <> ''); file identity uses SHA-256; contig IDs are namespaced {sample_id}:{contig_name}.
See example/csv/README.md for the full data dictionary — every column's type, whether it is required, and an example value.
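Hand-edited CSVs tend to drift from these conventions, so a pre-flight check before import can save a failed load. The sketch below checks three of them. It assumes the CSV columns are named contig_id, compressed, and sha256 (the property names listed in the node reference on this page) and that the SHA-256 value is a 64-character hex digest; the data dictionary is the authority if yours differ. The function names and the demo rows are made up for illustration.

```python
import csv
import io
import re

# Assumption: sha256 is stored as a 64-character hex digest.
SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def check_contig_ids(rows, sample_id):
    """Return messages for rows whose contig_id is not '<sample_id>:<name>'."""
    prefix = f"{sample_id}:"
    problems = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        contig_id = row.get("contig_id") or ""
        if not contig_id.startswith(prefix) or len(contig_id) == len(prefix):
            problems.append(f"line {line_no}: bad contig_id {contig_id!r}")
    return problems


def check_biodata_files(rows):
    """Return messages for non-lowercase booleans and malformed SHA-256 values."""
    problems = []
    for line_no, row in enumerate(rows, start=2):
        compressed = row.get("compressed") or ""
        if compressed not in ("", "true", "false"):  # "" is the empty-value convention
            problems.append(f"line {line_no}: compressed={compressed!r} is not true/false")
        sha256 = row.get("sha256") or ""
        if sha256 and not SHA256_RE.fullmatch(sha256):
            problems.append(f"line {line_no}: sha256 is not a 64-character hex digest")
    return problems


if __name__ == "__main__":
    # Invented rows, for illustration only.
    demo = "contig_id,length\nBAC_S001:contig_1,5000\ncontig_2,1200\n"
    for message in check_contig_ids(csv.DictReader(io.StringIO(demo)), "BAC_S001"):
        print(message)
```

Both functions accept any iterable of dict rows, so they work directly on `csv.DictReader` output.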
Open the LOAD DATA queries in sequence. Steps 01–09 load the clinical/reference backbone from the root CSVs; steps 10–16 load the genomic data. For the per-sample steps (10–15), replace <sample_id> in the file:///<sample_id>/kg/... path with your actual sample id (e.g. BAC_S001). Every load uses MERGE and IN TRANSACTIONS OF 500 ROWS, so re-running is safe.
| Step | Query | Creates | Key relationship |
|---|---|---|---|
| 01 | Departments | Department | - |
| 02 | Wards | Ward | (Ward)-[:IN_DEPARTMENT]->(Department) |
| 03 | Patients | Patient | - |
| 04 | Admissions | Admission | (Patient)-[:HAS_ADMISSION]->(Admission) |
| 05 | Antibiotics | Antibiotic | - |
| 06 | Specimens | Specimen | (Specimen)-[:COLLECTED_FROM]->(Patient) |
| 07 | Organisms | Organism | - |
| 08 | Reference Genomes | ReferenceGenome | (ReferenceGenome)-[:REFERENCE_GENOME_OF]->(Organism) |
| 09 | Antibiotic Susceptibility (MIC/AST) | LabResult | (Specimen)-[:TESTED_FOR]->(LabResult)-[:AGAINST]->(Antibiotic) |
| 10 | Sample (pipeline) | Sample | - |
| 11 | Sample → Specimen link | - | (Sample)-[:DERIVED_FROM]->(Specimen) |
| 12 | Assembly (pipeline) | Assembly | (Sample)-[:HAS_ASSEMBLY]->(Assembly) |
| 13 | BioDataFile FASTA | BioDataFile | (Assembly)-[:PRODUCE]->(BioDataFile) |
| 14 | BioDataFile FASTQ | BioDataFile | (Assembly)-[:ASSEMBLED_FROM]->(BioDataFile) |
| 15 | Contigs (pipeline) | Contig | (BioDataFile)-[:HAS_CONTIG]->(Contig) |
| 16 | Variants & Features (Snippy) | Variant, Feature, VariantCallingRun | (VariantCallingRun)-[:CALLED]->(Variant)-[:AFFECTS]->(Feature); (ReferenceGenome)-[:HAS_FEATURE]->(Feature) |
| 17 | Taxonomic Classification (metagenomics) | TaxonomicClassification | (Sample)-[:CLASSIFIED_IN]->(:ProcessRun:TaxonomicClassification) (taxa in tc.taxa_json) |
| 18 | BioDataFile FASTQ (metagenomics) | BioDataFile | (TaxonomicClassification)-[:CLASSIFIED_FROM]->(BioDataFile) |
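Editing the `<sample_id>` placeholder by hand for every per-sample step is easy to get wrong when several samples are loaded. One option is to fill the placeholder in the copied query text before pasting it back into Neo4j Browser. This is a sketch under assumptions: `fill_sample_id` and `kg_file_urls` are hypothetical helpers, and the character check assumes sample ids are simple tokens like BAC_S001.

```python
import re

# File names from the kg/ layout described earlier on this page.
KG_FILES = ("sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv")

# Assumption: sample ids contain only letters, digits, '_', '.', and '-'.
_SAMPLE_ID_RE = re.compile(r"[A-Za-z0-9_.-]+")


def fill_sample_id(query: str, sample_id: str) -> str:
    """Replace every <sample_id> placeholder in a saved query with a real id."""
    if not _SAMPLE_ID_RE.fullmatch(sample_id):
        raise ValueError(f"unexpected characters in sample id {sample_id!r}")
    return query.replace("<sample_id>", sample_id)


def kg_file_urls(sample_id: str) -> list[str]:
    """List the file:/// URLs the per-sample steps would read for one sample."""
    return [fill_sample_id(f"file:///<sample_id>/kg/{name}", sample_id) for name in KG_FILES]


if __name__ == "__main__":
    for url in kg_file_urls("BAC_S001"):
        print(url)
```

Rejecting ids with slashes or spaces keeps a typo from silently producing a path that points outside the sample's subfolder.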
Step 11 is what stitches the genomic data onto the clinical backbone: it connects the pipeline's Sample to the clinical Specimen (so Sample.sample_id must match a specimen_id mapping in your Samples.csv). In the bundled example, BAC_S001 → SP003.
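A missing mapping does not make the load fail; the sample simply ends up disconnected from the clinical graph, so it is worth checking before running step 11. The sketch below assumes that Samples.csv has sample_id and specimen_id columns and that kg/sample.csv has a sample_id column; check the data dictionary for the actual headers. `unlinked_samples` is a hypothetical helper, and BAC_S999 in the demo is an invented id.

```python
import csv
import io


def unlinked_samples(clinical_rows, pipeline_rows):
    """Return pipeline sample ids with no specimen mapping in the clinical rows.

    Both inputs are iterables of dict rows, e.g. from csv.DictReader.
    """
    mapped = {
        row["sample_id"]
        for row in clinical_rows
        if row.get("sample_id") and row.get("specimen_id")  # "" counts as missing
    }
    return sorted({row["sample_id"] for row in pipeline_rows} - mapped)


if __name__ == "__main__":
    # BAC_S001 -> SP003 is the bundled example; BAC_S999 is invented.
    clinical = csv.DictReader(io.StringIO("sample_id,specimen_id\nBAC_S001,SP003\n"))
    pipeline = csv.DictReader(io.StringIO("sample_id\nBAC_S001\nBAC_S999\n"))
    print(unlinked_samples(clinical, pipeline))
```

Any id returned needs a row added to Samples.csv before the Sample → Specimen link can be created for it.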
The metagenomics steps (17–18) apply only to samples run through the metagenomics pipeline. They load the per-sample kg/ CSVs (taxonomic_classification.csv, meta_reads.csv); replace <sample_id> in the file:/// paths as usual. The :TaxonomicClassification ProcessRun subtype is a public NosoGraph extension (not in the core lib schema), built on the generic ProcessRun pattern and keyed on process_run_id. The identified taxa are not modelled as Organism nodes: Kraken2 output is an untrusted, per-run classification (produced before the curated DB is built) and isn't a graph query target, so it would pollute the curated Organism vocabulary. Instead they are stored as a single JSON-string property taxa_json on the TaxonomicClassification node (sorted by abundance, with an adaptive z-score "Other" bucket); unpack it with apoc.convert.fromJsonList. Step 18's row is header-only when the run had no --long_reads.
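Because taxa_json is an ordinary JSON string, it can also be inspected outside the database, for example straight from taxonomic_classification.csv. The sketch below assumes the string decodes to a list of objects with the fields used by the saved taxonomy query (sciname, taxid, rank, read_count, abundance) and that abundance is numeric. `top_taxa` is a hypothetical helper.

```python
import json


def top_taxa(taxa_json: str, n: int = 5) -> list[dict]:
    """Decode a taxa_json string and return the n most abundant entries."""
    if not taxa_json:  # empty values are "" by convention
        return []
    taxa = json.loads(taxa_json)
    # The stored list is described as already sorted by abundance; sorting
    # again is cheap and keeps the result correct if the order ever differs.
    return sorted(taxa, key=lambda taxon: taxon.get("abundance", 0), reverse=True)[:n]
```

Called on one row's taxa_json value, `top_taxa(value, 3)` gives the three most abundant entries, which may include the "Other" bucket.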
All of these are saved in the QUERIES folder of the imported templates file.
MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC;

MATCH (p:Patient)<-[:COLLECTED_FROM]-(sp:Specimen)<-[:DERIVED_FROM]-(s:Sample)
-[:HAS_ASSEMBLY]->(a:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
RETURN p.patient_id AS patient, sp.specimen_id AS specimen, s.sample_id AS sample,
a.assembler AS assembler, a.completeness AS completeness, count(c) AS contigs
ORDER BY patient;

MATCH (sp:Specimen)-[:TESTED_FOR]->(lr:LabResult:BacterialCulture)-[:AGAINST]->(ab:Antibiotic)
RETURN sp.specimen_id AS specimen, ab.class AS drug_class, ab.name AS antibiotic,
lr.value AS mic, lr.unit AS unit, lr.interpretation AS sir
ORDER BY specimen, drug_class, antibiotic;

Samples that share an identical contig sequence (by hash), a quick clonality / transmission signal:
MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)-[:HAS_CONTIG]->(c:Contig)
WITH c.sequence_hash AS sequence_hash, collect(DISTINCT s.sample_id) AS samples
WHERE size(samples) > 1
RETURN sequence_hash, samples, size(samples) AS n_samples
ORDER BY n_samples DESC;

MATCH (:VariantCallingRun)-[:CALLED]->(v:Variant)-[:AFFECTS]->(f:Feature)
RETURN f.locus_tag AS gene, count(v) AS variants,
collect(DISTINCT v.IMPACT) AS impacts
ORDER BY variants DESC;

A read-set QC glance: unpack each sample's taxa_json blob (species S + genus G, plus an "Other" bucket) into rows, ranked by abundance. This needs the APOC plugin (bundled with Neo4j Desktop):
MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
WHERE tc.taxa_json IS NOT NULL AND tc.taxa_json <> ''
UNWIND apoc.convert.fromJsonList(tc.taxa_json) AS taxon
RETURN s.sample_id AS sample, taxon.sciname AS organism, taxon.taxid AS taxid,
taxon.rank AS rank, taxon.read_count AS reads, taxon.abundance AS abundance
ORDER BY sample, abundance DESC;

CALL db.schema.visualization();

Run this in the Graph view to see the live node/relationship structure of your loaded database.
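To clear the database during development (this removes all nodes and relationships; in Neo4j Browser, prefix the query with :auto if it is rejected for not running in an implicit transaction):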
MATCH (n)
CALL { WITH n DETACH DELETE n }
IN TRANSACTIONS OF 1000 ROWS;

flowchart TD
Department
Ward -->|IN_DEPARTMENT| Department
Patient -->|HAS_ADMISSION| Admission
Specimen -->|COLLECTED_FROM| Patient
Specimen -->|TESTED_FOR| LabResult
LabResult -->|AGAINST| Antibiotic
Sample -->|DERIVED_FROM| Specimen
Sample -->|HAS_ASSEMBLY| Assembly
Assembly -->|PRODUCE| BioDataFile
Assembly -->|ASSEMBLED_FROM| BioDataFile
BioDataFile -->|HAS_CONTIG| Contig
ReferenceGenome -->|REFERENCE_GENOME_OF| Organism
ReferenceGenome -->|HAS_FEATURE| Feature
VariantCallingRun -->|CALLED| Variant
Variant -->|AFFECTS| Feature
Variant -->|AGAINST| ReferenceGenome
Sample -->|CLASSIFIED_IN| TaxonomicClassification
TaxonomicClassification -->|CLASSIFIED_FROM| BioDataFile
TaxonomicClassification (a :ProcessRun subtype) and its CLASSIFIED_IN / CLASSIFIED_FROM edges are the metagenomics public extension, present only when the metagenomics pipeline has been loaded. The identified taxa are carried as a taxa_json property on the node (not as Organism nodes), so there is no IDENTIFIED edge.
| Label | Key property | Notable properties |
|---|---|---|
| Department | department_id | name, description |
| Ward | ward_id | name, ward_type, department_id |
| Patient | patient_id | firstname, lastname, sex, date_of_birth |
| Admission | admission_id | ward_id, date_of_admission, date_of_discharge, length_of_stay |
| Specimen | specimen_id | specimen_type, specimen_class, category, collection_date |
| Antibiotic | antibiotic_id | name, abbreviation, class |
| LabResult | lab_id | result_type, value, unit, interpretation (S/I/R), test_date |
| Organism | taxid | sciname, strain |
| ReferenceGenome | accession_no | name, molecular_type, strain, taxid |
| Sample | sample_id | - |
| Assembly | assembly_id | assembler, completeness, contamination |
| BioDataFile | uri | file_type, compressed, sha256 |
| Contig | contig_id | length, coverage, sequence_hash |
| Feature | locus_tag | feature_type, biotype |
| Variant | (REF_ACC, hgvs_p) | POS, REF, ALT, EFFECT, IMPACT |
| TaxonomicClassification | process_run_id | process, tool, classified_reads, unclassified_reads, taxa_json (metagenomics extension) |
Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.