Knowledge graph Tutorial
This page covers installing Neo4j Desktop, creating a ViroWatch database, loading the kg/ CSVs produced by the pipeline, and running template surveillance queries.
Download the installer for your OS from the Neo4j Download Center.
| OS | Package |
|---|---|
| macOS | .dmg (drag Neo4j Desktop to /Applications) |
| Windows | .exe installer |
| Linux | .AppImage or .deb / .rpm (see the download page) |
Launch Neo4j Desktop and complete the first-run activation (requires a free account or activation key displayed on the download page).
Neo4j Desktop groups database files into a DBMS (Database Management System). One DBMS can hold multiple named databases.
- In the left sidebar click + New → Create local DBMS.
- Set a name, e.g. ViroWatch.
- Set a password and note it down; you will need it for any Bolt connections.
- Leave the Neo4j version at the default (5.x recommended).
- Click Create.
Click the Start button next to your new DBMS. The status indicator turns green when ready.
Click Open (or the browser icon) to open the Neo4j Browser query interface. You will use this for constraints, indexes, and ad-hoc queries.
The default Bolt URI for a local Neo4j Desktop DBMS is:
bolt://localhost:7687
This is what any application driver uses to connect.
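A minimal Python sketch of how an application might connect over this URI, assuming the official `neo4j` Python driver is installed (`pip install neo4j`). The environment variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` and both helper functions are illustrative and not part of ViroWatch; the user name `neo4j` is assumed to be the account of a freshly created local DBMS.

```python
import os

DEFAULT_BOLT_URI = "bolt://localhost:7687"  # default local URI given on this page


def connection_settings(env=os.environ):
    """Collect URI, user and password (environment variable names are hypothetical)."""
    return {
        "uri": env.get("NEO4J_URI", DEFAULT_BOLT_URI),
        "user": env.get("NEO4J_USER", "neo4j"),  # assumed default account
        "password": env.get("NEO4J_PASSWORD", ""),
    }


def check_connection(settings):
    """Open a driver, verify that the DBMS is reachable, then close the driver."""
    # Imported lazily so connection_settings() works without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(
        settings["uri"], auth=(settings["user"], settings["password"])
    ) as driver:
        driver.verify_connectivity()  # raises if the server cannot be reached
```

With the DBMS started and the password exported as `NEO4J_PASSWORD`, calling `check_connection(connection_settings())` should return silently; an exception usually means the DBMS is stopped or the password is wrong.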
Run these Cypher statements in Neo4j Browser before loading data. They enforce uniqueness, and the index behind each constraint lets the MERGE operations used during import look up existing nodes directly instead of scanning every node with that label.
Copy and paste each block into the Browser query box, then press Ctrl+Enter (or click the play button).
// Unique node keys
CREATE CONSTRAINT sample_id IF NOT EXISTS
FOR (n:Sample) REQUIRE n.sample_id IS UNIQUE;
CREATE CONSTRAINT assembly_id IF NOT EXISTS
FOR (n:Assembly) REQUIRE n.assembly_id IS UNIQUE;
CREATE CONSTRAINT contig_id IF NOT EXISTS
FOR (n:Contig) REQUIRE n.contig_id IS UNIQUE;
CREATE CONSTRAINT biodata_uri IF NOT EXISTS
FOR (n:BioDataFile) REQUIRE n.uri IS UNIQUE;
CREATE CONSTRAINT stanford_alignment_sha IF NOT EXISTS
FOR (n:StanfordHIVDRAlignment) REQUIRE n.result_sha256 IS UNIQUE;
CREATE CONSTRAINT prediction_id IF NOT EXISTS
FOR (n:StanfordHIVDRPrediction) REQUIRE n.prediction_id IS UNIQUE;
CREATE CONSTRAINT drug_name IF NOT EXISTS
FOR (n:Drug) REQUIRE n.name IS UNIQUE;
CREATE CONSTRAINT drug_class_name IF NOT EXISTS
FOR (n:DrugClass) REQUIRE n.name IS UNIQUE;
CREATE CONSTRAINT protein_abbr IF NOT EXISTS
FOR (n:Protein) REQUIRE n.abbreviation IS UNIQUE;
CREATE CONSTRAINT ref_genome_accession IF NOT EXISTS
FOR (n:ReferenceGenome) REQUIRE n.accession_no IS UNIQUE;
CREATE CONSTRAINT organism_taxid IF NOT EXISTS
FOR (n:Organism) REQUIRE n.taxid IS UNIQUE;
CREATE CONSTRAINT tcr_id IF NOT EXISTS
FOR (n:TaxonomicClassification) REQUIRE n.process_run_id IS UNIQUE;
// Composite uniqueness for Mutation (gene + text together identify a variant)
CREATE CONSTRAINT mutation_gene_text IF NOT EXISTS
FOR (n:Mutation) REQUIRE (n.gene, n.text) IS UNIQUE;

Verify all constraints were created:

SHOW CONSTRAINTS;

Neo4j Browser's LOAD CSV can only read files placed in the DBMS's designated import directory. Locate it by clicking ⋮ (three-dot menu) → Open folder → Import on your DBMS tile in Neo4j Desktop. The default paths are:
| OS | Import directory |
|---|---|
| macOS | ~/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import/ |
| Windows | %LOCALAPPDATA%\Neo4j\relate-data\dbmss\<dbms-id>\import\ |
| Linux | ~/.config/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import/ |
For each sample, copy the kg/ folder into the import directory preserving the sample ID as a subdirectory:
# Replace <import_dir> and <sample_id> with your values
cp -r results/<sample_id>/kg <import_dir>/<sample_id>/

After copying, the import directory should look like:

<import_dir>/
└── test_01/
    ├── sample.csv
    ├── assembly.csv
    ├── biodata_files.csv
    ├── contigs.csv
    ├── stanford_alignments.csv
    ├── stanford_predictions.csv
    ├── mutations.csv
    ├── blast_hits.csv                  # only if BLAST was enabled
    └── taxonomic_classification.csv    # only if --kraken2_db was enabled
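The copy step can also be scripted. The following is a small Python sketch, assuming the `results/<sample_id>/kg` layout and the file names shown in the listing above; `stage_sample` and `REQUIRED_CSVS` are hypothetical names chosen for this example, and the two optional files (`blast_hits.csv`, `taxonomic_classification.csv`) are deliberately not checked.

```python
import shutil
from pathlib import Path

# Files the listing above shows without an "only if" note.
REQUIRED_CSVS = [
    "sample.csv",
    "assembly.csv",
    "biodata_files.csv",
    "contigs.csv",
    "stanford_alignments.csv",
    "stanford_predictions.csv",
    "mutations.csv",
]


def stage_sample(results_dir, import_dir, sample_id):
    """Copy results/<sample_id>/kg into <import_dir>/<sample_id>/.

    Returns the names of required CSVs that are missing after the copy.
    """
    src = Path(results_dir) / sample_id / "kg"
    dest = Path(import_dir) / sample_id
    # copytree places the contents of kg/ directly under <import_dir>/<sample_id>/
    # and, with dirs_exist_ok=True (Python 3.8+), tolerates a repeated copy.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return [name for name in REQUIRED_CSVS if not (dest / name).is_file()]
```

An empty list from `stage_sample("results", "<import_dir>", "test_01")` suggests the sample is ready for the load queries; any names it returns point to pipeline outputs that were not produced.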
ViroWatch ships with assets/virowatch_cypher_templates.csv — a Neo4j-compatible saved-queries file containing all the load and surveillance queries pre-written.
To import it into Neo4j Desktop:
- Open Neo4j Browser for your running DBMS.
- Click the bookmark icon (Saved Cypher) in the left sidebar.
- Click Import and select assets/virowatch_cypher_templates.csv.
You will see four folders appear: SETUP, LOAD DATA, QUERIES, and UTILITIES.
Open each query from the LOAD DATA folder in sequence. Before running, replace <sample_id> in the FROM path with your actual sample ID (e.g. test_01):
| Step | Query | Nodes and edges created |
|---|---|---|
| 1 | 01 Sample | Sample |
| 2 | 02 Assembly | Assembly, (Sample)-[:HAS_ASSEMBLY]→ |
| 3 | 03 BioDataFile FASTA | BioDataFile (consensus FASTA), PRODUCE edge |
| 4 | 04 BioDataFile FASTQ | BioDataFile (input reads), ASSEMBLED_FROM edge |
| 5 | 05 Contigs | Contig, (BioDataFile)-[:HAS_CONTIG]→ |
| 6 | 06 Stanford Alignments | StanfordHIVDRAlignment, Protein, contig edge |
| 7 | 07 Stanford Predictions | StanfordHIVDRPrediction, Drug, DrugClass |
| 8 | 08 Mutations | Mutation, (Alignment)-[:FOUND]→ |
| 9 | 09 BLAST Hits | ReferenceGenome, Organism, contig edge |
| 10 | 10 Taxonomic Classification | ProcessRun:TaxonomicClassification, (Sample)-[:CLASSIFIED_IN]→, (…)-[:CLASSIFIED_FROM]→(BioDataFile) (only if --kraken2_db was enabled) |
All queries use MERGE, so re-running on the same sample is safe.
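Replacing the `<sample_id>` placeholder by hand is error-prone when many samples are loaded. Below is a minimal Python sketch of the substitution. The `TEMPLATE` string is a hypothetical query written only to illustrate the shape of a LOAD DATA entry (including the assumed CSV column `sample_id`); the queries shipped in assets/virowatch_cypher_templates.csv may differ.

```python
import re

PLACEHOLDER = "<sample_id>"
# Conservative whitelist so the ID cannot break out of the quoted file URL.
SAFE_SAMPLE_ID = re.compile(r"[A-Za-z0-9_.-]+")


def fill_sample_id(query, sample_id):
    """Return the query text with every <sample_id> placeholder replaced."""
    if not SAFE_SAMPLE_ID.fullmatch(sample_id):
        raise ValueError(f"unexpected characters in sample ID: {sample_id!r}")
    if PLACEHOLDER not in query:
        raise ValueError("query contains no <sample_id> placeholder")
    return query.replace(PLACEHOLDER, sample_id)


# Hypothetical template; file:/// URLs resolve relative to the import directory.
TEMPLATE = (
    "LOAD CSV WITH HEADERS FROM 'file:///<sample_id>/sample.csv' AS row\n"
    "MERGE (s:Sample {sample_id: row.sample_id})"
)

print(fill_sample_id(TEMPLATE, "test_01"))
```

Because the sketch only rewrites text, the resulting query can be pasted into Neo4j Browser or sent through a driver unchanged.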
Kraken2 QC subgraph. 10 Taxonomic Classification is a read-set QC glance, not a searchable taxonomy graph: the identified taxa are not materialised as Organism nodes but ride along as a JSON-string property taxa_json (already sorted by abundance, with trace taxa folded into an "Other" bucket) on the TaxonomicClassification node. Its CLASSIFIED_FROM edge reuses the input FASTQ BioDataFile from step 4, so run step 4 first if you want that link (it is skipped silently otherwise).
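When the property is fetched through a driver rather than viewed in Browser, the JSON string can be expanded client-side without APOC. The sketch below assumes that taxa_json holds a JSON array of objects with the fields sciname, rank, read_count and abundance (the field names used by the APOC query later on this page); the example string is made up for illustration.

```python
import json


def taxa_rows(sample_id, taxa_json):
    """Expand a taxa_json string into one dict per taxon, preserving its order."""
    return [
        {
            "sample": sample_id,
            "organism": taxon.get("sciname"),
            "rank": taxon.get("rank"),
            "reads": taxon.get("read_count"),
            "abundance": taxon.get("abundance"),
        }
        for taxon in json.loads(taxa_json)
    ]


# Made-up value for illustration only; real strings come from tc.taxa_json.
example = (
    '[{"sciname": "Human immunodeficiency virus 1", "rank": "species",'
    ' "read_count": 900, "abundance": 0.9},'
    ' {"sciname": "Other", "rank": null, "read_count": 100, "abundance": 0.1}]'
)

for row in taxa_rows("test_01", example):
    print(row["organism"], row["reads"])
```

Using `.get` keeps the sketch tolerant of entries that lack a field, such as an "Other" bucket without a rank.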
All queries below are also available as saved queries in assets/virowatch_cypher_templates.csv under the QUERIES folder.
MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC;

MATCH ()-[r]->()
RETURN type(r) AS relationship, count(r) AS count
ORDER BY count DESC;

All resistance predictions at level ≥ 3 (low-level resistance and above):
MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
-[:PREDICTS_RESISTANCE_TO]->(d:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 3
RETURN s.sample_id AS sample,
dc.name AS drug_class,
d.name AS drug,
pred.level AS level,
pred.interpretation AS resistance
ORDER BY sample, drug_class, drug;

Level 5 = high-level resistance in the Stanford HIVDB scoring system.
MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
-[:PREDICTS_RESISTANCE_TO]->(d:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 5
RETURN s.sample_id, dc.name AS drug_class, d.name AS drug, pred.interpretation
ORDER BY s.sample_id, dc.name, d.name;

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
-[:HAS_CONTIG]->(c:Contig)-[:HAS_BLAST_HIT]->(rg:ReferenceGenome)
-[:REFERENCE_GENOME_OF]->(o:Organism)
RETURN s.sample_id AS sample,
c.contig_id AS contig,
o.sciname AS subtype,
rg.accession_no AS accession
ORDER BY sample, contig;

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
-[:HAS_CONTIG]->(c:Contig)
-[:HAS_STANFORD_HIVDR_ALIGNMENT]->(al:StanfordHIVDRAlignment)
-[:FOUND]->(m:Mutation)
RETURN s.sample_id AS sample,
m.gene AS gene,
collect(m.text) AS mutations,
count(m) AS mutation_count
ORDER BY sample, gene;

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
-[:HAS_CONTIG]->(:Contig)
-[:HAS_STANFORD_HIVDR_ALIGNMENT]->(al:StanfordHIVDRAlignment)
-[:FOUND]->(m:Mutation)
WHERE m.is_sdrm = 'true'
RETURN s.sample_id AS sample,
m.gene AS gene,
m.text AS mutation,
m.primary_type AS type
ORDER BY sample, gene, mutation;

Samples with high-level resistance (≥ 5) in two or more drug classes, a potential treatment-failure flag:
MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
-[:PREDICTS_RESISTANCE_TO]->(:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 5
WITH s, collect(DISTINCT dc.name) AS resistant_classes
WHERE size(resistant_classes) >= 2
RETURN s.sample_id AS sample, resistant_classes
ORDER BY sample;

Retrieve the complete graph path from a single sample through to its BLAST subtype:
MATCH path =
(s:Sample {sample_id: 'test_01'})
-[:HAS_ASSEMBLY]->(a:Assembly)
-[:PRODUCE]->(f:BioDataFile)
-[:HAS_CONTIG]->(c:Contig)
-[:HAS_BLAST_HIT]->(rg:ReferenceGenome)
-[:REFERENCE_GENOME_OF]->(o:Organism)
RETURN path;

Run this in Neo4j Browser and switch the result to Graph view (not Table view) to see the visual traversal.
Per-sample Kraken2 read-set QC: classified/unclassified read counts plus the taxa summary. taxa_json is already sorted by abundance with trace taxa folded into an "Other" bucket — eyeball it directly, or parse it with apoc.convert.fromJson if APOC is installed.
MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
RETURN s.sample_id AS sample,
tc.tool AS tool,
tc.classified_reads AS classified,
tc.unclassified_reads AS unclassified,
tc.taxa_json AS taxa
ORDER BY sample;

To expand the JSON into one row per taxon (requires APOC):
MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
UNWIND apoc.convert.fromJson(tc.taxa_json) AS taxon
RETURN s.sample_id AS sample,
taxon.sciname AS organism,
taxon.rank AS rank,
taxon.read_count AS reads,
taxon.abundance AS abundance
ORDER BY sample, abundance DESC;

Useful during development. This removes all nodes and relationships in the current database.
// Run in batches to avoid heap issues on large graphs
CALL apoc.periodic.iterate(
"MATCH (n) RETURN n",
"DETACH DELETE n",
{batchSize: 1000}
)
YIELD batches, total
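RETURN batches, total;

If APOC is not installed, use this instead (slower, but it needs no plugins on Neo4j 4.4 or later). In Neo4j Browser, prefix the query with :auto, because CALL { ... } IN TRANSACTIONS only runs in an implicit transaction: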
MATCH (n)
CALL { WITH n DETACH DELETE n }
IN TRANSACTIONS OF 1000 ROWS;

flowchart TD
Sample -->|HAS_ASSEMBLY| Assembly
Assembly -->|PRODUCE| BioDataFile
Assembly -->|ASSEMBLED_FROM| BioDataFile
BioDataFile -->|HAS_CONTIG| Contig
Contig -->|HAS_STANFORD_HIVDR_ALIGNMENT| StanfordHIVDRAlignment
StanfordHIVDRAlignment -->|ALIGNED_TO| Protein
StanfordHIVDRAlignment -->|FOUND| Mutation
Sample -->|HAS_STANFORD_HIVDR_PREDICTION| StanfordHIVDRPrediction
Contig -->|HAS_STANFORD_HIVDR_PREDICTION| StanfordHIVDRPrediction
StanfordHIVDRPrediction -->|PREDICTS_RESISTANCE_TO| Drug
Drug -->|IN_DRUG_CLASS| DrugClass
Contig -->|HAS_BLAST_HIT| ReferenceGenome
ReferenceGenome -->|REFERENCE_GENOME_OF| Organism
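The same schema can be held as a small lookup table, which is handy for checking that a traversal pattern is valid before writing it as Cypher. This Python sketch copies only the edges drawn in the flowchart above (the Kraken2 edges CLASSIFIED_IN and CLASSIFIED_FROM are not in the diagram and are therefore omitted); `SCHEMA` and `walk` are hypothetical names for this example.

```python
# (source label, relationship type) -> target label, copied from the flowchart.
SCHEMA = {
    ("Sample", "HAS_ASSEMBLY"): "Assembly",
    ("Assembly", "PRODUCE"): "BioDataFile",
    ("Assembly", "ASSEMBLED_FROM"): "BioDataFile",
    ("BioDataFile", "HAS_CONTIG"): "Contig",
    ("Contig", "HAS_STANFORD_HIVDR_ALIGNMENT"): "StanfordHIVDRAlignment",
    ("StanfordHIVDRAlignment", "ALIGNED_TO"): "Protein",
    ("StanfordHIVDRAlignment", "FOUND"): "Mutation",
    ("Sample", "HAS_STANFORD_HIVDR_PREDICTION"): "StanfordHIVDRPrediction",
    ("Contig", "HAS_STANFORD_HIVDR_PREDICTION"): "StanfordHIVDRPrediction",
    ("StanfordHIVDRPrediction", "PREDICTS_RESISTANCE_TO"): "Drug",
    ("Drug", "IN_DRUG_CLASS"): "DrugClass",
    ("Contig", "HAS_BLAST_HIT"): "ReferenceGenome",
    ("ReferenceGenome", "REFERENCE_GENOME_OF"): "Organism",
}


def walk(start_label, relationship_types):
    """Follow relationship types from a start label and return the labels visited."""
    labels = [start_label]
    for rel in relationship_types:
        key = (labels[-1], rel)
        if key not in SCHEMA:
            raise KeyError(f"{labels[-1]} has no outgoing {rel} edge in the schema")
        labels.append(SCHEMA[key])
    return labels
```

For instance, `walk("Sample", ["HAS_ASSEMBLY", "PRODUCE", "HAS_CONTIG", "HAS_BLAST_HIT", "REFERENCE_GENOME_OF"])` follows the provenance path used by the full-trace query, while a misspelled relationship type raises a `KeyError` naming the label where the path breaks.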
| Label | Key property | Notable properties |
|---|---|---|
| Sample | sample_id | created_at |
| Assembly | assembly_id | assembler, created_at |
| BioDataFile | uri | file_type, compressed, sha256 |
| Contig | contig_id | length, coverage, is_circular, sequence_hash |
| StanfordHIVDRAlignment | result_sha256 | database_version, timestamp |
| StanfordHIVDRPrediction | prediction_id | score, level, interpretation |
| Mutation | (gene, text) | is_sdrm, primary_type, is_unusual |
| Drug | name | full_name, display_abbr |
| DrugClass | name | none |
| Protein | abbreviation | none |
| ReferenceGenome | accession_no | name, source_database |
| Organism | taxid | sciname |
| Level | Interpretation |
|---|---|
| 1 | Susceptible |
| 2 | Potential low-level resistance |
| 3 | Low-level resistance |
| 4 | Intermediate resistance |
| 5 | High-level resistance |
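For post-processing query results outside Neo4j, the level table and the multi-class flag can be mirrored in a few lines of Python. This is a sketch under assumptions: rows are taken to be `(sample, drug_class, level)` tuples, level is cast with `int()` because the queries on this page cast it with toInteger, and the function name and example rows are invented for illustration.

```python
# Level -> interpretation, copied from the table above.
LEVELS = {
    1: "Susceptible",
    2: "Potential low-level resistance",
    3: "Low-level resistance",
    4: "Intermediate resistance",
    5: "High-level resistance",
}


def multi_class_resistant(rows, min_level=5, min_classes=2):
    """Map each flagged sample to its sorted drug classes.

    A class counts when it has a prediction at or above min_level; a sample
    is flagged when at least min_classes distinct classes count.
    """
    classes = {}
    for sample, drug_class, level in rows:
        if int(level) >= min_level:  # level may arrive as a string
            classes.setdefault(sample, set()).add(drug_class)
    return {s: sorted(c) for s, c in classes.items() if len(c) >= min_classes}


# Made-up rows for illustration only.
rows = [
    ("test_01", "NRTI", "5"),
    ("test_01", "NNRTI", "5"),
    ("test_02", "PI", "5"),
    ("test_02", "PI", "3"),
]
print(multi_class_resistant(rows))
print(LEVELS[5])
```

Lowering `min_level` to 3 applies the same grouping to every prediction of low-level resistance or above.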
Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.