Knowledge graph Tutorial

minaminii edited this page Jul 1, 2026 · 2 revisions

This page covers installing Neo4j Desktop, creating a ViroWatch database, loading the kg/ CSVs produced by the pipeline, and running template surveillance queries.


1. Install Neo4j Desktop

Download the installer for your OS from the Neo4j Download Center.

| OS | Package |
|---|---|
| macOS | .dmg (drag Neo4j Desktop to /Applications) |
| Windows | .exe installer |
| Linux | .AppImage or .deb / .rpm (see the download page) |

Launch Neo4j Desktop and complete the first-run activation (requires a free account or activation key displayed on the download page).


2. Create a DBMS and database

Neo4j Desktop groups database files into a DBMS (Database Management System). One DBMS can hold multiple named databases.

2a. Create a DBMS

  1. In the left sidebar click + New → Create local DBMS.
  2. Set a name, e.g. ViroWatch.
  3. Set a password — note it down; you will need it for any Bolt connections.
  4. Leave the Neo4j version at the default (5.x recommended).
  5. Click Create.

2b. Start the DBMS

Click the Start button next to your new DBMS. The status indicator turns green when ready.

2c. Open Neo4j Browser

Click Open (or the browser icon) to open the Neo4j Browser query interface. You will use this for constraints, indexes, and ad-hoc queries.

2d. Note the Bolt URI

The default Bolt URI for a local Neo4j Desktop DBMS is:

bolt://localhost:7687

This is what any application driver uses to connect.
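
Before wiring up a driver, it can help to confirm that something is actually listening on the Bolt port. The sketch below uses only the Python standard library. `parse_bolt_uri` and `bolt_port_open` are hypothetical helper names, not part of any Neo4j package, and an open port only shows that a listener exists, not that your password will be accepted.

```python
import socket
from urllib.parse import urlparse


def parse_bolt_uri(uri):
    """Split a Bolt URI into (host, port); 7687 is the default shown above."""
    parsed = urlparse(uri)
    if parsed.scheme != "bolt":
        raise ValueError(f"expected a bolt:// URI, got {uri!r}")
    return parsed.hostname, parsed.port or 7687


def bolt_port_open(uri, timeout=2.0):
    """Return True if a TCP connection to the URI's host and port succeeds."""
    host, port = parse_bolt_uri(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unresolved host all count as "closed".
        return False
```

Calling `bolt_port_open("bolt://localhost:7687")` after starting the DBMS in step 2b should return `True`. If it returns `False`, check that the DBMS is running and that no other service has claimed the port.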


3. Create constraints and indexes

Run these Cypher statements in Neo4j Browser before loading data. Each uniqueness constraint is backed by an index, so besides enforcing uniqueness it lets the MERGE operations used during import look nodes up by key instead of scanning every node with that label.

Copy and paste each block into the Browser query box, then press Ctrl+Enter (Cmd+Enter on macOS) or click the play button.

// Unique node keys
CREATE CONSTRAINT sample_id IF NOT EXISTS
  FOR (n:Sample) REQUIRE n.sample_id IS UNIQUE;

CREATE CONSTRAINT assembly_id IF NOT EXISTS
  FOR (n:Assembly) REQUIRE n.assembly_id IS UNIQUE;

CREATE CONSTRAINT contig_id IF NOT EXISTS
  FOR (n:Contig) REQUIRE n.contig_id IS UNIQUE;

CREATE CONSTRAINT biodata_uri IF NOT EXISTS
  FOR (n:BioDataFile) REQUIRE n.uri IS UNIQUE;

CREATE CONSTRAINT stanford_alignment_sha IF NOT EXISTS
  FOR (n:StanfordHIVDRAlignment) REQUIRE n.result_sha256 IS UNIQUE;

CREATE CONSTRAINT prediction_id IF NOT EXISTS
  FOR (n:StanfordHIVDRPrediction) REQUIRE n.prediction_id IS UNIQUE;

CREATE CONSTRAINT drug_name IF NOT EXISTS
  FOR (n:Drug) REQUIRE n.name IS UNIQUE;

CREATE CONSTRAINT drug_class_name IF NOT EXISTS
  FOR (n:DrugClass) REQUIRE n.name IS UNIQUE;

CREATE CONSTRAINT protein_abbr IF NOT EXISTS
  FOR (n:Protein) REQUIRE n.abbreviation IS UNIQUE;

CREATE CONSTRAINT ref_genome_accession IF NOT EXISTS
  FOR (n:ReferenceGenome) REQUIRE n.accession_no IS UNIQUE;

CREATE CONSTRAINT organism_taxid IF NOT EXISTS
  FOR (n:Organism) REQUIRE n.taxid IS UNIQUE;

CREATE CONSTRAINT tcr_id IF NOT EXISTS
  FOR (n:TaxonomicClassification) REQUIRE n.process_run_id IS UNIQUE;

// Composite uniqueness for Mutation (gene + text together identify a variant)
CREATE CONSTRAINT mutation_gene_text IF NOT EXISTS
  FOR (n:Mutation) REQUIRE (n.gene, n.text) IS UNIQUE;

Verify all constraints were created:

SHOW CONSTRAINTS;
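
To compare the `SHOW CONSTRAINTS` output against the statements above without counting rows by eye, a small set comparison is enough. This is a sketch: `EXPECTED_CONSTRAINTS` lists the constraint names from this section, while `missing_constraints` is a hypothetical helper that expects the values of the `name` column, however you choose to export them.

```python
# Constraint names taken from the CREATE CONSTRAINT statements in this section.
EXPECTED_CONSTRAINTS = {
    "sample_id", "assembly_id", "contig_id", "biodata_uri",
    "stanford_alignment_sha", "prediction_id", "drug_name",
    "drug_class_name", "protein_abbr", "ref_genome_accession",
    "organism_taxid", "tcr_id", "mutation_gene_text",
}


def missing_constraints(found_names):
    """Return the expected constraint names absent from found_names, sorted."""
    return sorted(EXPECTED_CONSTRAINTS - set(found_names))
```

An empty list means every constraint from this section is present. Extra constraints in the database are ignored.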

4. Load data

4a. Find the import directory

Neo4j Browser's LOAD CSV can only read files placed in the DBMS's designated import directory. Locate it by clicking ⋮ (three-dot menu) → Open folder → Import on your DBMS tile in Neo4j Desktop. The default paths are:

| OS | Import directory |
|---|---|
| macOS | `~/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import/` |
| Windows | `%LOCALAPPDATA%\Neo4j\relate-data\dbmss\<dbms-id>\import\` |
| Linux | `~/.config/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import/` |
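
If you script the copy step, the default locations in the table above can be assembled programmatically. A minimal sketch, assuming those defaults still apply to your Neo4j Desktop version. `default_import_dir` is a hypothetical helper, and the DBMS ID itself still has to be looked up through the Open folder menu.

```python
import os
import platform
from pathlib import Path


def default_import_dir(dbms_id, system=None, env=None, home=None):
    """Build the default import directory from the table above.

    system: "Darwin", "Windows" or "Linux" (defaults to the current OS).
    env and home can be injected so the function works on any machine.
    """
    system = system or platform.system()
    if system == "Windows":
        env = os.environ if env is None else env
        base = Path(env["LOCALAPPDATA"]) / "Neo4j"
    elif system in ("Darwin", "Linux"):
        home = Path.home() if home is None else Path(home)
        if system == "Darwin":
            base = home / "Library" / "Application Support"
        else:
            base = home / ".config"
        base = base / "Neo4j Desktop" / "Application"
    else:
        raise ValueError(f"no default known for {system!r}")
    return base / "relate-data" / "dbmss" / dbms_id / "import"
```

Treat the result as a starting guess and confirm that the directory exists before copying anything into it.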

4b. Copy CSVs to the import directory

For each sample, copy the kg/ folder into the import directory preserving the sample ID as a subdirectory:

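# Replace <import_dir> and <sample_id> with your values.
# Quote the paths: the default import directory contains spaces.
mkdir -p "<import_dir>/<sample_id>"
# The trailing /. copies the contents of kg/, so re-running does not nest a kg/ folder.
cp -r "results/<sample_id>/kg/." "<import_dir>/<sample_id>/"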

After copying, the import directory should look like:

<import_dir>/
└── test_01/
    ├── sample.csv
    ├── assembly.csv
    ├── biodata_files.csv
    ├── contigs.csv
    ├── stanford_alignments.csv
    ├── stanford_predictions.csv
    ├── mutations.csv
    ├── blast_hits.csv          # only if BLAST was enabled
    └── taxonomic_classification.csv   # only if --kraken2_db was enabled
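
The copy and the layout check can also be scripted, which is convenient for several samples. The sketch below assumes the `results/<sample_id>/kg` layout and the file names shown above. `copy_kg` and `missing_csvs` are hypothetical helpers, and the two optional CSVs are deliberately left out of the required list.

```python
import shutil
from pathlib import Path

# File names from the directory listing above; blast_hits.csv and
# taxonomic_classification.csv are optional and therefore not required here.
REQUIRED_CSVS = [
    "sample.csv", "assembly.csv", "biodata_files.csv", "contigs.csv",
    "stanford_alignments.csv", "stanford_predictions.csv", "mutations.csv",
]


def copy_kg(results_dir, import_dir, sample_id):
    """Copy the contents of results/<sample_id>/kg into <import_dir>/<sample_id>."""
    src = Path(results_dir) / sample_id / "kg"
    dst = Path(import_dir) / sample_id
    # dirs_exist_ok (Python 3.8+) lets repeated runs overwrite files in place.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


def missing_csvs(sample_dir):
    """Return the required CSV names that are not present in sample_dir."""
    sample_dir = Path(sample_dir)
    return [name for name in REQUIRED_CSVS if not (sample_dir / name).is_file()]
```

A non-empty result from `missing_csvs` usually means the pipeline run was incomplete, so it is worth checking before running any load query.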

4c. Import the Cypher templates

ViroWatch ships with assets/virowatch_cypher_templates.csv — a Neo4j-compatible saved-queries file containing all the load and surveillance queries pre-written.

To import it into Neo4j Desktop:

  1. Open Neo4j Browser for your running DBMS.
  2. Click the bookmark icon (Saved Cypher) in the left sidebar.
  3. Click Import and select assets/virowatch_cypher_templates.csv.

You will see four folders appear: SETUP, LOAD DATA, QUERIES, and UTILITIES.

4d. Run the load queries in order

Open each query from the LOAD DATA folder in sequence. Before running, replace <sample_id> in the FROM path with your actual sample ID (e.g. test_01):

| Step | Query | Nodes and relationships created |
|---|---|---|
| 1 | 01 Sample | Sample |
| 2 | 02 Assembly | Assembly, (Sample)-[:HAS_ASSEMBLY]→ |
| 3 | 03 BioDataFile FASTA | BioDataFile (consensus FASTA), PRODUCE edge |
| 4 | 04 BioDataFile FASTQ | BioDataFile (input reads), ASSEMBLED_FROM edge |
| 5 | 05 Contigs | Contig, (BioDataFile)-[:HAS_CONTIG]→ |
| 6 | 06 Stanford Alignments | StanfordHIVDRAlignment, Protein, contig edge |
| 7 | 07 Stanford Predictions | StanfordHIVDRPrediction, Drug, DrugClass |
| 8 | 08 Mutations | Mutation, (Alignment)-[:FOUND]→ |
| 9 | 09 BLAST Hits | ReferenceGenome, Organism, contig edge |
| 10 | 10 Taxonomic Classification | ProcessRun:TaxonomicClassification, (Sample)-[:CLASSIFIED_IN]→, (…)-[:CLASSIFIED_FROM]→(BioDataFile); only if --kraken2_db was enabled |

All queries use MERGE, so re-running on the same sample is safe.
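
Editing the path by hand for every sample is error-prone, so the placeholder substitution can be scripted. This is a sketch only: the `LOAD CSV` statement below is an illustrative stand-in rather than the text of the shipped template, and both the `sample_id` column name and the `fill_sample_id` helper are assumptions.

```python
import re

# Illustrative stand-in for a LOAD DATA query; the shipped template may differ.
EXAMPLE_TEMPLATE = (
    "LOAD CSV WITH HEADERS FROM 'file:///<sample_id>/sample.csv' AS row\n"
    "MERGE (s:Sample {sample_id: row.sample_id});"
)


def fill_sample_id(query, sample_id):
    """Replace every <sample_id> placeholder after validating the ID."""
    # Allow only characters that are safe in a file:/// URL and a quoted string,
    # and require an alphanumeric first character to rule out "..".
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_.-]*", sample_id):
        raise ValueError(f"unsafe sample ID: {sample_id!r}")
    if "<sample_id>" not in query:
        raise ValueError("query has no <sample_id> placeholder")
    return query.replace("<sample_id>", sample_id)


filled = fill_sample_id(EXAMPLE_TEMPLATE, "test_01")
```

Here `filled` reads from `file:///test_01/sample.csv`, matching the import directory layout from step 4b. The validation is intentionally strict so that a stray quote or slash in a sample ID cannot alter the query.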

Kraken2 QC subgraph. 10 Taxonomic Classification is a read-set QC glance, not a searchable taxonomy graph: the identified taxa are not materialised as Organism nodes but ride along as a JSON-string property taxa_json (already sorted by abundance, with trace taxa folded into an "Other" bucket) on the TaxonomicClassification node. Its CLASSIFIED_FROM edge reuses the input FASTQ BioDataFile from step 4, so run step 4 first if you want that link (it is skipped silently otherwise).


5. Template queries

All queries below are also available as saved queries in assets/virowatch_cypher_templates.csv under the QUERIES folder.

5.1 Node count summary

MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC;

5.2 Relationship count summary

MATCH ()-[r]->()
RETURN type(r) AS relationship, count(r) AS count
ORDER BY count DESC;

5.3 Per-sample drug resistance summary

All resistance predictions at level ≥ 3 (low-level resistance and above):

MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
      -[:PREDICTS_RESISTANCE_TO]->(d:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 3
RETURN s.sample_id     AS sample,
       dc.name         AS drug_class,
       d.name          AS drug,
       pred.level      AS level,
       pred.interpretation AS resistance
ORDER BY sample, drug_class, drug;

5.4 High-resistance predictions only (level ≥ 5)

Level 5 = high-level resistance in the Stanford HIVDB scoring system.

MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
      -[:PREDICTS_RESISTANCE_TO]->(d:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 5
RETURN s.sample_id, dc.name AS drug_class, d.name AS drug, pred.interpretation
ORDER BY s.sample_id, dc.name, d.name;

5.5 Subtype identification via BLAST

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
      -[:HAS_CONTIG]->(c:Contig)-[:HAS_BLAST_HIT]->(rg:ReferenceGenome)
      -[:REFERENCE_GENOME_OF]->(o:Organism)
RETURN s.sample_id       AS sample,
       c.contig_id       AS contig,
       o.sciname         AS subtype,
       rg.accession_no   AS accession
ORDER BY sample, contig;

5.6 Mutations per gene per sample

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
      -[:HAS_CONTIG]->(c:Contig)
      -[:HAS_STANFORD_HIVDR_ALIGNMENT]->(al:StanfordHIVDRAlignment)
      -[:FOUND]->(m:Mutation)
RETURN s.sample_id AS sample,
       m.gene      AS gene,
       collect(m.text) AS mutations,
       count(m)    AS mutation_count
ORDER BY sample, gene;

5.7 SDRMs (Surveillance Drug Resistance Mutations)

MATCH (s:Sample)-[:HAS_ASSEMBLY]->(:Assembly)-[:PRODUCE]->(:BioDataFile)
      -[:HAS_CONTIG]->(:Contig)
      -[:HAS_STANFORD_HIVDR_ALIGNMENT]->(al:StanfordHIVDRAlignment)
      -[:FOUND]->(m:Mutation)
WHERE m.is_sdrm = 'true'
RETURN s.sample_id AS sample,
       m.gene      AS gene,
       m.text      AS mutation,
       m.primary_type AS type
ORDER BY sample, gene, mutation;

5.8 Samples with multi-class resistance

Samples with high-level resistance (≥ 5) in two or more drug classes — potential treatment-failure flag:

MATCH (s:Sample)-[:HAS_STANFORD_HIVDR_PREDICTION]->(pred:StanfordHIVDRPrediction)
      -[:PREDICTS_RESISTANCE_TO]->(:Drug)-[:IN_DRUG_CLASS]->(dc:DrugClass)
WHERE toInteger(pred.level) >= 5
WITH s, collect(DISTINCT dc.name) AS resistant_classes
WHERE size(resistant_classes) >= 2
RETURN s.sample_id AS sample, resistant_classes
ORDER BY sample;
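
The aggregation in this query (collect the distinct classes per sample, then keep samples with two or more) can be mirrored client-side, for instance when post-processing exported rows. A sketch under assumptions: the rows are dictionaries with `sample`, `drug_class` and `level` keys, the example values are made up, and `multi_class_resistant` is a hypothetical helper.

```python
from collections import defaultdict


def multi_class_resistant(rows, min_level=5, min_classes=2):
    """Map sample -> sorted drug classes with a prediction at or above min_level."""
    classes = defaultdict(set)
    for row in rows:
        # int() plays the role of toInteger() in the Cypher query.
        if int(row["level"]) >= min_level:
            classes[row["sample"]].add(row["drug_class"])
    return {s: sorted(c) for s, c in classes.items() if len(c) >= min_classes}


# Made-up rows for illustration only.
rows = [
    {"sample": "test_01", "drug_class": "NRTI", "level": "5"},
    {"sample": "test_01", "drug_class": "NNRTI", "level": "5"},
    {"sample": "test_02", "drug_class": "PI", "level": "5"},
    {"sample": "test_02", "drug_class": "NRTI", "level": "3"},
]
print(multi_class_resistant(rows))  # {'test_01': ['NNRTI', 'NRTI']}
```

`test_02` is dropped because only one of its classes reaches level 5, which is the same filter that `size(resistant_classes) >= 2` applies in the query.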

5.9 Full sample profile (graph traversal)

Retrieve the complete graph path from a single sample through to its BLAST subtype:

MATCH path =
  (s:Sample {sample_id: 'test_01'})
  -[:HAS_ASSEMBLY]->(a:Assembly)
  -[:PRODUCE]->(f:BioDataFile)
  -[:HAS_CONTIG]->(c:Contig)
  -[:HAS_BLAST_HIT]->(rg:ReferenceGenome)
  -[:REFERENCE_GENOME_OF]->(o:Organism)
RETURN path;

Run this in Neo4j Browser and switch the result frame to Graph view (not Table view) to see the traversal visually.

5.10 Read-set taxonomic QC (Kraken2)

Per-sample Kraken2 read-set QC: classified/unclassified read counts plus the taxa summary. taxa_json is already sorted by abundance with trace taxa folded into an "Other" bucket — eyeball it directly, or parse it with apoc.convert.fromJson if APOC is installed.

MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
RETURN s.sample_id           AS sample,
       tc.tool               AS tool,
       tc.classified_reads   AS classified,
       tc.unclassified_reads AS unclassified,
       tc.taxa_json          AS taxa
ORDER BY sample;

To expand the JSON into one row per taxon (requires APOC):

MATCH (s:Sample)-[:CLASSIFIED_IN]->(tc:TaxonomicClassification)
UNWIND apoc.convert.fromJson(tc.taxa_json) AS taxon
RETURN s.sample_id     AS sample,
       taxon.sciname   AS organism,
       taxon.rank      AS rank,
       taxon.read_count AS reads,
       taxon.abundance AS abundance
ORDER BY sample, abundance DESC;
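
Without APOC, the same expansion can be done on the client after fetching `taxa_json` as a string. A sketch, assuming `taxa_json` is a JSON array of objects carrying the `sciname`, `rank`, `read_count` and `abundance` fields used in the query above, with numeric abundances. The example payload is invented for illustration and `expand_taxa` is a hypothetical helper.

```python
import json


def expand_taxa(sample_id, taxa_json):
    """Turn one taxa_json string into a list of per-taxon row dicts."""
    rows = []
    for taxon in json.loads(taxa_json):
        rows.append({
            "sample": sample_id,
            "organism": taxon.get("sciname"),
            "rank": taxon.get("rank"),
            "reads": taxon.get("read_count"),
            "abundance": taxon.get("abundance"),
        })
    # Mirror ORDER BY abundance DESC; rows without an abundance sort last.
    rows.sort(
        key=lambda r: (r["abundance"] is not None, r["abundance"] or 0),
        reverse=True,
    )
    return rows


# Invented payload, not real pipeline output.
example = (
    '[{"sciname": "Other", "rank": null, "read_count": 100, "abundance": 0.1},'
    ' {"sciname": "Human immunodeficiency virus 1", "rank": "species",'
    ' "read_count": 900, "abundance": 0.9}]'
)
taxa_rows = expand_taxa("test_01", example)
```

Each element of `taxa_rows` corresponds to one row of the APOC query, so the result can be passed straight to a CSV writer or a dataframe constructor.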

5.11 Delete all data (reset)

Useful during development. This removes all nodes and relationships in the current database.

// Run in batches to avoid heap issues on large graphs
CALL apoc.periodic.iterate(
  "MATCH (n) RETURN n",
  "DETACH DELETE n",
  {batchSize: 1000}
)
YIELD batches, total
RETURN batches, total;

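If APOC is not installed, use this instead (slower, and it requires Neo4j 4.4 or later). CALL { … } IN TRANSACTIONS only runs in an implicit transaction, so in Neo4j Browser the statement must be prefixed with :auto:

:auto MATCH (n)
CALL { WITH n DETACH DELETE n }
IN TRANSACTIONS OF 1000 ROWS;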

6. Graph schema reference

flowchart TD
    Sample -->|HAS_ASSEMBLY| Assembly
    Assembly -->|PRODUCE| BioDataFile
    Assembly -->|ASSEMBLED_FROM| BioDataFile
    BioDataFile -->|HAS_CONTIG| Contig
    Contig -->|HAS_STANFORD_HIVDR_ALIGNMENT| StanfordHIVDRAlignment
    StanfordHIVDRAlignment -->|ALIGNED_TO| Protein
    StanfordHIVDRAlignment -->|FOUND| Mutation
    Sample -->|HAS_STANFORD_HIVDR_PREDICTION| StanfordHIVDRPrediction
    Contig -->|HAS_STANFORD_HIVDR_PREDICTION| StanfordHIVDRPrediction
    StanfordHIVDRPrediction -->|PREDICTS_RESISTANCE_TO| Drug
    Drug -->|IN_DRUG_CLASS| DrugClass
    Contig -->|HAS_BLAST_HIT| ReferenceGenome
    ReferenceGenome -->|REFERENCE_GENOME_OF| Organism

Node properties reference

| Label | Key property | Notable properties |
|---|---|---|
| Sample | sample_id | created_at |
| Assembly | assembly_id | assembler, created_at |
| BioDataFile | uri | file_type, compressed, sha256 |
| Contig | contig_id | length, coverage, is_circular, sequence_hash |
| StanfordHIVDRAlignment | result_sha256 | database_version, timestamp |
| StanfordHIVDRPrediction | prediction_id | score, level, interpretation |
| Mutation | (gene, text) | is_sdrm, primary_type, is_unusual |
| Drug | name | full_name, display_abbr |
| DrugClass | name | |
| Protein | abbreviation | |
| ReferenceGenome | accession_no | name, source_database |
| Organism | taxid | sciname |

Stanford HIVDB resistance levels

| Level | Interpretation |
|---|---|
| 1 | Susceptible |
| 2 | Potential low-level resistance |
| 3 | Low-level resistance |
| 4 | Intermediate resistance |
| 5 | High-level resistance |
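
When post-processing query results, the level table can be encoded as a lookup. A minimal sketch: `HIVDB_LEVELS` copies the table above, while `describe_level` and `is_resistant` are hypothetical helpers. They accept ints or numeric strings, on the assumption (suggested by the `toInteger(pred.level)` calls in the queries) that levels may be stored as strings.

```python
# Level -> interpretation, copied from the table above.
HIVDB_LEVELS = {
    1: "Susceptible",
    2: "Potential low-level resistance",
    3: "Low-level resistance",
    4: "Intermediate resistance",
    5: "High-level resistance",
}


def describe_level(level):
    """Return the interpretation for a level given as an int or numeric string."""
    return HIVDB_LEVELS[int(level)]


def is_resistant(level, threshold=3):
    """True when the level meets the threshold (3 matches the 5.3 query)."""
    return int(level) >= threshold
```

For example, `describe_level("4")` gives "Intermediate resistance", and `is_resistant("2")` is false because level 2 falls below the default threshold.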

Disclaimer: This project is not affiliated with, endorsed by, or sponsored by Neo4j, Inc. "Neo4j" and related trademarks are the property of Neo4j, Inc.