add integration test; fix bugs related to edges being added with nan values and split not defined edges; setup code quality with black and isort
nilskre committed Apr 5, 2024
1 parent 00b914b commit 32e9f5b
Showing 16 changed files with 1,006 additions and 151 deletions.
8 changes: 8 additions & 0 deletions .bumpversion.cfg
@@ -0,0 +1,8 @@
[bumpversion]
current_version = 0.0.1
commit = True
tag = True
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)
serialize = {major}.{minor}.{patch}

[bumpversion:file:pyproject.toml]
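For orientation, a minimal Python sketch (not part of the repository) of how the `parse`/`serialize` pair above behaves; `bump_patch` is a hypothetical helper that mimics what running `bumpversion patch` does to the version string:

```python
import re

# Same pattern as the `parse` entry above; the named groups feed `serialize`.
VERSION_PATTERN = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")

def bump_patch(version: str) -> str:
    """Hypothetical helper: increment the patch component of a version string."""
    parts = VERSION_PATTERN.match(version).groupdict()
    parts["patch"] = str(int(parts["patch"]) + 1)
    return "{major}.{minor}.{patch}".format(**parts)

print(bump_patch("0.0.1"))  # -> 0.0.2
```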
50 changes: 50 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,50 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
fail_fast: false
default_language_version:
python: python3
default_stages:
- commit
- push
minimum_pre_commit_version: 2.7.1
repos:
- repo: https://github.com/ambv/black
rev: 23.7.0
hooks:
- id: black
- repo: https://github.com/timothycrosley/isort
rev: 5.12.0
hooks:
- id: isort
additional_dependencies: [toml]
- repo: https://github.com/snok/pep585-upgrade
rev: v1.0
hooks:
- id: upgrade-type-hints
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
hooks:
- id: check-docstring-first
- id: end-of-file-fixer
- id: check-added-large-files
- id: mixed-line-ending
- id: trailing-whitespace
exclude: ^.bumpversion.cfg$
- id: check-merge-conflict
- id: check-case-conflict
- id: check-symlinks
- id: check-yaml
args: [--unsafe]
- id: check-ast
- id: fix-encoding-pragma
args: [--remove] # for Python3 codebase, it's not necessary
- id: requirements-txt-fixer
- repo: https://github.com/pre-commit/pygrep-hooks
rev: v1.10.0
hooks:
- id: python-no-eval
- id: python-use-type-annotations
- id: python-check-blanket-noqa
- id: rst-backticks
- id: rst-directive-colons
- id: rst-inline-touching-normal
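With this configuration in place, the hooks are wired into a local checkout roughly as follows (a sketch assuming `pre-commit` is installed; the Python wrapper is purely illustrative, the two CLI calls are the standard usage):

```python
import subprocess

# Install the git hooks defined in .pre-commit-config.yaml,
# then run every hook once against the whole repository.
subprocess.run(["pre-commit", "install"], check=True)
subprocess.run(["pre-commit", "run", "--all-files"], check=True)
```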
22 changes: 11 additions & 11 deletions README.md
@@ -2,7 +2,7 @@
A quick way to set up a BioCypher-driven knowledge graph pipeline.

## Using the GitHub Template functionality
You can use this template in GitHub directly. Just select
`biocypher/project-template` as your template when creating a new repository
on GitHub.

@@ -94,8 +94,8 @@ tutorial](https://biocypher.org/tutorial.html)). To do that, it uses the
following components:

- `create_knowledge_graph.py`: the main script that orchestrates the pipeline.
It brings together the BioCypher package with the data sources. To build a
knowledge graph, you need at least one adapter (see below). For common
resources, there may already be an adapter available in the BioCypher package or
in a separate repository. You can also write your own adapter, should none be
available for your data.
@@ -105,17 +105,17 @@ the adapter to the data source. In this case, it is a random generator script.
If you want to create your own adapters, we recommend using the example adapter
as a blueprint and creating one Python file per data source, appropriately named.
You can then import the adapter in `create_knowledge_graph.py` and add it to
the pipeline. This way, you ensure that others can easily install and use your
adapters.

- `schema_config.yaml`: a configuration file (found in the `config` directory)
that defines the schema of the knowledge graph. It is used by BioCypher to map
the data source to the knowledge representation on the basis of ontology (see
[this part of the BioCypher
tutorial](https://biocypher.org/tutorial-ontology.html)).

- `biocypher_config.yaml`: a configuration file (found in the `config`
directory) that defines some BioCypher parameters, such as the mode, the
separators used, and other options. More on its use can be found in the
[Documentation](https://biocypher.org/installation.html#configuration).

@@ -142,7 +142,7 @@ GitHub using the respective functions of poetry or pip.

This repo also contains a `docker compose` workflow to create the example
database using BioCypher and load it into a dockerised Neo4j instance
automatically. To run it, simply execute `docker compose up -d` in the root
directory of the project. This will start up a single (detached) docker
container with a Neo4j instance that contains the knowledge graph built by
BioCypher as the DB `neo4j` (the default DB), which you can connect to and
@@ -175,21 +175,21 @@ TODO: describe full pipeline: how to set it up
Required preprocessing of the ontologies:

## Snomed CT
The Snomed CT ontology is not provided in the formats used by BioCypher.
Thus, to get a suitable ontology file for BioCypher, the following steps are needed:
1. Download a recent Snomed CT release (e.g. from [here](https://www.nlm.nih.gov/healthit/snomedct/international.html)).
2. Use the [snomed-owl-toolkit](https://github.com/IHTSDO/snomed-owl-toolkit) to generate an OWL file from the downloaded Snomed CT release (in RF2 file format).
To do so, download the executable jar file from [here](https://github.com/IHTSDO/snomed-owl-toolkit/releases) and run
`java -Xms4g -jar snomed-owl-toolkit-3.0.6-executable.jar -rf2-to-owl -rf2-snapshot-archives <SNOMED-CT>.zip`. This generates a functional OWL file.
3. BioCypher only supports normal OWL files. To convert the functional OWL file into a normal OWL file, you can use [robot](http://robot.obolibrary.org/).
Download the executable jar file from [here](https://github.com/ontodev/robot/releases) and run
`java -Xms4g -jar robot.jar convert -i <ontology-2023-10-23_09-55-46>.owl --format owl -o ./<snomed-ct-ontology>.owl` (both conversion steps are also sketched in Python below).
4. Finally, you can place the generated OWL file in the `config/ontologies` folder with the name TODO and use it in the `biocypher_config.yaml` file.
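The two conversion steps above, wrapped as a minimal Python sketch (file names and jar versions are placeholders; Java and both jars are assumed to be available on the path):

```python
import subprocess

# Placeholders taken from the steps above; adjust to the actual release files.
SNOMED_RF2_ZIP = "SnomedCT_InternationalRF2.zip"
FUNCTIONAL_OWL = "ontology-2023-10-23_09-55-46.owl"  # written by the first call
TARGET_OWL = "config/ontologies/snomed-ct-ontology.owl"

# Step 2: RF2 release archive -> OWL in functional syntax
subprocess.run(
    ["java", "-Xms4g", "-jar", "snomed-owl-toolkit-3.0.6-executable.jar",
     "-rf2-to-owl", "-rf2-snapshot-archives", SNOMED_RF2_ZIP],
    check=True,
)

# Step 3: functional-syntax OWL -> regular OWL via robot
subprocess.run(
    ["java", "-Xms4g", "-jar", "robot.jar", "convert",
     "-i", FUNCTIONAL_OWL, "--format", "owl", "-o", TARGET_OWL],
    check=True,
)
```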

## ICD10
1. Download the ICD10 ontology file from [here](https://bioportal.bioontology.org/ontologies/ICD10CM) (select the RDF/TTL file).
2. Manual step to make the ICD ontology usable: replace `skos:prefLabel` with `rdfs:label`.
3. Add the following to the ontology file. The original ontology does not contain a single root node, so it cannot be loaded into BioCypher in one piece; the fix is to add a root node manually:
```
<Icdroot> a owl:Class;
...
```
11 changes: 6 additions & 5 deletions create_knowledge_graph.py
@@ -1,15 +1,16 @@
from biocypher import BioCypher

from patient_kg.adapters.clinical_dataset_adapter import (
ClinicalDatasetAdapter,
SnomedCTAdapterEdgeType,
SnomedCTAdapterNodeType,
)

# Instantiate the BioCypher interface
# You can use `config/biocypher_config.yaml` to configure the framework or
# supply settings via parameters below
bc = BioCypher(
#biocypher_config_path="config/biocypher_docker_config.yaml",
# biocypher_config_path="config/biocypher_docker_config.yaml",
biocypher_config_path="config/biocypher_config.yaml",
# schema_config_path="config/old_schema_config.yaml",
schema_config_path="config/generated_schema_config_for_data.yaml",
@@ -56,7 +57,7 @@
bc.write_import_call()

# Print summary
#bc.show_ontology_structure(full=True)
#bc.log_missing_input_labels()
#bc.log_duplicates()
# bc.show_ontology_structure(full=True)
# bc.log_missing_input_labels()
# bc.log_duplicates()
bc.summary()
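For context, a minimal sketch of how this entry point drives the pipeline; adapter construction and the `get_nodes`/`get_edges` calls follow the BioCypher template convention and are assumptions here, not the exact code of this repository:

```python
from biocypher import BioCypher
from patient_kg.adapters.clinical_dataset_adapter import ClinicalDatasetAdapter

bc = BioCypher(
    biocypher_config_path="config/biocypher_config.yaml",
    schema_config_path="config/generated_schema_config_for_data.yaml",
)

# Hypothetical default construction; the real adapter is configured with
# node/edge types and the dataset mapping.
adapter = ClinicalDatasetAdapter()

# Stream nodes and edges into the batch-import writer, then emit the
# neo4j-admin import call and a summary of the run.
bc.write_nodes(adapter.get_nodes())
bc.write_edges(adapter.get_edges())
bc.write_import_call()
bc.summary()
```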
77 changes: 77 additions & 0 deletions data/example_input/mapping.yaml
@@ -0,0 +1,77 @@
Nodes:
# clinical terms (Snomed CT)
Patient ID:
coding_system: snomedct
object_type: instance
id_in_coding_system: 116154003
Overall Survival (days):
coding_system: snomedct
object_type: concept
id_in_coding_system: 445320007
Clinical_Oxygen saturation in Arterial blood ; %:
coding_system: snomedct
object_type: concept
id_in_coding_system: 442476006
# Lab values (Loinc)
LAB_Eos. Granulozyten# ; /nl:
coding_system: loinc
object_type: concept
id_in_coding_system: 26449-9
# Diseases (ICD)
ICD_B95:
coding_system: icd10
object_type: concept
id_in_coding_system: B95
ICD_A02:
coding_system: icd10
object_type: concept
id_in_coding_system: A02
Cancer_C01:
coding_system: icd10
object_type: concept
id_in_coding_system: C01
# Operations and procedures (german OPS)
OPS_1-100:
coding_system: ops
object_type: concept
id_in_coding_system: 1-100
# not mapped columns
not_mapped_discrete_value:
coding_system: not_mapped_to_ontology
object_type: concept
id_in_coding_system: .nan
not_mapped_continuous_value:
coding_system: not_mapped_to_ontology
object_type: concept
id_in_coding_system: .nan

Edges:
HAS_CLINICAL_PARAMETER:
source_node: Patient ID
target_nodes: [Overall Survival (days), Clinical_Oxygen saturation in Arterial blood ; %]
properties:
value:
type: float # TODO: use float because int is also float (but is this really a good solution?)
HAS_LAB_VALUE:
source_node: Patient ID
target_nodes: [LAB_Eos. Granulozyten# ; /nl]
properties:
value:
type: float
HAS_DISEASE:
source_node: Patient ID
target_nodes: [ICD_B95, ICD_A02, Cancer_C01]
HAS_TREATMENT:
source_node: Patient ID
target_nodes: [OPS_1-100]
NOT_DEFINED_BINARY:
source_node: Patient ID
target_nodes:
- not_mapped_discrete_value
NOT_DEFINED_CONTINUOUS:
properties:
value:
type: float
source_node: Patient ID
target_nodes:
- not_mapped_continuous_value
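A short sketch (illustrative, not the adapter code) of how this mapping is consumed: every column of the input table becomes either a mapped node (coding system plus concept ID) or an edge from the patient node, optionally with typed properties:

```python
import yaml

with open("data/example_input/mapping.yaml") as f:
    mapping = yaml.safe_load(f)

for name, node in mapping["Nodes"].items():
    print(f"node {name}: {node['coding_system']} / {node['id_in_coding_system']}")

for name, edge in mapping["Edges"].items():
    # `properties` is optional, e.g. HAS_DISEASE edges carry no value property.
    props = edge.get("properties", {})
    print(f"edge {name}: {edge['source_node']} -> {edge['target_nodes']} {list(props)}")
```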
5 changes: 5 additions & 0 deletions data/example_input/mock_data.csv
@@ -0,0 +1,5 @@
Patient ID,Overall Survival (days),LAB_Eos. Granulozyten# ; /nl,ICD_B95,ICD_A02,Clinical_Oxygen saturation in Arterial blood ; %,Cancer_C01,OPS_1-100,not_mapped_discrete_value,not_mapped_continuous_value
1,150,0.11,0,1,97,0,1,1,0.1
2,164,0.12,1,1,96,0,1,0,0.0
3,"",,,,,1,,,
4,,0.14,0,0,94,,0,,
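The empty cells in this mock data are exactly what the commit's bug fix is about: missing measurements must not turn into edges. A minimal sketch of that behaviour (not the adapter's actual implementation), assuming pandas is available:

```python
import pandas as pd

df = pd.read_csv("data/example_input/mock_data.csv")

for _, row in df.iterrows():
    for column in df.columns[1:]:
        if pd.isna(row[column]):
            continue  # no edge for a missing value (the fixed behaviour)
        print(row["Patient ID"], "->", column, "=", row[column])
```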
2 changes: 1 addition & 1 deletion docker-variables.env
@@ -22,4 +22,4 @@ NEO4J_AUTH=neo4j/neo4jpassword
### DO NOT CHANGE ###

# Variable necessary to use enterprise version of dockerized Neo4j
NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
2 changes: 1 addition & 1 deletion docker/create_table.sh
@@ -1,4 +1,4 @@
sleep 15
echo "Creating database '$BC_TABLE_NAME'"
cypher-shell -u $NEO4J_USER -p $NEO4J_PASSWORD "create database $BC_TABLE_NAME;"
echo "Database created!"
50 changes: 27 additions & 23 deletions generate_schema_config_for_data.py
@@ -1,12 +1,11 @@
#!/usr/bin/env python
# coding: utf-8

import yaml

from patient_kg.adapters.node_data_classes import Node

mapping_file_path = "./data/mapping.yaml" # "./data/example_input/mapping.yaml"
with open(mapping_file_path, 'r') as yaml_file:
mapping_file_path = "./data/mapping.yaml" # "./data/example_input/mapping.yaml"
with open(mapping_file_path, "r") as yaml_file:
dataset_mapping = yaml.safe_load(yaml_file)

print(f"Number of nodes {len(dataset_mapping['Nodes'])}")
@@ -20,49 +19,54 @@
id_in_coding_system = node_config["id_in_coding_system"]
object_type = node_config["object_type"]

node = Node.create_instance(id_in_coding_system, None, {}, coding_system, object_type)
node = Node.create_instance(
id_in_coding_system, None, {}, coding_system, object_type
)
node_label = node.get_label()

if node_label == "nan":
node_label = defined_node

# Hack for directly inserted columns without mapping (which could contain ', which is not possible for neo4j)
node_label = node_label.replace("'", "")

schema_config_data[node_label] = {
"represented_as": "node",
"preferred_id": coding_system,
"input_label": node_label
"input_label": node_label,
}
# OPS has no underlying ontology

# OPS has no underlying ontology
# handle by using explicit inheritance
if (node_label == node.get_id()) and coding_system == 'ops':
if (node_label == node.get_id()) and coding_system == "ops":
schema_config_data[node_label]["is_a"] = "OPS"

# Loinc ontology not yet working
# Loinc ontology not yet working
# handle by using explicit inheritance
if (node_label == node.get_id()) and coding_system == 'loinc':
if (node_label == node.get_id()) and coding_system == "loinc":
schema_config_data[node_label]["is_a"] = "loinc"

# terms with missing mapping to ontology
# handle by using explicit inheritance
if coding_system == 'not_mapped_to_ontology':
if (
coding_system == "not_mapped_to_ontology_binary"
or coding_system == "not_mapped_to_ontology_continuous"
):
schema_config_data[node_label]["is_a"] = "notmappedtoontology"

for edge in dataset_mapping["Edges"]:
edge_config = dataset_mapping["Edges"][edge]

# TODO: check if nodes exist
#source_node = dataset_mapping["Edges"][edge]["source_node"]
#target_node = dataset_mapping["Edges"][edge]["target_node"]
# source_node = dataset_mapping["Edges"][edge]["source_node"]
# target_node = dataset_mapping["Edges"][edge]["target_node"]

# source_node_id = dataset_mapping["Nodes"][source_node]["id_in_coding_system"]
# target_node_id = dataset_mapping["Nodes"][target_node]["id_in_coding_system"]

#source_node_id = dataset_mapping["Nodes"][source_node]["id_in_coding_system"]
#target_node_id = dataset_mapping["Nodes"][target_node]["id_in_coding_system"]
# source_node_label
# target_node_label

#source_node_label
#target_node_label

schema_config_data[edge] = {
"is_a": "concept model attribute (attribute)",
"represented_as": "edge",
@@ -73,11 +77,11 @@
properties = {}
for property in edge_config["properties"]:
properties[property] = edge_config["properties"][property]["type"]
schema_config_data[edge]["properties"] = properties
schema_config_data[edge]["properties"] = properties

file_path = './config/generated_schema_config_for_data.yaml'
file_path = "./config/generated_schema_config_for_data.yaml"

with open(file_path, 'w') as yaml_file:
with open(file_path, "w") as yaml_file:
yaml.dump(schema_config_data, yaml_file, allow_unicode=True)

print(f'YAML file "{file_path}" has been created.')
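Once the script has run, the generated schema config can be sanity-checked with a few lines (a sketch, assuming the file was written to the path above):

```python
import yaml

with open("config/generated_schema_config_for_data.yaml") as f:
    schema = yaml.safe_load(f)

# Print each schema entry with how it is represented and its key mapping info.
for label, entry in schema.items():
    detail = entry.get("is_a", entry.get("preferred_id", ""))
    print(f"{label}: {entry.get('represented_as')} ({detail})")
```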