add integration test; fix bugs related to edges being added with nan values and split not defined edges; setup code quality with black and isort
nilskre committed Apr 5, 2024
1 parent 00b914b commit 32e9f5b
Showing 16 changed files with 1,006 additions and 151 deletions.
8 changes: 8 additions & 0 deletions .bumpversion.cfg
@@ -0,0 +1,8 @@
[bumpversion]
current_version = 0.0.1
commit = True
tag = True
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)
serialize = {major}.{minor}.{patch}

[bumpversion:file:pyproject.toml]
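For orientation, a minimal Python sketch (not part of the repository) of how the `parse`/`serialize` pair above behaves; `bump_patch` is a hypothetical helper that mimics what running `bumpversion patch` does to the version string:

```python
import re

# Same pattern as the `parse` entry above; the named groups feed `serialize`.
VERSION_PATTERN = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")

def bump_patch(version: str) -> str:
    """Hypothetical helper: increment the patch component of a version string."""
    parts = VERSION_PATTERN.match(version).groupdict()
    parts["patch"] = str(int(parts["patch"]) + 1)
    return "{major}.{minor}.{patch}".format(**parts)

print(bump_patch("0.0.1"))  # -> 0.0.2
```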
50 changes: 50 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,50 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
fail_fast: false
default_language_version:
python: python3
default_stages:
- commit
- push
minimum_pre_commit_version: 2.7.1
repos:
- repo: https://github.com/ambv/black
rev: 23.7.0
hooks:
- id: black
- repo: https://github.com/timothycrosley/isort
rev: 5.12.0
hooks:
- id: isort
additional_dependencies: [toml]
- repo: https://github.com/snok/pep585-upgrade
rev: v1.0
hooks:
- id: upgrade-type-hints
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
hooks:
- id: check-docstring-first
- id: end-of-file-fixer
- id: check-added-large-files
- id: mixed-line-ending
- id: trailing-whitespace
exclude: ^.bumpversion.cfg$
- id: check-merge-conflict
- id: check-case-conflict
- id: check-symlinks
- id: check-yaml
args: [--unsafe]
- id: check-ast
- id: fix-encoding-pragma
args: [--remove] # for Python3 codebase, it's not necessary
- id: requirements-txt-fixer
- repo: https://github.com/pre-commit/pygrep-hooks
rev: v1.10.0
hooks:
- id: python-no-eval
- id: python-use-type-annotations
- id: python-check-blanket-noqa
- id: rst-backticks
- id: rst-directive-colons
- id: rst-inline-touching-normal
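With this configuration in place, the hooks are wired into a local checkout roughly as follows (a sketch assuming `pre-commit` is installed; the Python wrapper is purely illustrative, the two CLI calls are the standard usage):

```python
import subprocess

# Install the git hooks defined in .pre-commit-config.yaml,
# then run every hook once against the whole repository.
subprocess.run(["pre-commit", "install"], check=True)
subprocess.run(["pre-commit", "run", "--all-files"], check=True)
```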
22 changes: 11 additions & 11 deletions README.md
@@ -2,7 +2,7 @@
A quick way to set up a BioCypher-driven knowledge graph pipeline.

## Using the GitHub Template functionality
You can use this template in GitHub directly. Just select
`biocypher/project-template` as your template when creating a new repository
on GitHub.

@@ -94,8 +94,8 @@ tutorial](https://biocypher.org/tutorial.html)). To do that, it uses the
following components:

- `create_knowledge_graph.py`: the main script that orchestrates the pipeline.
It brings together the BioCypher package with the data sources. To build a
knowledge graph, you need at least one adapter (see below). For common
resources, there may already be an adapter available in the BioCypher package or
in a separate repository. You can also write your own adapter, should none be
available for your data.
@@ -105,17 +105,17 @@ the adapter to the data source. In this case, it is a random generator script.
If you want to create your own adapters, we recommend using the example adapter
as a blueprint and creating one Python file per data source, appropriately named.
You can then import the adapter in `create_knowledge_graph.py` and add it to
the pipeline. This way, you ensure that others can easily install and use your
adapters.

- `schema_config.yaml`: a configuration file (found in the `config` directory)
that defines the schema of the knowledge graph. It is used by BioCypher to map
the data source to the knowledge representation on the basis of ontology (see
[this part of the BioCypher
tutorial](https://biocypher.org/tutorial-ontology.html)).

- `biocypher_config.yaml`: a configuration file (found in the `config`
directory) that defines some BioCypher parameters, such as the mode, the
separators used, and other options. More on its use can be found in the
[Documentation](https://biocypher.org/installation.html#configuration).

@@ -142,7 +142,7 @@ GitHub using the respective functions of poetry or pip.

This repo also contains a `docker compose` workflow to create the example
database using BioCypher and load it into a dockerised Neo4j instance
automatically. To run it, simply execute `docker compose up -d` in the root
directory of the project. This will start up a single (detached) docker
container with a Neo4j instance that contains the knowledge graph built by
BioCypher as the DB `neo4j` (the default DB), which you can connect to and
@@ -175,21 +175,21 @@ TODO: describe full pipeline: how to set it up
Required preprocessing of the ontologies:

## Snomed CT
The Snomed CT ontology is not provided in the formats used by BioCypher.
Thus, to get a suitable ontology file for BioCypher, the following steps are needed:
1. Download a recent Snomed CT release (e.g. from [here](https://www.nlm.nih.gov/healthit/snomedct/international.html)).
2. Use the [snomed-owl-toolkit](https://github.com/IHTSDO/snomed-owl-toolkit) to generate an OWL file from the downloaded Snomed CT release (in RF2 file format).
To do so, download the executable jar file from [here](https://github.com/IHTSDO/snomed-owl-toolkit/releases) and run
`java -Xms4g -jar snomed-owl-toolkit-3.0.6-executable.jar -rf2-to-owl -rf2-snapshot-archives <SNOMED-CT>.zip`. This generates a functional OWL file.
3. BioCypher only supports normal OWL files. To convert the functional OWL file into a normal OWL file, you can use [robot](http://robot.obolibrary.org/).
Download the executable jar file from [here](https://github.com/ontodev/robot/releases) and run
`java -Xms4g -jar robot.jar convert -i <ontology-2023-10-23_09-55-46>.owl --format owl -o ./<snomed-ct-ontology>.owl` (both conversion steps are also sketched in Python below).
4. Finally, you can place the generated OWL file in the `config/ontologies` folder with the name TODO and use it in the `biocypher_config.yaml` file.
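The two conversion steps above, wrapped as a minimal Python sketch (file names and jar versions are placeholders; Java and both jars are assumed to be available on the path):

```python
import subprocess

# Placeholders taken from the steps above; adjust to the actual release files.
SNOMED_RF2_ZIP = "SnomedCT_InternationalRF2.zip"
FUNCTIONAL_OWL = "ontology-2023-10-23_09-55-46.owl"  # written by the first call
TARGET_OWL = "config/ontologies/snomed-ct-ontology.owl"

# Step 2: RF2 release archive -> OWL in functional syntax
subprocess.run(
    ["java", "-Xms4g", "-jar", "snomed-owl-toolkit-3.0.6-executable.jar",
     "-rf2-to-owl", "-rf2-snapshot-archives", SNOMED_RF2_ZIP],
    check=True,
)

# Step 3: functional-syntax OWL -> regular OWL via robot
subprocess.run(
    ["java", "-Xms4g", "-jar", "robot.jar", "convert",
     "-i", FUNCTIONAL_OWL, "--format", "owl", "-o", TARGET_OWL],
    check=True,
)
```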

## ICD10
1. Download the ICD10 ontology file from [here](https://bioportal.bioontology.org/ontologies/ICD10CM) (select the RDF/TTL file).
2. Manual step to make the ICD ontology usable: replace `skos:prefLabel` with `rdfs:label`.
3. Add the following to the ontology file. The original ontology does not contain a single root node, so it cannot be loaded into BioCypher in one piece; the fix is to add a root node manually:
```
<Icdroot> a owl:Class;
...
```
11 changes: 6 additions & 5 deletions create_knowledge_graph.py
@@ -1,15 +1,16 @@
from biocypher import BioCypher

from patient_kg.adapters.clinical_dataset_adapter import (
ClinicalDatasetAdapter,
SnomedCTAdapterEdgeType,
SnomedCTAdapterNodeType,
)

# Instantiate the BioCypher interface
# You can use `config/biocypher_config.yaml` to configure the framework or
# supply settings via parameters below
bc = BioCypher(
#biocypher_config_path="config/biocypher_docker_config.yaml",
# biocypher_config_path="config/biocypher_docker_config.yaml",
biocypher_config_path="config/biocypher_config.yaml",
# schema_config_path="config/old_schema_config.yaml",
schema_config_path="config/generated_schema_config_for_data.yaml",
@@ -56,7 +57,7 @@
bc.write_import_call()

# Print summary
#bc.show_ontology_structure(full=True)
#bc.log_missing_input_labels()
#bc.log_duplicates()
# bc.show_ontology_structure(full=True)
# bc.log_missing_input_labels()
# bc.log_duplicates()
bc.summary()
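For context, a minimal sketch of how this entry point drives the pipeline; adapter construction and the `get_nodes`/`get_edges` calls follow the BioCypher template convention and are assumptions here, not the exact code of this repository:

```python
from biocypher import BioCypher
from patient_kg.adapters.clinical_dataset_adapter import ClinicalDatasetAdapter

bc = BioCypher(
    biocypher_config_path="config/biocypher_config.yaml",
    schema_config_path="config/generated_schema_config_for_data.yaml",
)

# Hypothetical default construction; the real adapter is configured with
# node/edge types and the dataset mapping.
adapter = ClinicalDatasetAdapter()

# Stream nodes and edges into the batch-import writer, then emit the
# neo4j-admin import call and a summary of the run.
bc.write_nodes(adapter.get_nodes())
bc.write_edges(adapter.get_edges())
bc.write_import_call()
bc.summary()
```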
77 changes: 77 additions & 0 deletions data/example_input/mapping.yaml
@@ -0,0 +1,77 @@
Nodes:
# clinical terms (Snomed CT)
Patient ID:
coding_system: snomedct
object_type: instance
id_in_coding_system: 116154003
Overall Survival (days):
coding_system: snomedct
object_type: concept
id_in_coding_system: 445320007
Clinical_Oxygen saturation in Arterial blood ; %:
coding_system: snomedct
object_type: concept
id_in_coding_system: 442476006
# Lab values (Loinc)
LAB_Eos. Granulozyten# ; /nl:
coding_system: loinc
object_type: concept
id_in_coding_system: 26449-9
# Diseases (ICD)
ICD_B95:
coding_system: icd10
object_type: concept
id_in_coding_system: B95
ICD_A02:
coding_system: icd10
object_type: concept
id_in_coding_system: A02
Cancer_C01:
coding_system: icd10
object_type: concept
id_in_coding_system: C01
# Operations and procedures (german OPS)
OPS_1-100:
coding_system: ops
object_type: concept
id_in_coding_system: 1-100
# not mapped columns
not_mapped_discrete_value:
coding_system: not_mapped_to_ontology
object_type: concept
id_in_coding_system: .nan
not_mapped_continuous_value:
coding_system: not_mapped_to_ontology
object_type: concept
id_in_coding_system: .nan

Edges:
HAS_CLINICAL_PARAMETER:
source_node: Patient ID
target_nodes: [Overall Survival (days), Clinical_Oxygen saturation in Arterial blood ; %]
properties:
value:
type: float # TODO: use float because int is also float (but is this really a good solution?)
HAS_LAB_VALUE:
source_node: Patient ID
target_nodes: [LAB_Eos. Granulozyten# ; /nl]
properties:
value:
type: float
HAS_DISEASE:
source_node: Patient ID
target_nodes: [ICD_B95, ICD_A02, Cancer_C01]
HAS_TREATMENT:
source_node: Patient ID
target_nodes: [OPS_1-100]
NOT_DEFINED_BINARY:
source_node: Patient ID
target_nodes:
- not_mapped_discrete_value
NOT_DEFINED_CONTINUOUS:
properties:
value:
type: float
source_node: Patient ID
target_nodes:
- not_mapped_continuous_value
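A short sketch (illustrative, not the adapter code) of how this mapping is consumed: every column of the input table becomes either a mapped node (coding system plus concept ID) or an edge from the patient node, optionally with typed properties:

```python
import yaml

with open("data/example_input/mapping.yaml") as f:
    mapping = yaml.safe_load(f)

for name, node in mapping["Nodes"].items():
    print(f"node {name}: {node['coding_system']} / {node['id_in_coding_system']}")

for name, edge in mapping["Edges"].items():
    # `properties` is optional, e.g. HAS_DISEASE edges carry no value property.
    props = edge.get("properties", {})
    print(f"edge {name}: {edge['source_node']} -> {edge['target_nodes']} {list(props)}")
```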
5 changes: 5 additions & 0 deletions data/example_input/mock_data.csv
@@ -0,0 +1,5 @@
Patient ID,Overall Survival (days),LAB_Eos. Granulozyten# ; /nl,ICD_B95,ICD_A02,Clinical_Oxygen saturation in Arterial blood ; %,Cancer_C01,OPS_1-100,not_mapped_discrete_value,not_mapped_continuous_value
1,150,0.11,0,1,97,0,1,1,0.1
2,164,0.12,1,1,96,0,1,0,0.0
3,"",,,,,1,,,
4,,0.14,0,0,94,,0,,
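The empty cells in this mock data are exactly what the commit's bug fix is about: missing measurements must not turn into edges. A minimal sketch of that behaviour (not the adapter's actual implementation), assuming pandas is available:

```python
import pandas as pd

df = pd.read_csv("data/example_input/mock_data.csv")

for _, row in df.iterrows():
    for column in df.columns[1:]:
        if pd.isna(row[column]):
            continue  # no edge for a missing value (the fixed behaviour)
        print(row["Patient ID"], "->", column, "=", row[column])
```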
2 changes: 1 addition & 1 deletion docker-variables.env
@@ -22,4 +22,4 @@ NEO4J_AUTH=neo4j/neo4jpassword
### DO NOT CHANGE ###

# Variable necessary to use enterprise version of dockerized Neo4j
NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
2 changes: 1 addition & 1 deletion docker/create_table.sh
@@ -1,4 +1,4 @@
sleep 15
echo "Creating database '$BC_TABLE_NAME'"
cypher-shell -u $NEO4J_USER -p $NEO4J_PASSWORD "create database $BC_TABLE_NAME;"
echo "Database created!"
50 changes: 27 additions & 23 deletions generate_schema_config_for_data.py
@@ -1,12 +1,11 @@
#!/usr/bin/env python
# coding: utf-8

import yaml

from patient_kg.adapters.node_data_classes import Node

mapping_file_path = "./data/mapping.yaml" # "./data/example_input/mapping.yaml"
with open(mapping_file_path, 'r') as yaml_file:
mapping_file_path = "./data/mapping.yaml" # "./data/example_input/mapping.yaml"
with open(mapping_file_path, "r") as yaml_file:
dataset_mapping = yaml.safe_load(yaml_file)

print(f"Number of nodes {len(dataset_mapping['Nodes'])}")
@@ -20,49 +19,54 @@
id_in_coding_system = node_config["id_in_coding_system"]
object_type = node_config["object_type"]

node = Node.create_instance(id_in_coding_system, None, {}, coding_system, object_type)
node = Node.create_instance(
id_in_coding_system, None, {}, coding_system, object_type
)
node_label = node.get_label()

if node_label == "nan":
node_label = defined_node

# Hack for directly inserted columns without mapping (which could contain ', which is not possible for neo4j)
node_label = node_label.replace("'", "")

schema_config_data[node_label] = {
"represented_as": "node",
"preferred_id": coding_system,
"input_label": node_label
"input_label": node_label,
}
# OPS has no underlying ontology

# OPS has no underlying ontology
# handle by using explicit inheritance
if (node_label == node.get_id()) and coding_system == 'ops':
if (node_label == node.get_id()) and coding_system == "ops":
schema_config_data[node_label]["is_a"] = "OPS"

# Loinc ontology not yet working
# Loinc ontology not yet working
# handle by using explicit inheritance
if (node_label == node.get_id()) and coding_system == 'loinc':
if (node_label == node.get_id()) and coding_system == "loinc":
schema_config_data[node_label]["is_a"] = "loinc"

# terms with missing mapping to ontology
# handle by using explicit inheritance
if coding_system == 'not_mapped_to_ontology':
if (
coding_system == "not_mapped_to_ontology_binary"
or coding_system == "not_mapped_to_ontology_continuous"
):
schema_config_data[node_label]["is_a"] = "notmappedtoontology"

for edge in dataset_mapping["Edges"]:
edge_config = dataset_mapping["Edges"][edge]

# TODO: check if nodes exist
#source_node = dataset_mapping["Edges"][edge]["source_node"]
#target_node = dataset_mapping["Edges"][edge]["target_node"]
# source_node = dataset_mapping["Edges"][edge]["source_node"]
# target_node = dataset_mapping["Edges"][edge]["target_node"]

# source_node_id = dataset_mapping["Nodes"][source_node]["id_in_coding_system"]
# target_node_id = dataset_mapping["Nodes"][target_node]["id_in_coding_system"]

#source_node_id = dataset_mapping["Nodes"][source_node]["id_in_coding_system"]
#target_node_id = dataset_mapping["Nodes"][target_node]["id_in_coding_system"]
# source_node_label
# target_node_label

#source_node_label
#target_node_label

schema_config_data[edge] = {
"is_a": "concept model attribute (attribute)",
"represented_as": "edge",
@@ -73,11 +77,11 @@
properties = {}
for property in edge_config["properties"]:
properties[property] = edge_config["properties"][property]["type"]
schema_config_data[edge]["properties"] = properties
schema_config_data[edge]["properties"] = properties

file_path = './config/generated_schema_config_for_data.yaml'
file_path = "./config/generated_schema_config_for_data.yaml"

with open(file_path, 'w') as yaml_file:
with open(file_path, "w") as yaml_file:
yaml.dump(schema_config_data, yaml_file, allow_unicode=True)

print(f'YAML file "{file_path}" has been created.')
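Once the script has run, the generated schema config can be sanity-checked with a few lines (a sketch, assuming the file was written to the path above):

```python
import yaml

with open("config/generated_schema_config_for_data.yaml") as f:
    schema = yaml.safe_load(f)

# Print each schema entry with how it is represented and its key mapping info.
for label, entry in schema.items():
    detail = entry.get("is_a", entry.get("preferred_id", ""))
    print(f"{label}: {entry.get('represented_as')} ({detail})")
```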