A quick way to set up a BioCypher-driven knowledge graph pipeline. You can use this template directly on GitHub: just select `biocypher/project-template` as your template when creating a new repository.
- Clone this repository and rename it to your project name.

  ```bash
  git clone https://github.com/biocypher/project-template.git
  mv project-template my-project
  cd my-project
  ```
- Make the repository your own.

  ```bash
  rm -rf .git
  git init
  git add .
  git commit -m "Initial commit"
  # (you can add your remote repository here)
  ```
- Install the dependencies using Poetry. (Or feel free to use your own
  dependency management system. We provide a `pyproject.toml` to define
  dependencies.)

  ```bash
  poetry install
  ```
- You are ready to go!

  ```bash
  poetry shell
  python create_knowledge_graph.py
  ```
The project template is structured as follows:
```
.
│  # Project setup
│
├── LICENSE
├── README.md
├── pyproject.toml
│
│  # Docker setup
│
├── Dockerfile
├── docker
│   ├── biocypher_entrypoint_patch.sh
│   ├── create_table.sh
│   └── import.sh
├── docker-compose.yml
├── docker-variables.env
│
│  # Project pipeline
│
├── create_knowledge_graph.py
├── config
│   ├── biocypher_config.yaml
│   ├── biocypher_docker_config.yaml
│   └── schema_config.yaml
└── template_package
    └── adapters
        └── example_adapter.py
```
The main components of the BioCypher pipeline are `create_knowledge_graph.py`, the configuration in the `config` directory, and the adapter module in the `template_package` directory. The latter can be used to publish your own adapters (see below). You can also use other adapters from anywhere on GitHub, PyPI, or your local machine.
The BioCypher ecosystem relies on a collection of adapters (planned, in development, or already available) to inform the community about available data sources and to facilitate the creation of knowledge graphs. If you think your adapter could be useful for others, please create an issue for it on the main BioCypher repository.
In addition, a Docker setup is provided to run the pipeline (from the same Python script) in a Docker container and subsequently load the knowledge graph into a Neo4j instance (also running in a Docker container). This is useful if you want to run the pipeline on a server, or if you want to run it in a reproducible environment.
Running `python create_knowledge_graph.py` will create a knowledge graph from the example data included in this repository (borrowed from the BioCypher tutorial). To do that, it uses the following components:
- `create_knowledge_graph.py`: the main script that orchestrates the pipeline.
  It brings together the BioCypher package with the data sources. To build a
  knowledge graph, you need at least one adapter (see below). For common
  resources, there may already be an adapter available in the BioCypher package
  or in a separate repository. You can also write your own adapter, should none
  be available for your data.

- `example_adapter.py` (in `template_package.adapters`): a module that defines
  the adapter to the data source. In this case, it is a random generator
  script. If you want to create your own adapters, we recommend using the
  example adapter as a blueprint and creating one Python file per data source,
  appropriately named. You can then import the adapter in
  `create_knowledge_graph.py` and add it to the pipeline (see the sketch after
  this list). This way, you ensure that others can easily install and use your
  adapters.

- `schema_config.yaml`: a configuration file (found in the `config` directory)
  that defines the schema of the knowledge graph. It is used by BioCypher to
  map the data source to the knowledge representation on the basis of an
  ontology (see this part of the BioCypher tutorial).

- `biocypher_config.yaml`: a configuration file (found in the `config`
  directory) that defines some BioCypher parameters, such as the mode, the
  separators used, and other options. More on its use can be found in the
  Documentation.
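For orientation, here is a minimal sketch of how these pieces typically fit together in `create_knowledge_graph.py`. The adapter class name, its node/edge generators, and the config paths are assumptions based on the BioCypher tutorial pattern; check the actual files in this template for the exact names.

```python
from biocypher import BioCypher

# Hypothetical import: the actual adapter class name in
# template_package/adapters/example_adapter.py may differ.
from template_package.adapters.example_adapter import ExampleAdapter

# Paths are assumptions; the template ships its configs in the config directory.
bc = BioCypher(
    biocypher_config_path="config/biocypher_config.yaml",
    schema_config_path="config/schema_config.yaml",
)

# An adapter is expected to yield nodes and edges as tuples,
# e.g. (id, label, properties) and (id, source, target, label, properties).
adapter = ExampleAdapter()

# Write the graph in the output format configured in biocypher_config.yaml
# (e.g. Neo4j admin-import CSV files).
bc.write_nodes(adapter.get_nodes())
bc.write_edges(adapter.get_edges())

# Write the neo4j-admin import call and print a summary of the run.
bc.write_import_call()
bc.summary()
```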
After adding your adapter(s) to the `adapters` directory, you may want to
publish them for easier reuse. To create a package to distribute your own
adapter(s), we recommend using Poetry. Poetry, after setup, allows you to
publish your package to PyPI with a few simple commands. To set up your
package, rename the `template_package` directory to your desired package name
and update the `pyproject.toml` file accordingly. Most importantly, update the
`name`, `authors`, and `version` fields. You can also add a `description` and a
`license`. Then, you can publish your package to PyPI using the following
commands:

```bash
poetry build
poetry publish
```
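As a rough illustration (all field values below are placeholders, not part of this template), the relevant section of a Poetry-managed `pyproject.toml` might look like this after renaming:

```toml
[tool.poetry]
name = "my-adapter-package"        # placeholder: your package name
version = "0.1.0"                  # bump this for each release
description = "BioCypher adapter(s) for my data source"  # optional
authors = ["Your Name <you@example.org>"]
license = "MIT"                    # optional; pick the license you use

[tool.poetry.dependencies]
python = "^3.9"                    # assumed constraint; adjust to your needs
biocypher = "*"                    # the template pins a concrete version
```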
If you don't want to publish your package to PyPI, you can also install it from GitHub using the respective functionality of Poetry or pip.
This repo also contains a `docker compose` workflow to create the example
database using BioCypher and load it into a dockerised Neo4j instance
automatically. To run it, simply execute `docker compose up -d` in the root
directory of the project. This will start up a single (detached) Docker
container with a Neo4j instance that contains the knowledge graph built by
BioCypher as the DB `neo4j` (the default DB), which you can connect to and
browse at localhost:7474. Authentication is deactivated by default and can be
modified in the `docker-variables.env` file (in which case you need to provide
the `.env` file to the deploy stage of the `docker-compose.yml`).
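For example, starting the workflow and watching the build might look like this (the `build` service name is taken from the three-stage setup described below):

```bash
# Start the three-stage workflow (build, import, deploy) in the background
docker compose up -d

# Optionally follow the logs of the BioCypher build stage
docker compose logs -f build

# When the deploy stage is up, browse the graph at http://localhost:7474
```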
Regarding the BioCypher build procedure, the `biocypher_docker_config.yaml` file
is used instead of `biocypher_config.yaml` (configured in `scripts/build.sh`).
Everything else is the same as in the local setup. The first container
(`build`) installs and runs the BioCypher pipeline, the second container
(`import`) installs Neo4j and runs the import, and the third container
(`deploy`) deploys the Neo4j instance on localhost. The files are shared using
a Docker Volume. This three-stage setup is not strictly necessary for mounting
a read-write instance of Neo4j, but it is required if the purpose is to provide
a read-only instance (e.g. for a web app) that is updated regularly; for an
example, see the meta graph repository. The read-only setting is configured in
the `docker-compose.yml` file
(`NEO4J_dbms_databases_default__to__read__only: "false"`) and is deactivated by
default.
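To serve the database read-only, the corresponding environment variable in `docker-compose.yml` would be flipped to `"true"`, roughly as in this minimal excerpt (the surrounding service definition is assumed):

```yaml
# Hypothetical excerpt of docker-compose.yml (deploy service only)
services:
  deploy:
    environment:
      NEO4J_dbms_databases_default__to__read__only: "true"  # "false" (read-write) by default
```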
This knowledge graph uses SNOMED CT, ICD-10, and LOINC as ontologies.

TODO: describe the full pipeline and how to set it up

Needed preprocessing of the ontologies:

The SNOMED CT ontology is not provided in a format that BioCypher can use directly. Thus, the following steps are needed to obtain a suitable ontology file for BioCypher:
- Download a recent SNOMED CT release (e.g. from here).
- Use the snomed-owl-toolkit to generate an OWL file from the downloaded
  SNOMED CT release (in RF2 file format). To do so, download the executable jar
  file from here and run

  ```bash
  java -Xms4g -jar snomed-owl-toolkit-3.0.6-executable.jar -rf2-to-owl -rf2-snapshot-archives <SNOMED-CT>.zip
  ```

  This generates an OWL file in functional syntax.
- BioCypher only supports regular OWL files. To convert the functional-syntax
  OWL file into a regular OWL file, you can use robot. Download the executable
  jar file from here and run

  ```bash
  java -Xms4g -jar robot.jar convert -i <ontology-2023-10-23_09-55-46>.owl --format owl -o ./<snomed-ct-ontology>.owl
  ```

- Finally, place the generated OWL file in the `config/ontologies` folder with
  the name TODO and use it in the `biocypher_config.yaml` file.
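To point BioCypher at the converted file, the `head_ontology` section of `biocypher_config.yaml` can be adjusted roughly as follows. The file name and root node label are placeholders (the final file name is still marked TODO above); check the BioCypher documentation for the exact configuration keys used by your BioCypher version.

```yaml
biocypher:
  head_ontology:
    # Placeholder path/name: use the file you placed in config/ontologies
    url: config/ontologies/snomed-ct-ontology.owl
    # Placeholder root node label: use the root class of your converted ontology
    root_node: SNOMED CT Concept
```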
- Download the ICD-10 ontology file from here (select the RDF/TTL file).
- Manual steps to make the ICD-10 ontology usable (a scripted version of the
  replacements is sketched after this list):
  - Replace `skos:prefLabel` with `rdfs:label`.
  - Add the following to the ontology file. The original ontology does not
    contain a single root node as part of the ontology file, so it cannot be
    loaded into BioCypher at once; the solution is to add a root node manually:

    ```ttl
    <Icdroot> a owl:Class;
        rdfs:label "Icdroot" .
    ```

  - Replace `rdfs:subClassOf <owl:Thing> ;` with `rdfs:subClassOf <Icdroot> ;`.
- Put the file in the `config/ontologies` folder with the name TODO.
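A rough sketch of the text replacements above as shell commands; the input file name `icd10.ttl` is a placeholder, and the `<Icdroot>` class block still has to be appended manually as shown above:

```bash
# Placeholder file name; use the TTL file you downloaded
ICD_FILE=icd10.ttl

# Replace skos:prefLabel with rdfs:label (GNU sed; on macOS use sed -i '')
sed -i 's/skos:prefLabel/rdfs:label/g' "$ICD_FILE"

# Re-parent top-level classes from owl:Thing to the manually added <Icdroot>
sed -i 's|rdfs:subClassOf <owl:Thing> ;|rdfs:subClassOf <Icdroot> ;|g' "$ICD_FILE"
```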
TODO