-
Notifications
You must be signed in to change notification settings - Fork 26
Home
Prebuilt versions of the KG-COVID-19 knowledge graph build from all available data are available in the following serialization formats:
- RDF N-Triples
-
KGX TSV
- See here for a description of the KGX TSV format
Previous builds are available for download here. Each build contains the following data:
-
raw
: the data ingested for this build -
transformed
: the transformed data from each source -
stats
: detailed statistics about the contents of the KG -
Jenkinsfile
: the exact commands used to generate the KG -
kg-covid-19.nt.gz
: an RDF/Ntriples version of the KG -
kg-covid-19.tar.gz
: a KGX TSV version of the KG -
kg-covid-19.jnl.gz
: the Blazegraph journal file (for loading an endpoint)
A Knowledge Graph Hub (KG Hub) is framework to download and transform data to a central location for building knowledge graphs (KGs) from different combination of data sources, in an automated, YAML-driven way. The workflow constitutes of 3 steps:
- Download data
- Transform data for each data source into two TSV files (
edges.tsv
andnodes.tsv
) as specified here - Merge the graphs for each data source of interest using KGX to produce a merged knowledge graph
To facilitate interoperability of datasets, Biolink categories are added to nodes and Biolink association types are added to edges during transformation.
A more thorough explanation of the KG-hub concept is here.
The KG-COVID-19 project is the first instantiation of such a KG Hub. Thus, KG-COVID-19 is a framework, that follows design patterns of the KG Hub, to download and transform COVID-19/SARS-COV-2 related datasets and emit a knowledge graph that can then be used for machine learning or others uses, to produce actionable knowledge.
- Here is the GitHub repo for this project.
- Here is the GitHub repo for Embiggen, an implementation of node2vec and other methods to generate embeddings and apply machine learning to graphs.
- Here is the GitHub repo for KGX, a knowledge graph exchange tool for working with graphs
- Java/JDK is required in order for the transform step to work properly. See here for instructions on installing.
- On a commodity server with 200 GB of memory, generation of the knowledge graph containing all source data requires a total of 3.7 hours (0.13 hours, 1.5 hours, and 2.1 hours for the download, transform, and merge step, respectively), with a peak memory usage of 34.4 GB and disk use of 37 GB. An estimate of the current build time on a typical server is also available here.
git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
cd kg-covid-19
python3 -m venv venv && source venv/bin/activate # optional
pip install .
python run.py download
python run.py transform
python run.py merge
We have also prepared a Jupyter notebook demonstrating how to run the pipeline to generate a KG, and also how to use other tooling such as graph sampling for generating holdouts, and graph querying.
- UniProtKB identifiers are used for genes and proteins, where possible
- For drug/compound identifiers, there is a preferred namespace. If there are datasets that provide identifiers from multiple namespaces then the choice is determined based on a descending order of preference,
-
CHEBI
>CHEMBL
>DRUGBANK
>PUBCHEM
-
- Less is more: for each data source, we ingest only the subset of data that is most relevant to the knowledge graph in question (here, it's KG-COVID-19)
- We avoid ingesting data from a source that isn't authoritative for the data in question (e.g. we do not ingest protein interaction data from a drug database)
- Each ingest should make an effort to add provenance data by adding a
provided_by
column for each node and edge in the output TSV file, populated with the source of each datum
A SPARQL endpoint for the merged knowledge graph is available here. For a better experience, consider using https://yasgui.triply.cc/ for your querying needs (for yasgui, set http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql as your SPARQL endpoint). If you are not sure where to start, here are some example SPARQL queries: https://github.com/Knowledge-Graph-Hub/kg-covid-19/tree/master/queries/sparql
A detailed, up-to-date summary of data in KG-COVID-19 is available here, with contents of the knowledge graph broken down by Biolink categories and Biolink association types for nodes and edges, respectively.
An interactive dashboard to explore these stats is available here.
- Here is a more detailed description, and instructions on how to help us with our KG-COVID-19 effort.
- Justin Reese
- Deepak Unni
- Marcin Joachimiak
- Peter Robinson
- Chris Mungall
- Tiffany Callahan
- Luca Cappelletti
- Vida Ravanmehr
We gratefully acknowledge the Elsevier Coronavirus Information Center for sharing their coronavirus pathway data, and also acknowledge and thank all COVID-19 data providers for making their data available.