
August 01, 2021

Tiffany J. Callahan edited this page Dec 10, 2023 · 13 revisions

PKT Human Disease Knowledge Graph Benchmark Builds (v2.1.0)

Build Date: August 01, 2021

The Human Disease Mechanism KG benchmarks were originally built and stored using Google Cloud Platform (GCP) resources (for details and a complete description of this process, see here). As of late 2023, the KG builds have been moved to Zenodo. The original GCP resources contained all associated files (i.e., all data used and processed to create the KGs). Because of the file size upload limits on each archive, the archived data is limited to the KG output, associated metadata, and log files. The resources used to build each KG, including their URLs and download dates, are listed in the associated logs. Details on how to access these files are provided below.


Resources


KG Benchmark Builds can also be obtained from Zenodo


Build Data


🗂 For additional information on the KG file types please see the following Wiki page, which is also available as a download (here). See data_to_download.txt for a complete list of all downloaded build data.


Required Input Documents
See here for a detailed description of these resources.

  • resource_info.txt
  • edge_source_list.txt
  • ontology_source_list.txt

Required Curated Data
Curated data sources were manually created and designed to support the build. See the Data_Preparation.ipynb for a detailed description of these resources.

  • genomic_sequence_ontology_mappings.xlsx
  • genomic_typing_dict.pkl
  • zooma_tissue_cell_mapping_04JAN2020.xlsx

Build Metadata
The metadata documentation provides details on each downloaded resource, including the URL, date of download, and file size. The resources listed in these documents should align with the similarly named files listed in the Required Input Documents section.

  • downloaded_build_metadata.txt
  • edge_source_metadata.txt
  • preprocessed_build_metadata.txt
  • ontology_source_metadata.txt
  • ontology_cleaning_report.txt

Build Logs
The build logs provide detailed information on each step of the build process as well as statistics on the resulting KG builds.

  • pkt_builder_phases12_log.log
  • pkt_build_log.log

Knowledge Graph Output


Instance-based Build

Standard Relations (OWL):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL.nt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_AnnotationsOnly.nt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_LogicOnly.nt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWL_Triples_Integers.txt

Standard Relations (OWL-NETS):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS.nt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_Triples_Integers.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_decoding_dict.pkl

PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified.nt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_Triples_Integers.txt
PheKnowLator_v2.1.0_full_instance_relationsOnly_OWLNETS_INSTANCE_purified_decoding_dict.pkl

Inverse Relations (OWL):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL.nt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_AnnotationsOnly.nt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_LogicOnly.nt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWL_Triples_Integers.txt

Inverse Relations (OWL-NETS):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS.nt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_Triples_Integers.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_decoding_dict.pkl

PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified.nt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_NodeLabels.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_Triples_Integers.txt
PheKnowLator_v2.1.0_full_instance_inverseRelations_OWLNETS_INSTANCE_purified_decoding_dict.pkl

Class-based Build

Standard Relations (OWL):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL.nt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_AnnotationsOnly.nt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_LogicOnly.nt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWL_Triples_Integers.txt

Standard Relations (OWL-NETS):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS.nt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_Triples_Integers.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_decoding_dict.pkl

PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified.nt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_Triples_Integers.txt
PheKnowLator_v2.1.0_full_subclass_relationsOnly_OWLNETS_SUBCLASS_purified_decoding_dict.pkl

Inverse Relations (OWL):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL.nt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_AnnotationsOnly.nt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_LogicOnly.nt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWL_Triples_Integers.txt

Inverse Relations (OWL-NETS):
Master_Edge_List_Dict.json
node_metadata_dict.pkl
subclass_map_log.json

PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS.nt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_Triples_Integers.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_decoding_dict.pkl

PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified.nt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_NetworkxMultiDiGraph.gpickle
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_NodeLabels.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_Triples_Identifiers.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_Triples_Integer_Identifier_Map.json
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_Triples_Integers.txt
PheKnowLator_v2.1.0_full_subclass_inverseRelations_OWLNETS_SUBCLASS_purified_decoding_dict.pkl
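
The file names listed above appear to follow a regular pattern (build type, relation strategy, OWL or OWL-NETS, optional purification, then a file-specific suffix). The following is a minimal sketch that assembles a file name from those parts. The helper name `kg_file_name` and its parameters are hypothetical and are not part of PheKnowLator; the pattern is inferred only from the listing on this page. Note that the Class-based Build uses `subclass` in its file names.

```python
def kg_file_name(build, relations, owl_type, purified=False,
                 suffix=".nt", version="v2.1.0"):
    """Assemble a KG output file name (hypothetical helper, pattern inferred from this page)."""
    # Purified file names repeat the build type in upper case.
    purified_tags = {"instance": "INSTANCE", "subclass": "SUBCLASS"}
    if build not in purified_tags:
        raise ValueError("build must be 'instance' or 'subclass'")
    if relations not in ("relationsOnly", "inverseRelations"):
        raise ValueError("relations must be 'relationsOnly' or 'inverseRelations'")
    if owl_type not in ("OWL", "OWLNETS"):
        raise ValueError("owl_type must be 'OWL' or 'OWLNETS'")
    if purified and owl_type != "OWLNETS":
        # Only OWL-NETS files are listed with a purified variant.
        raise ValueError("purified files are only listed for OWLNETS")

    stem = f"PheKnowLator_{version}_full_{build}_{relations}_{owl_type}"
    if purified:
        stem += f"_{purified_tags[build]}_purified"
    return stem + suffix


print(kg_file_name("instance", "relationsOnly", "OWL"))
print(kg_file_name("subclass", "inverseRelations", "OWLNETS",
                   purified=True, suffix="_NodeLabels.txt"))
```

The sketch does not check that a given suffix exists for a given build (for example, `_LogicOnly.nt` is only listed for OWL files and `_decoding_dict.pkl` only for OWL-NETS files), so the result should be checked against the listing.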





Important Build Updates


We provide several different types of output, each of which is described briefly below. Please note that in order to create the logic (XXXX_OWL_LogicOnly.nt) and annotation (XXXX_OWL_AnnotationsOnly.nt) subsets of each graph and be able to combine them (XXXX_OWL.nt) we have added a namespace to all BNode or anonymous nodes. More specifically, there are two kinds of pkt namespaces you will find within these files:

  1. https://github.com/callahantiff/PheKnowLator/pkt/. This namespace is used for all non-ontology data defined owl:Class and owl:NamedIndividual objects that are added in order to integrate non-ontological entities (see here for more information).
  2. https://github.com/callahantiff/PheKnowLator/pkt/bnode/. This namespace is used for all existing BNode or anonymous nodes and is applied to these types of entities prior to subsetting an input graph.
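
To tell the two namespaces apart when scanning a triples file, a simple prefix check is usually enough. The sketch below is illustrative only: the function name `classify_pkt_uri` and the two example pkt URIs are hypothetical, and only the two namespace strings come from this page. Because the bnode namespace begins with the general pkt namespace, the longer prefix has to be tested first.

```python
PKT_NS = "https://github.com/callahantiff/PheKnowLator/pkt/"
PKT_BNODE_NS = PKT_NS + "bnode/"


def classify_pkt_uri(uri):
    """Return 'pkt_bnode', 'pkt_entity', or 'other' for a URI string."""
    # Tolerate N-Triples style angle brackets around the URI.
    cleaned = uri.strip().strip("<>")
    # Test the bnode namespace first: it also starts with PKT_NS.
    if cleaned.startswith(PKT_BNODE_NS):
        return "pkt_bnode"
    if cleaned.startswith(PKT_NS):
        return "pkt_entity"
    return "other"


examples = [
    "<https://github.com/callahantiff/PheKnowLator/pkt/bnode/N0a1b2c>",   # hypothetical
    "<https://github.com/callahantiff/PheKnowLator/pkt/example_entity>",  # hypothetical
    "<http://purl.obolibrary.org/obo/UBERON_0000468>",
]
for uri in examples:
    print(classify_pkt_uri(uri), uri)
```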

To remove the second type of namespacing from BNodes that are part of the original ontologies used in each build, you can run the code shown below:

from pkt.utils import removes_namespace_from_bnodes

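# remove bnode namespaces; org_graph is assumed to be a graph you have already
# loaded (e.g., from one of the XXXX_OWL.nt files)
updated_graph = removes_namespace_from_bnodes(org_graph)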

Please also note that for all builds prior to v3.0.2, 2,008 nodes in the NodeLabels.txt files contain foreign characters (CJK ideographs in the Unicode range U+4E00 to U+9FFF, per the detection code below). While there is now code in place to prevent this error in future builds, the bad_node_patch.json file provides a fix for prior builds. This file contains a dictionary keyed by entity_uri. Each value is an inner dictionary with the keys label and description/definition, whose values are the updated strings without foreign characters. An example of this dictionary is shown below:

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}

The code to identify the nodes with erroneous foreign characters is shown below:

import re
import pandas as pd

# path to downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

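# flag labels that contain characters in the range \u4e00-\u9FFF; a boolean flag
# (rather than a match object) lets drop_duplicates compare rows correctly
nodedf['bad'] = nodedf['label'].apply(lambda x: bool(re.search("[\u4e00-\u9FFF]", x)) if not pd.isna(x) else False)
# filter DataFrame so it only contains the flagged rows
nodedf_bad_nodes = nodedf[nodedf['bad']].drop_duplicates()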
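
Once the affected nodes are identified, the patch dictionary described above can be used to overwrite their labels and descriptions. The following is a minimal sketch using plain dictionaries. The function name `apply_label_patch` and the sample rows are hypothetical, and it assumes that each row's `entity_uri` is written exactly like the patch keys (including the angle brackets). In practice `bad_node_patch` would be loaded from bad_node_patch.json with `json.load`.

```python
def apply_label_patch(rows, patch):
    """Return copies of rows with patched 'label' and 'description/definition' values."""
    patched = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left unchanged
        replacement = patch.get(row.get("entity_uri"))
        if replacement:
            for field in ("label", "description/definition"):
                if field in replacement:
                    fixed[field] = replacement[field]
        patched.append(fixed)
    return patched


# Stand-in for the loaded bad_node_patch.json; entry copied from the example above.
bad_node_patch = {
    "<http://purl.obolibrary.org/obo/UBERON_0000468>": {
        "label": "multicellular organism",
        "description/definition": "Anatomical structure that is an individual member "
                                  "of a species and consists of more than one cell.",
    }
}

# Hypothetical rows standing in for records read from a NodeLabels.txt file.
rows = [
    {"entity_uri": "<http://purl.obolibrary.org/obo/UBERON_0000468>",
     "label": "\u591a\u7ec6\u80de\u751f\u7269",
     "description/definition": "placeholder text"},
    {"entity_uri": "<http://example.org/unaffected>",
     "label": "unaffected node",
     "description/definition": "left as-is"},
]

patched_rows = apply_label_patch(rows, bad_node_patch)
print(patched_rows[0]["label"])
```

Rows whose `entity_uri` is not in the patch are returned unchanged, so the same call can be run over a full table of node labels.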



