KG Construction
The first step to building a knowledge graph is to design the blueprint or knowledge representation. For PheKnowLator, we consulted with a PhD-level biologist when developing our knowledge representation of the mechanisms underlying human disease. An example knowledge representation is shown in the figure below.
The knowledge graph build algorithm has been designed to run from three different stages of development: `full` (runs the full knowledge graph build, except graph closure), `partial` (runs the build algorithm up through merging ontologies and adding edge data, which excludes closure, the removal of metadata, and the creation of edge lists), and `post-closure` (searches for a closed knowledge graph `.owl` file and then performs the steps to remove OWL semantics metadata and create edge lists).
Select a Build Type:
| Build Type | Description | Use Cases |
|---|---|---|
| `full` | Runs all build steps in the algorithm. | You want to build a knowledge graph and will not use a reasoner. |
| `partial` | Runs all build steps in the algorithm through adding edges. Node metadata can always be added to a partially built knowledge graph by running the build as `post-closure`. | You want to build a knowledge graph and plan to run a reasoner over it.<br>You want to build a knowledge graph, but do not want to include node metadata, filter OWL semantics, or generate triple lists. |
| `post-closure` | Assumes that a reasoner was run over a knowledge graph and that the remaining build steps should be applied to the closed knowledge graph. The remaining build steps include determining whether OWL semantics should be filtered and creating and writing triple lists. | You have run the `partial` build, run a reasoner over it, and now want to complete the algorithm.<br>You want to use the algorithm to process metadata and OWL semantics for an externally built knowledge graph. |
Wiki Page: Dependencies

The current system uses three documents as instructions for building the knowledge graph. For detailed information on these documents, including examples, please see the Dependencies Wiki page. The primary dependency document is `resource_info.txt`.
Ontology Data
Wiki Page: Dependencies
Jupyter Notebook: Ontology_Cleaning.ipynb
All ontology data sources listed in the `ontology_source_list.txt` dependency document will be automatically downloaded. The OWLTools command-line tool is used to download all ontologies. This tool is useful because it ensures that all secondary ontologies imported by the primary ontology are also downloaded and merged.
Linked Open Data
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb
All non-ontology data sources listed in the `edge_source_list.txt` file will be automatically downloaded and pre-processed.
Ontologies are merged using the OWLTools API. Some errors only surface in the presence of other ontologies; the most common error encountered after merging ontology files is punning (i.e., the same IRI being used for more than one type of OWL entity).
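PheKnowLator performs this merge via OWLTools, but the core operation is simply parsing each ontology into a shared graph. Below is a minimal illustrative sketch using rdflib; the file names are hypothetical, and this does not replicate OWLTools' import-closure handling:

```python
from rdflib import Graph

# hypothetical local copies of two downloaded ontologies
ontology_files = ['hp.owl', 'chebi.owl']

# parse each RDF/XML ontology into a single shared graph
merged = Graph()
for path in ontology_files:
    merged.parse(path, format='xml')

merged.serialize('PheKnowLator_MergedOntologies.owl', format='xml')
```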
Wiki Page: Dependencies
Data: subclass_construction_map.pkl
New edges can be added to the knowledge graph using two different approaches: (1) instance-based, asserting a new relation between an individual data point and an instance of an ontology class; and (2) subclass-based, asserting a new relation between the subclass of an ontology class and an individual of type `owl:Class`. Please see the README (`resources/construction_approach/README.md`) for specific details regarding these approaches.
Instance-based

Data that is not part of an existing ontology is connected to an existing ontology class by creating an instance of that class via `rdf:type` and then connecting the data to that instance of the ontology class.

EXAMPLE: Adding the edge Morphine ➞ `isSubstanceThatTreats` ➞ Migraine would require adding:
- `isSubstanceThatTreats(Morphine, x1)`
- `Type(x1, Migraine)`
In this example, `Morphine` is an ontology data node from ChEBI and `Migraine` is a Human Phenotype Ontology term. This would result in the following triples, assuming that both `Morphine` and `Migraine` are existing ontology concepts:
```
UUID1 = MD5(Morphine + isSubstanceThatTreats + Migraine + "subject")
UUID2 = MD5(Morphine + isSubstanceThatTreats + Migraine + "object")

UUID1, rdf:type, Morphine
UUID1, rdf:type, owl:NamedIndividual
UUID2, rdf:type, Migraine
UUID2, rdf:type, owl:NamedIndividual
UUID1, isSubstanceThatTreats, UUID2
```
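This pattern can be generated mechanically. Below is a minimal illustrative sketch using rdflib and hashlib that mirrors the MD5-based UUID scheme shown above; the `pkt` namespace usage and the function name are ours, not the package's API:

```python
import hashlib
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDF

PKT = Namespace('https://github.com/callahantiff/PheKnowLator/pkt/')

def add_instance_edge(graph: Graph, subj: URIRef, pred: URIRef, obj: URIRef) -> Graph:
    """Connect two ontology classes through hashed owl:NamedIndividual nodes."""
    key = str(subj) + str(pred) + str(obj)
    uuid1 = PKT[hashlib.md5((key + 'subject').encode()).hexdigest()]
    uuid2 = PKT[hashlib.md5((key + 'object').encode()).hexdigest()]
    graph.add((uuid1, RDF.type, subj))                 # UUID1, rdf:type, Morphine
    graph.add((uuid1, RDF.type, OWL.NamedIndividual))
    graph.add((uuid2, RDF.type, obj))                  # UUID2, rdf:type, Migraine
    graph.add((uuid2, RDF.type, OWL.NamedIndividual))
    graph.add((uuid1, pred, uuid2))                    # UUID1, isSubstanceThatTreats, UUID2
    return graph
```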
Subclass-based

Data that is not part of an existing ontology is connected to an existing ontology class via `rdfs:subClassOf`. This method allows the newly added data to have `rdf:type` `owl:Class`.
EXAMPLE: Adding the edge TGFB1 ➞ `participatesIn` ➞ Influenza Virus Induced Apoptosis would require adding:

- `participatesIn(TGFB1, Influenza Virus Induced Apoptosis)`
- `subClassOf(Influenza Virus Induced Apoptosis, Influenza A Pathway)`
- `Type(Influenza Virus Induced Apoptosis, owl:Class)`
Here, `TGFB1` is a Protein Ontology term and `Influenza Virus Induced Apoptosis` is a non-ontology data node from Reactome. In this example, `Influenza A Pathway` is an existing Pathway Ontology class. This would result in the following triples, assuming that `TGFB1` is an existing ontology concept:
```
UUID1 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis)
UUID2 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis + owl:Restriction)

Influenza Virus Induced Apoptosis, rdfs:subClassOf, Influenza A Pathway
Influenza Virus Induced Apoptosis, rdf:type, owl:Class
UUID1, rdfs:subClassOf, TGFB1
UUID1, rdfs:subClassOf, UUID2
UUID2, rdf:type, owl:Restriction
UUID2, owl:someValuesFrom, Influenza Virus Induced Apoptosis
UUID2, owl:onProperty, participatesIn
```
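The subclass pattern can be sketched the same way. Again, this is illustrative only; it mirrors the triples above, including the anonymous `owl:Restriction` node used to reify the relation:

```python
import hashlib
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

PKT = Namespace('https://github.com/callahantiff/PheKnowLator/pkt/')

def add_subclass_edge(graph: Graph, subj: URIRef, pred: URIRef,
                      obj: URIRef, parent_cls: URIRef) -> Graph:
    """Type the new node as owl:Class and link it to `subj` via a restriction."""
    key = str(subj) + str(pred) + str(obj)
    uuid1 = PKT[hashlib.md5(key.encode()).hexdigest()]
    uuid2 = PKT[hashlib.md5((key + str(OWL.Restriction)).encode()).hexdigest()]
    graph.add((obj, RDFS.subClassOf, parent_cls))   # e.g., Influenza A Pathway
    graph.add((obj, RDF.type, OWL.Class))
    graph.add((uuid1, RDFS.subClassOf, subj))
    graph.add((uuid1, RDFS.subClassOf, uuid2))
    graph.add((uuid2, RDF.type, OWL.Restriction))
    graph.add((uuid2, OWL.someValuesFrom, obj))
    graph.add((uuid2, OWL.onProperty, pred))
    return graph
```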
A table is provided below showing the different triples that are added as a function of edge type (i.e., `class`-`class` vs. `class`-`instance` vs. `instance`-`instance`) and relation strategy (i.e., relations only or relations + inverse relations).
Relations

Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb
PheKnowLator can be built using a single set of provided relations, with or without the inclusion of each relation's inverse, by leveraging the `owl:inverseOf` property. For example:

- `location of` `owl:inverseOf` `located in`
- `located in` `owl:inverseOf` `location of`
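A minimal sketch of how inverse relations can be declared and materialized with rdflib is shown below; the Relation Ontology identifiers are assumptions made for illustration:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()

# assumed RO identifiers: RO_0001015 ("location of"), RO_0001025 ("located in")
location_of = URIRef('http://purl.obolibrary.org/obo/RO_0001015')
located_in = URIRef('http://purl.obolibrary.org/obo/RO_0001025')

# declare the relations as inverses of each other
g.add((location_of, OWL.inverseOf, located_in))
g.add((located_in, OWL.inverseOf, location_of))

# materialize the inverse edge for every asserted "located in" edge
for s, o in list(g.subject_objects(predicate=located_in)):
    g.add((o, location_of, s))
```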
Entity Metadata
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb
Before building a knowledge graph, one may need to prepare files needed to create mappings between identifiers and/or to filter input edge data sources. The Jupyter Notebook referenced above provides several detailed examples of how these data were created for the knowledge graphs available for the `v2.0.0` build.
The knowledge graph can be built with or without the inclusion of instance entity metadata (i.e. labels, descriptions or definitions, and synonyms).
```python
{
  'nodes': {
    'http://www.ncbi.nlm.nih.gov/gene/1': {
      'Label': 'A1BG',
      'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
      'Synonym': 'HYST2477alpha-1B-glycoprotein|HEL-S-163pA|ABG|A1B|GAB'}, ... },
  'relations': {
    'http://purl.obolibrary.org/obo/RO_0002533': {
      'Label': 'sequence atomic unit',
      'Description': 'Any individual unit of a collection of like units arranged in a linear order',
      'Synonym': 'None'}, ... }
}
```
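When metadata is included, entries like those above are attached to the graph as annotation assertions. The sketch below shows one way to do this with rdflib, using `rdfs:label` for the `Label` field; the property choice and function name are illustrative, not necessarily what the build itself uses:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS

def add_node_labels(graph: Graph, metadata: dict) -> None:
    """Attach each node's 'Label' entry as an rdfs:label annotation."""
    for uri, meta in metadata.get('nodes', {}).items():
        if meta.get('Label'):
            graph.add((URIRef(uri), RDFS.label, Literal(meta['Label'])))
```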
Wiki Page: Dependencies
The knowledge graph can be built with or without the inclusion of edges that contain OWL semantics. For information on how OWL-encoded classes and triples are filtered, please see the OWL-NETS 2.0 wiki.
We provide several different types of output, each of which is described briefly below. Please note that in order to create the logic (`XXXX_OWL_LogicOnly.nt`) and annotation (`XXXX_OWL_AnnotationsOnly.nt`) subsets of each graph, and to be able to combine them (`XXXX_OWL.nt`), we have added a namespace to all `BNode` or anonymous nodes. More specifically, there are two kinds of `pkt` namespaces you will find within these files:

- `https://github.com/callahantiff/PheKnowLator/pkt/`: used for all `owl:Class` and `owl:NamedIndividual` objects defined from non-ontology data, which are added in order to integrate non-ontological entities (see here for more information).
- `https://github.com/callahantiff/PheKnowLator/pkt/bnode/`: used for all existing `BNode` or anonymous nodes; this namespace is applied to these types of entities prior to subsetting an input graph.

To remove the second type of namespacing from `BNode`s that are part of the original ontologies used in each build, you can run the code shown below:
```python
from pkt.utils import removes_namespace_from_bnodes

# `org_graph` is assumed to be an rdflib Graph loaded from a build's output

# remove bnode namespaces
updated_graph = removes_namespace_from_bnodes(org_graph)
```
Please also note that for all builds prior to `v3.0.2`, there are 2,008 nodes in the `NodeLabels.txt` files that contain foreign characters. While there is now code in place to prevent this error from happening in the future, there is also a solution to account for the prior builds. The `bad_node_patch.json` file contains a dictionary where the outer keys are the `entity_uri` and the outer values are another dictionary where the inner keys are `label` and `description/definition` and the inner values are the updated strings without foreign characters. An example of this dictionary is shown below:
```python
key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'
print(bad_node_patch[key])
# {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
```
The code to identify the nodes with erroneous foreign characters is shown below:
```python
import re
import pandas as pd

# path to the downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as a Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# flag labels containing CJK characters, then keep only the flagged rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search(u"[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()
```
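Once the bad rows are identified, the `bad_node_patch.json` dictionary can be used to overwrite them. A minimal sketch; the `entity_uri` column name is an assumption chosen to match the patch's outer keys:

```python
import json
import pandas as pd

with open('bad_node_patch.json') as f:
    bad_node_patch = json.load(f)

nodedf = pd.read_csv('NodeLabels.txt', sep='\t', header=0)

# replace the label and description/definition for every patched entity
for uri, fixes in bad_node_patch.items():
    mask = nodedf['entity_uri'] == uri
    nodedf.loc[mask, 'label'] = fixes['label']
    nodedf.loc[mask, 'description/definition'] = fixes['description/definition']
```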
| File | Description |
|---|---|
| `PheKnowLator_MergedOntologies.owl` | This RDF/XML formatted file only contains the baseline set of cleaned merged ontologies. |
OWL Builds

The OWL builds store the complete expressive graph for either the subclass or instance builds.

| File | Description |
|---|---|
| `XXXX_OWL_LogicOnly.nt` | This N-Triples formatted file contains the logical axioms for the baseline set of cleaned merged ontologies and all non-ontology edges. It does not contain any annotation assertions (i.e., metadata like labels, definitions, and synonyms). This file contains the minimum logical subset needed to run a deductive logic reasoner. |
| `XXXX_OWL_AnnotationsOnly.nt` | This N-Triples formatted file contains annotation assertions (i.e., metadata like labels, definitions, and synonyms) for the baseline set of cleaned merged ontologies and all non-ontology edges. |
| `XXXX_OWL.nt` | This N-Triples formatted file contains the baseline set of cleaned merged ontologies and all non-ontology edges. It contains the minimum logical subset (`XXXX_OWL_LogicOnly.nt`) and all annotation assertions (`XXXX_OWL_AnnotationsOnly.nt`). This file contains all OWL semantics needed to run a deductive logic reasoner. |
| `XXXX_OWL_NetworkxMultiDiGraph.gpickle` | This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWL.nt` file. Note that this representation includes keys for nodes and edges (node: `key = URI`; edge: `predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0. Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details. |
| `XXXX_OWL_Triples_Identifiers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI. |
| `XXXX_OWL_Triples_Integers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). The primary difference between this file and the `XXXX_OWL_Triples_Identifiers.txt` file is that the identifier URIs have been mapped to integers. |
| `XXXX_OWL_Triples_Integer_Identifier_Map.json` | This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWL_Triples_Identifiers.txt` file to the `XXXX_OWL_Triples_Integers.txt` file. |
| `XXXX_OWL_NodeLabels.txt` | This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, `.gpickle`, and `XXXX_OWL_Triples_Identifiers.txt` files, with columns including `entity_type`, `entity_uri`, `label`, `description/definition`, and `synonym`. NOTE: there will be entries in this file that contain values of "NA" for the `entity_type` column. This is expected for these types of builds; a value of "NA" is used for all nodes and relations that are not an `owl:Class`, `owl:NamedIndividual`, `owl:ObjectProperty`, or `owl:AnnotationProperty`. |
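To inspect a `.gpickle` output, the graph can be loaded back into NetworkX. A minimal sketch with a placeholder file name; note that `nx.read_gpickle` was removed in NetworkX 3.0, so a plain `pickle` fallback is included:

```python
import pickle
import networkx as nx

path = 'XXXX_OWL_NetworkxMultiDiGraph.gpickle'  # placeholder name

if hasattr(nx, 'read_gpickle'):   # NetworkX < 3.0
    graph = nx.read_gpickle(path)
else:                             # NetworkX >= 3.0: gpickle files are plain pickles
    with open(path, 'rb') as f:
        graph = pickle.load(f)

print(graph.number_of_nodes(), graph.number_of_edges())
```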
OWL-NETS Builds

The OWL-NETS files have undergone a transformation that decodes all OWL semantics in order to create a graph that contains only biologically relevant nodes and edges, and is much more useful for inductive types of machine learning. For more information on this transformation, see: https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0

| File | Description |
|---|---|
| `XXXX_OWLNETS.nt` | This N-Triples formatted file contains the OWL-NETS transformed build. |
| `XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle` | This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWLNETS.nt` file. Note that this representation includes keys for nodes and edges (node: `key = URI`; edge: `predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0. Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details. |
| `XXXX_OWLNETS_decoding_dict.pkl` | This dictionary stores details about the OWL-NETS transformation. Specifically, it contains metadata that can be used to reverse the transformation. |
| `XXXX_OWLNETS_Triples_Identifiers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI. |
| `XXXX_OWLNETS_Triples_Integers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). The primary difference between this file and the `XXXX_OWLNETS_Triples_Identifiers.txt` file is that the identifier URIs have been mapped to integers. |
| `XXXX_OWLNETS_Triples_Integer_Identifier_Map.json` | This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWLNETS_Triples_Identifiers.txt` file to the `XXXX_OWLNETS_Triples_Integers.txt` file. |
| `XXXX_OWLNETS_NodeLabels.txt` | This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, `.gpickle`, and `XXXX_OWLNETS_Triples_Identifiers.txt` files, with columns including `entity_type`, `entity_uri`, `label`, `description/definition`, and `synonym`. |
Purified OWL-NETS Builds

The purified version of an OWL-NETS build converts the base OWL-NETS build into a version that is completely consistent with a specific construction approach. For example, if the build is instance-based, then all `rdfs:subClassOf` relations are converted to `rdf:type`, and for all triples where an `rdfs:subClassOf` relation occurred we add `rdf:type` relations between the object of this triple and all of its ancestors. For a subclass-based build, we implement the same procedure but replace all occurrences of `rdf:type` with `rdfs:subClassOf`. Please note that these build types are considered experimental, as we are still in the process of fully testing them.

| File | Description |
|---|---|
| `XXXX_OWLNETS_XXXX_purified_OWLNETS.nt` | This N-Triples formatted file contains the purified OWL-NETS transformed build. |
| `XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle` | This file is a NetworkX MultiDiGraph representation of the same content that is stored in the `XXXX_OWLNETS_XXXX_purified_OWLNETS.nt` file. Note that this representation includes keys for nodes and edges (node: `key = URI`; edge: `predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")`). Each edge also has a default weight of 0.0. Encoded file format; no preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details. |
| `XXXX_OWLNETS_XXXX_purified_decoding_dict.pkl` | This dictionary stores details about the purified OWL-NETS transformation. Specifically, it contains metadata that can be used to reverse the transformation. |
| `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object), where each identifier is the full resolvable URI. |
| `XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt` | This tab-delimited text file contains the same information as the `.nt` and `.gpickle` files, but is organized into a common format used by many graph representation learning algorithms. The file contains three columns, one for each part of a triple (i.e., subject, predicate, object). The primary difference between this file and the `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` file is that the identifier URIs have been mapped to integers. |
| `XXXX_OWLNETS_XXXX_purified_Triples_Integer_Identifier_Map.json` | This JSON file contains a dictionary where the keys are node identifiers and the values are integers. It stores the conversion from the `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` file to the `XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt` file. |
| `XXXX_OWLNETS_XXXX_purified_NodeLabels.txt` | This tab-delimited `.txt` file contains metadata on all nodes and relations in the N-Triples, `.gpickle`, and `XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt` files, with columns including `entity_type`, `entity_uri`, `label`, `description/definition`, and `synonym`. |
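To recover URIs from any of the integer-encoded triple files, the accompanying JSON map can be inverted. A minimal sketch, assuming the tab-delimited file has a header row; the `XXXX_*` names are placeholders following the build's naming convention:

```python
import json
import pandas as pd

# placeholder file names following the build's XXXX_* convention
triples = pd.read_csv('XXXX_OWLNETS_Triples_Integers.txt', sep='\t', header=0,
                      names=['subject', 'predicate', 'object'])
with open('XXXX_OWLNETS_Triples_Integer_Identifier_Map.json') as f:
    id_map = json.load(f)  # identifier URI -> integer

# invert the map and decode each column of integer triples back to URIs
int_to_uri = {v: k for k, v in id_map.items()}
decoded = triples.apply(lambda col: col.map(int_to_uri))
```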