
KG Construction

Tiffany J. Callahan edited this page Oct 24, 2023 · 76 revisions

Knowledge Graph Construction


The primary steps involved in this process include:



Create Knowledge Representation


The first step to building a knowledge graph is to design the blueprint or knowledge representation. For PheKnowLator, we consulted with a PhD-level biologist when developing our knowledge representation of the mechanisms underlying human disease. An example knowledge representation is shown in the figure below.






Build Knowledge Graph


The knowledge graph build algorithm can be run from three different stages of development: full (runs the full knowledge graph build, except graph closure), partial (runs the build algorithm up through merging ontologies and adding edge data, which excludes closure, the removal of metadata, and the creation of edge lists), and post-closure (searches for a closed knowledge graph .owl file and then performs the steps to remove OWL semantics metadata and create edge lists).

Select a Build Type:

Build Type: full
  Description: Runs all build steps in the algorithm.
  Use Cases: You want to build a knowledge graph and will not use a reasoner.

Build Type: partial
  Description: Runs all of the build steps in the algorithm through adding edges. Node metadata can always be added to a partially built knowledge graph by running the build as post-closure.
  Use Cases:
  • You want to build a knowledge graph and plan to run a reasoner over it.
  • You want to build a knowledge graph, but do not want to include node metadata, filter OWL semantics, or generate triple lists.

Build Type: post-closure
  Description: Assumes that a reasoner was run over a knowledge graph and that the remaining build steps should be applied to a closed knowledge graph. The remaining build steps include determining whether OWL semantics should be filtered and creating and writing triple lists.
  Use Cases:
  • You have run the partial build, run a reasoner over it, and now want to complete the algorithm.
  • You want to use the algorithm to process metadata and OWL semantics for an externally built knowledge graph.
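The three build types can also be read as ordered lists of steps. The sketch below only paraphrases the table above; the step names and the `steps_for` helper are hypothetical and are not part of the pkt library.

```python
from typing import Dict, List

# Hypothetical step names paraphrasing the build-type table; not the pkt API.
BUILD_STEPS: Dict[str, List[str]] = {
    "full": [
        "merge_ontologies",
        "add_edge_data",
        "process_metadata",
        "filter_owl_semantics",
        "write_triple_lists",
    ],
    # a partial build stops after edges are added, so a reasoner can be run next
    "partial": ["merge_ontologies", "add_edge_data"],
    # a post-closure build starts from an already closed .owl file
    "post-closure": [
        "load_closed_graph",
        "process_metadata",
        "filter_owl_semantics",
        "write_triple_lists",
    ],
}


def steps_for(build_type: str) -> List[str]:
    """Return the ordered (hypothetical) steps for a build type."""
    if build_type not in BUILD_STEPS:
        raise ValueError(f"unknown build type: {build_type!r}")
    return list(BUILD_STEPS[build_type])


print(steps_for("partial"))
```

Note that none of the three lists contains a reasoning step: closure is always run outside the build, between a partial and a post-closure run.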




STEP 1: Prepare Input Dependency Documents

Wiki Page: Dependencies

The current system uses three documents as instructions for building the knowledge graph. For detailed information on these documents, including examples, please see the Dependencies Wiki page. The primary dependency document is resource_info.txt.



STEP 2: Download and process Input Data

Ontology Data
Wiki Page: Dependencies
Jupyter Notebook: Ontology_Cleaning.ipynb

All ontology data sources listed in the ontology_source_list.txt dependency document will be automatically downloaded. The OWLTools command-line tool is used to download all ontologies. This tool is useful because it ensures that all secondary ontologies imported by the primary ontology are also downloaded and merged.
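As a rough sketch of what such a download could look like, the snippet below builds an OWLTools command line and hands it to `subprocess`. The OWLTools path, the output file name, and both helper names are placeholders. The `--merge-import-closure` and `-o` options are assumed from the OWLTools command-line interface and should be checked against the installed version.

```python
import subprocess
from typing import List


def owltools_command(owltools_path: str, ontology_url: str, output_file: str) -> List[str]:
    """Build an (assumed) OWLTools call that downloads an ontology and merges its imports."""
    return [owltools_path, ontology_url, "--merge-import-closure", "-o", output_file]


def download_ontology(owltools_path: str, ontology_url: str, output_file: str) -> None:
    """Run the command; raises CalledProcessError if OWLTools exits with an error."""
    subprocess.check_call(owltools_command(owltools_path, ontology_url, output_file))


# placeholder paths, shown only to illustrate the shape of the command
cmd = owltools_command("./owltools", "http://purl.obolibrary.org/obo/hp.owl", "hp_with_imports.owl")
print(" ".join(cmd))
```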

Linked Open Data
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb

All non-ontology data sources listed in the edge_source_list.txt file will be automatically downloaded and pre-processed.



STEP 3: Merge Ontologies

Merge ontologies using the OWLTools API. Sometimes errors only exist in the presence of other ontologies. The most common error after merging ontology files is punning.
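Punning occurs when the same IRI is declared as more than one kind of OWL entity, for example as both a class and an individual. A minimal way to look for it, assuming the merged graph is available as (subject, predicate, object) string triples, is sketched below. The function name and the example IRIs are hypothetical, and this is not the cleaning code used in the build.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
# entity kinds whose co-occurrence on a single IRI is treated as punning here
ENTITY_KINDS = {OWL + "Class", OWL + "NamedIndividual", OWL + "ObjectProperty"}


def find_punned_entities(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Return every IRI that is declared as more than one OWL entity kind."""
    kinds_by_iri: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE and o in ENTITY_KINDS:
            kinds_by_iri[s].add(o)
    return {iri: kinds for iri, kinds in kinds_by_iri.items() if len(kinds) > 1}


# hypothetical IRIs: A is punned, B is not
example = [
    ("http://example.org/A", RDF_TYPE, OWL + "Class"),
    ("http://example.org/A", RDF_TYPE, OWL + "NamedIndividual"),
    ("http://example.org/B", RDF_TYPE, OWL + "Class"),
]
print(find_punned_entities(example))
```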



STEP 4: Build Edge Lists from Non-Ontology Data

Wiki Page: Dependencies
Data: subclass_construction_map.pkl

New edges can be added to the knowledge graph using two different approaches: (1) Instance-based - Asserting a new relation between an individual data point and an instance of an ontology class; (2) Subclass-based - Asserting a new relation between the subclass of the ontology class and an individual of type owl:Class. Please see the README (resources/construction_approach/README.md) for specific details regarding this method.

Instance-based
Data that is not part of an existing ontology is connected to an existing ontology class by creating an instance of an existing ontology class via rdf:type and then connecting the data to that instance of the ontology class.

EXAMPLE: Adding the edge: Morphine ➞ isSubstanceThatTreats ➞ Migraine

Would require adding:

  • isSubstanceThatTreats(Morphine, x1)
  • Type(x1, Migraine)

In this example, Morphine is an ontology data node from ChEBI and Migraine is a Human Phenotype Ontology term. This would result in the following triples, assuming that both Morphine and Migraine are existing ontology concepts:

UUID1 = MD5(Morphine + isSubstanceThatTreats + Migraine + "subject")
UUID2 = MD5(Morphine + isSubstanceThatTreats + Migraine + "object")

UUID1, rdf:type, Morphine
UUID1, rdf:type, owl:NamedIndividual

UUID2, rdf:type, Migraine
UUID2, rdf:type, owl:NamedIndividual

UUID1, isSubstanceThatTreats, UUID2
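The instance-based pattern above can be sketched in a few lines of Python. This is an illustration only: the `md5_node` and `instance_based_triples` helpers are hypothetical, labels stand in for full URIs, and the `N` prefix on the hash is an assumption modeled on the pkt identifiers shown in the example output later on this page.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]
PKT_NS = "https://github.com/callahantiff/PheKnowLator/pkt/"


def md5_node(*parts: str) -> str:
    """Hash the concatenated parts into a pkt-style node identifier (assumed format)."""
    digest = hashlib.md5("".join(parts).encode("utf-8")).hexdigest()
    return PKT_NS + "N" + digest


def instance_based_triples(subj: str, rel: str, obj: str) -> List[Triple]:
    """Return the five triples listed above for one instance-based edge."""
    uuid1 = md5_node(subj, rel, obj, "subject")
    uuid2 = md5_node(subj, rel, obj, "object")
    return [
        (uuid1, "rdf:type", subj),
        (uuid1, "rdf:type", "owl:NamedIndividual"),
        (uuid2, "rdf:type", obj),
        (uuid2, "rdf:type", "owl:NamedIndividual"),
        (uuid1, rel, uuid2),
    ]


for triple in instance_based_triples("Morphine", "isSubstanceThatTreats", "Migraine"):
    print(triple)
```

Because the identifiers are hashes of the edge itself, rebuilding the same edge always yields the same two individuals.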

Subclass-based
Data that is not part of an existing ontology is connected to an existing ontology class via rdfs:subClassOf. This method allows the newly added data to have rdf:type owl:Class.

EXAMPLE: Adding the edge: TGFB1 ➞ participatesIn ➞ Influenza Virus Induced Apoptosis

Would require adding:

  • participatesIn(TGFB1, Influenza Virus Induced Apoptosis)
  • subClassOf(Influenza Virus Induced Apoptosis, Influenza A pathway)
  • Type(Influenza Virus Induced Apoptosis, owl:Class)

Where TGFB1 is a Protein Ontology term and Influenza Virus Induced Apoptosis is a non-ontology data node from Reactome. In this example, Influenza A Pathway is an existing Pathway Ontology class. This would result in the following triples, assuming that TGFB1 is an existing ontology concept:

UUID1 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis)
UUID2 = MD5(TGFB1 + participatesIn + Influenza Virus Induced Apoptosis + owl:Restriction)

Influenza Virus Induced Apoptosis, rdfs:subClassOf, Influenza A Pathway
Influenza Virus Induced Apoptosis, rdf:type, owl:Class

UUID1, rdfs:subClassOf, TGFB1
UUID1, rdfs:subClassOf, UUID2
UUID2, rdf:type, owl:Restriction
UUID2, owl:someValuesFrom, Influenza Virus Induced Apoptosis
UUID2, owl:onProperty, participatesIn
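A comparable sketch for the subclass-based pattern is shown below. As before, the helper names are hypothetical, labels stand in for full URIs, and the `N`-prefixed hash format is an assumption based on the pkt identifiers in the example output later on this page.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]
PKT_NS = "https://github.com/callahantiff/PheKnowLator/pkt/"


def md5_node(*parts: str) -> str:
    """Hash the concatenated parts into a pkt-style node identifier (assumed format)."""
    digest = hashlib.md5("".join(parts).encode("utf-8")).hexdigest()
    return PKT_NS + "N" + digest


def subclass_based_triples(subj: str, rel: str, obj: str, parent: str) -> List[Triple]:
    """Return the seven triples listed above for one subclass-based edge."""
    uuid1 = md5_node(subj, rel, obj)
    uuid2 = md5_node(subj, rel, obj, "owl:Restriction")
    return [
        # attach the non-ontology node to an existing ontology class
        (obj, "rdfs:subClassOf", parent),
        (obj, "rdf:type", "owl:Class"),
        # encode the edge as an existential restriction
        (uuid1, "rdfs:subClassOf", subj),
        (uuid1, "rdfs:subClassOf", uuid2),
        (uuid2, "rdf:type", "owl:Restriction"),
        (uuid2, "owl:someValuesFrom", obj),
        (uuid2, "owl:onProperty", rel),
    ]


edge = subclass_based_triples(
    "TGFB1", "participatesIn", "Influenza Virus Induced Apoptosis", "Influenza A Pathway"
)
for triple in edge:
    print(triple)
```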

A table is provided below showing the different triples that are added as a function of edge type (i.e., class-class vs. class-instance vs. instance-instance) and relation strategy (i.e., relations only or relations + inverse relations).



STEP 5: Handling Knowledge Graph Relations and Entity Metadata

Jupyter Notebook: Data_Preparation.ipynb

Relations
Wiki Page: Dependencies

PheKnowLator can be built using a single set of provided relations with or without the inclusion of each relation's inverse by leveraging the owl:inverseOf property. For example:
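A minimal sketch of the inverse-relation option is given here, assuming edges are plain (subject, relation, object) tuples and that inverse pairs are supplied as a dictionary. The `add_inverse_edges` helper and the relation labels are illustrative stand-ins, not the pkt implementation.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]


def add_inverse_edges(edges: List[Edge], inverse_of: Dict[str, str]) -> List[Edge]:
    """Return the input edges plus one reversed edge for every relation with a known inverse."""
    result = list(edges)
    seen = set(edges)
    for s, p, o in edges:
        inverse = inverse_of.get(p)
        if inverse is None:
            continue  # relation has no declared inverse; keep only the original edge
        reversed_edge = (o, inverse, s)
        if reversed_edge not in seen:
            seen.add(reversed_edge)
            result.append(reversed_edge)
    return result


# illustrative labels; a real build would use relation URIs
inverse_of = {"has phenotype": "phenotype of"}
edges = [("MONDO_0014305", "has phenotype", "HP_0001336")]
print(add_inverse_edges(edges, inverse_of))
```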


Entity Metadata
Wiki Page: Dependencies
Jupyter Notebook: Data_Preparation.ipynb

Before building a knowledge graph, one may need to prepare files needed to create mappings between identifiers and/or to filter input edge data sources. The Jupyter Notebook referenced above provides several detailed examples of how these data were created for the knowledge graphs available for the v2.0.0 build.

The knowledge graph can be built with or without the inclusion of instance entity metadata (i.e. labels, descriptions or definitions, and synonyms).

{
    'nodes': {
        'http://www.ncbi.nlm.nih.gov/gene/1': {
            'Label': 'A1BG',
            'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
            'Synonym': 'HYST2477alpha-1B-glycoprotein|HEL-S-163pA|ABG|A1B|GAB'} ... },
    'relations': {
        'http://purl.obolibrary.org/obo/RO_0002533': {
            'Label': 'sequence atomic unit',
            'Description': 'Any individual unit of a collection of like units arranged in a linear order',
            'Synonym': 'None'} ... }
} 


STEP 6: Remove OWL Semantics

Wiki Page: Dependencies

The knowledge graph can be built with or without the inclusion of edges that contain OWL Semantics. For information on how OWL-encoded classes and triples are filtered, please see the OWL-NETS 2.0 wiki.



STEP 7: Generate Knowledge Graph Output

We provide several different types of output, each of which is described briefly below. Please note that in order to create the logic (XXXX_OWL_LogicOnly.nt) and annotation (XXXX_OWL_AnnotationsOnly.nt) subsets of each graph and be able to combine them (XXXX_OWL.nt) we have added a namespace to all BNode or anonymous nodes. More specifically, there are two kinds of pkt namespaces you will find within these files:

  1. https://github.com/callahantiff/PheKnowLator/pkt/. This namespace is used for all non-ontology data defined owl:Class and owl:NamedIndividual objects that are added in order to integrate non-ontological entities (see here for more information).
  2. https://github.com/callahantiff/PheKnowLator/pkt/bnode/. This namespace is used for all existing BNode or anonymous nodes and is applied to these types of entities prior to subsetting an input graph.

To remove the second type of namespacing from BNodes that are part of the original ontologies used in each build, you can run the code shown below:

from pkt.utils import removes_namespace_from_bnodes

# remove bnode namespaces
updated_graph = removes_namespace_from_bnodes(org_graph) 

Please also note that for all builds prior to v3.0.2, there are 2,008 nodes in the NodeLabels.txt files that contain foreign characters. While there is now code in place to prevent this error from happening in the future, there is also a solution to account for the prior builds. The bad_node_patch.json file contains a dictionary where the outer keys are entity_uri values and the outer values are dictionaries whose keys are label and description/definition and whose values are the updated strings without foreign characters. An example of this dictionary is shown below:

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
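One way such a patch could be applied is sketched below, assuming the NodeLabels rows have already been loaded as dictionaries keyed by column name. The `apply_bad_node_patch` helper and the garbled label in the example row are made up for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def apply_bad_node_patch(rows: List[Row], patch: Dict[str, Dict[str, str]]) -> List[Row]:
    """Overwrite label and description/definition for every row whose entity_uri is in the patch."""
    patched = []
    for row in rows:
        fix = patch.get(row["entity_uri"], {})
        patched.append({**row, **fix})  # patch values win; unpatched rows are copied unchanged
    return patched


key = "<http://purl.obolibrary.org/obo/UBERON_0000468>"
bad_node_patch = {
    key: {
        "label": "multicellular organism",
        "description/definition": "Anatomical structure that is an individual member of a "
                                  "species and consists of more than one cell.",
    }
}
# hypothetical row with a stray CJK character appended to the label
rows = [{"entity_uri": key, "label": "multicellular organism\u4e2d", "description/definition": "NA"}]
print(apply_bad_node_patch(rows, bad_node_patch)[0]["label"])
```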

The code to identify the nodes with erroneous foreign characters is shown below:

import re
import pandas as pd

# path to the downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# identify bad nodes and filter DataFrame so it only contains these rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search("[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()
Table. Knowledge Graph Build Output
File Details
PheKnowLator_MergedOntologies.owl Description This RDF/XML formatted file only contains the baseline set of cleaned merged ontologies.
Example Output

<?xml version="1.0"?>
<rdf:RDF xmlns="http://purl.obolibrary.org/obo/chebi.owl#"
     xml:base="http://purl.obolibrary.org/obo/chebi.owl"
     xmlns:chebi="http://purl.obolibrary.org/obo/chebi/"
     xmlns:refont="http://purl.obolibrary.org/obo/uberon/refont/"
     xmlns:obo2="http://www.geneontology.org/formats/oboInOwl#http://purl.obolibrary.org/obo/"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:cellline1="http://www.ebi.ac.uk/cellline#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:swrlb="http://www.w3.org/2003/11/swrlb#"
... >
OWL Builds
The OWL builds store the complete expressive graph for either the subclass or instance builds.
XXXX_OWL_LogicOnly.nt Description This N-Triples formatted file contains the logical axioms for the baseline set of cleaned merged
ontologies and all non-ontology edges. It does not contain any annotation assertions (i.e., metadata
like labels, definitions, and synonyms). This file contains the minimum logical subset needed to run
a deductive logic reasoner.
Example Output

<https://github.com/callahantiff/PheKnowLator/pkt/N1008c5d52d72c407c8e1fe6960cc079c> <http://purl.obolibrary.org/obo/RO_0002511> <https://github.com/callahantiff/PheKnowLator/pkt/Ndd2e5c34e5200f57748b92ce48e01e97> .
<http://purl.obolibrary.org/obo/HP_0025154> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
<https://github.com/callahantiff/PheKnowLator/pkt/N354f816a252cbb880e55791e2f6c6c57> <http://purl.obolibrary.org/obo/RO_0002606> <https://github.com/callahantiff/PheKnowLator/pkt/N3390f9ec251ef7dc03acc8f7131f44dd> .
<https://github.com/callahantiff/PheKnowLator/pkt/N99e5d2b45fed4e35dfeca4adc3efd5f6> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> .
<http://purl.obolibrary.org/obo/UBERON_0034871> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <https://github.com/callahantiff/PheKnowLator/pkt/bnode/Naaf1e1ac9eb14bae931889cdfadf1fb2> .
...
XXXX_OWL_AnnotationsOnly.nt Description This N-Triples formatted file contains annotation assertions (i.e., metadata like labels,
definitions, and synonyms) for the baseline set of cleaned merged ontologies and all
non-ontology edges.
Example Output

<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000504742> <http://www.w3.org/2000/01/rdf-schema#label> "SLC4A9-202" .
<http://purl.obolibrary.org/obo/CLO_0017167> <http://www.w3.org/2000/01/rdf-schema#seeAlso> "OMIM: 168600"^^<http://www.w3.org/2001/XMLSchema#string> .
<https://github.com/callahantiff/PheKnowLator/pkt/bnode/N3794abf456e345b3bb974563deb1e42d> <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "MESH:D000820"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://purl.obolibrary.org/obo/GO_1990556> <http://www.geneontology.org/formats/oboInOwl#created_by> "vw" .
<http://purl.obolibrary.org/obo/UBERON_0004784> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "lower chamber of heart anatomical wall" .
...
XXXX_OWL.nt Description This N-Triples formatted file contains the baseline set of cleaned merged ontologies and
all non-ontology edges. It contains the minimum logical subset (XXXX_OWL_LogicOnly.nt)
and all annotation assertions (XXXX_OWL_AnnotationsOnly.nt). This file contains all OWL
semantics needed to run a deductive logic reasoner.
Example Output

<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000504742> <http://www.w3.org/2000/01/rdf-schema#label> "SLC4A9-202" .
<http://purl.obolibrary.org/obo/CLO_0017167> <http://www.w3.org/2000/01/rdf-schema#seeAlso> "OMIM: 168600"^^<http://www.w3.org/2001/XMLSchema#string> .
<https://github.com/callahantiff/PheKnowLator/pkt/bnode/N3794abf456e345b3bb974563deb1e42d> <http://www.geneontology.org/formats/oboInOwl#hasDbXref> "MESH:D000820"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://purl.obolibrary.org/obo/GO_1990556> <http://www.geneontology.org/formats/oboInOwl#created_by> "vw" .
<http://purl.obolibrary.org/obo/UBERON_0004784> <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> "lower chamber of heart anatomical wall" .
...
XXXX_OWL_NetworkxMultiDiGraph.gpickle Description This file is a NetworkX MultiDiGraph representation of the same content that is stored in
the XXXX_OWL.nt file. Note that this representation includes keys for nodes and edges
(node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")).
Each edge also has a default weight of 0.0.
Example Output Encoded File format; No preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details.
Or see the example below:

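import pickle
import networkx as nx  # must be installed so the pickled MultiDiGraph can be loaded
from rdflib import URIRef
# read in graph (nx.read_gpickle was removed in NetworkX 3.0; the file is a standard pickle)
f = 'XXXX_OWL_NetworkxMultiDiGraph.gpickle'
with open(f, 'rb') as infile:
    kg = pickle.load(infile)
# look up nodes
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')]
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]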
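The edge key described for the MultiDiGraph files can be approximated with `hashlib`. This sketch assumes the three URIs are concatenated as plain strings with no separator and that the hexadecimal digest is used as the key; the exact string form used by the build may differ, and the `predicate_key` helper is hypothetical.

```python
import hashlib


def predicate_key(s_uri: str, p_uri: str, o_uri: str) -> str:
    """MD5 hex digest of the concatenated subject, predicate, and object URIs."""
    return hashlib.md5((s_uri + p_uri + o_uri).encode("utf-8")).hexdigest()


key = predicate_key(
    "http://purl.obolibrary.org/obo/MONDO_0014305",
    "http://purl.obolibrary.org/obo/RO_0002200",
    "http://purl.obolibrary.org/obo/HP_0001336",
)
print(key)
```

Keying edges this way lets two nodes be connected by several relations at once, since each (subject, predicate, object) combination hashes to its own key.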
XXXX_OWL_Triples_Identifiers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files,
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object),
where each identifier is the full resolvable URI.
Example Output

subject   predicate   object
<https://github.com/callahantiff/PheKnowLator/pkt/N1f1d61aed39aa7c2fd9ad2b40a23dce0>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://www.w3.org/2002/07/owl#NamedIndividual>
<https://github.com/callahantiff/PheKnowLator/pkt/N4e21b014fe4347facaec2a309eafcf3b>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://purl.obolibrary.org/obo/UBERON_0008952>
<https://github.com/callahantiff/PheKnowLator/pkt/N4e756731643dcfdc7fbb6cc6aa898b59>	<http://purl.obolibrary.org/obo/RO_0002200>	<https://github.com/callahantiff/PheKnowLator/pkt/N855ce51e1cbada67ff58bac057e628cc>
<https://github.com/callahantiff/PheKnowLator/pkt/N016dbc163f9535349d961768267afe35>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://www.w3.org/2002/07/owl#NamedIndividual>
<http://purl.obolibrary.org/obo/CHEBI_154851>	<http://www.w3.org/2000/01/rdf-schema#subClassOf>	<http://purl.obolibrary.org/obo/CHEBI_50699>
...
XXXX_OWL_Triples_Integers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object).
The primary difference between this file and the XXXX_OWL_Triples_Identifiers.txt file is that
each full URI identifier has been mapped to an integer.
Example Output

subject   predicate   object
1	2	3
4	2	5
6	7	8
9	2	3
10	11	12
...
XXXX_OWL_Triples_Integer_Identifier_Map.json Description This JSON file contains a dictionary where the keys are node identifiers and the values are integers.
It stores the conversion from the XXXX_OWL_Triples_Identifiers.txt file to the
XXXX_OWL_Triples_Integers.txt file.
Example Output

{"<https://github.com/callahantiff/PheKnowLator/pkt/N55ef15b2f8a12726db7caa5567c2632f>": 398640,
"<https://github.com/callahantiff/PheKnowLator/pkt/Nd3e86eb584157041fa49617139ce5d4c>": 398641,
"<https://github.com/callahantiff/PheKnowLator/pkt/N36f2ad24b20497da3d219f819ee7e37c>": 398642,
"<https://github.com/callahantiff/PheKnowLator/pkt/N13fc5884c3f7e166f5bc2469f79f4b01>": 398643,
"<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000409200>": 398644
...}
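The relationship between the identifier triples, the integer triples, and the JSON map can be sketched as follows. This assumes integers are assigned in order of first appearance starting at 1, which is consistent with the example outputs above but is not confirmed as the build's actual procedure. The `build_integer_map` helper and the shortened identifiers are hypothetical.

```python
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_integer_map(triples: List[Triple]) -> Tuple[Dict[str, int], List[Tuple[int, ...]]]:
    """Assign each distinct identifier the next unused integer and rewrite the triples."""
    id_map: Dict[str, int] = {}
    int_triples = []
    for triple in triples:
        # setdefault only stores len(id_map) + 1 when the term has not been seen before
        int_triples.append(tuple(id_map.setdefault(term, len(id_map) + 1) for term in triple))
    return id_map, int_triples


# shortened stand-ins for the full URIs
triples = [
    ("<pkt/N1>", "<rdf:type>", "<owl:NamedIndividual>"),
    ("<pkt/N2>", "<rdf:type>", "<obo:UBERON_0008952>"),
    ("<pkt/N3>", "<obo:RO_0002200>", "<pkt/N4>"),
]
id_map, int_triples = build_integer_map(triples)
print(int_triples)
print(json.dumps(id_map, indent=2))

# invert the map to translate integer triples back into identifiers
int_to_id = {integer: identifier for identifier, integer in id_map.items()}
```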
XXXX_OWL_NodeLabels.txt Description This tab-delimited .txt file contains metadata on all nodes and relations in the N-Triples, gpickle, and
XXXX_OWL_Triples_Identifiers.txt files. It contains the following columns:
  1. entity_type (e.g., "NODES", "RELATIONS", or "NA" if not an owl:Class, owl:NamedIndividual,
    owl:ObjectProperty, or owl:AnnotationProperty)
  2. integer_id (e.g., 1 - the integer used to represent this URI in the Edge List output -- matches the
    integer assignment from the XXXX_OWL_Triples_Integers.txt file)
  3. entity_uri (e.g., "GO_0048252")
  4. label (e.g., "lauric acid metabolic process")
  5. description/definition (e.g., "The chemical reactions and pathways involving lauric acid, a fatty
    acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
  6. synonym (e.g., "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic
    acid metabolism")

NOTE. There will be entries in this file that contain values of "NA" for the entity_type column. This
is expected for these types of builds; a value of "NA" is used for all nodes and relations that are not
an owl:Class, owl:NamedIndividual, owl:ObjectProperty or owl:AnnotationProperty.
Example Output

entity_type   integer_id   entity_uri   label   description/definition   synonym
NODES   375312   <http://www.ncbi.nlm.nih.gov/gene/58155>   PTBP2 (human)   A protein coding gene PTBP2 in human.   None
NODES   6297907   <https://www.ncbi.nlm.nih.gov/snp/rs10902762>   NM_000203.5(IDUA):c.60G>A (p.Ala20=)   This variant is a germline/unknown single nucleotide variant located on chromosome 4 (NC_000004.12, start:987144/stop:987144 positions, cytogenetic location:4p16.3) and has clinical significance 'Benign'. This entry is for the GRCh38 and was last reviewed on Nov 26, 2020 with review status 'criteria provided, multiple submitters, no conflicts'.   None
NA   7892255   <https://github.com/callahantiff/PheKnowLator/pkt/N707b36b2731f5ca97561eeb17e1fb039>   NA   NA   NA
RELATIONS   2057563   <http://purl.obolibrary.org/obo/RO_0002002>   has boundary   a relation between a material entity and a 2D immaterial entity (the boundary), in which the boundary delimits the material entity   None
RELATIONS   958453   <http://purl.obolibrary.org/obo/RO_0002444>   parasite of   None   direct parasite of
...
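A NodeLabels file can be read with the standard `csv` module. The sketch below assumes a tab-delimited file with the six column names listed above and no quoting; the `load_node_labels` helper is hypothetical, and the in-memory sample reuses one row from the example output.

```python
import csv
import io
from typing import Dict, TextIO


def load_node_labels(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Index NodeLabels rows by entity_uri; each value maps column name to cell value."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["entity_uri"]: row for row in reader}


header = ["entity_type", "integer_id", "entity_uri", "label", "description/definition", "synonym"]
row = ["RELATIONS", "958453", "<http://purl.obolibrary.org/obo/RO_0002444>",
       "parasite of", "None", "direct parasite of"]
sample = "\t".join(header) + "\n" + "\t".join(row) + "\n"

labels = load_node_labels(io.StringIO(sample))
print(labels["<http://purl.obolibrary.org/obo/RO_0002444>"]["label"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open('XXXX_OWL_NodeLabels.txt', newline='')`.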
OWL-NETS Builds
The OWL-NETS files have undergone a transformation that decodes all OWL semantics in order to create a
graph that only contains biologically relevant nodes and edges and is much more useful for inductive
types of machine learning. For more information on this transformation see:
https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0
XXXX_OWLNETS.nt Description This N-Triples formatted file contains the OWL-NETS transformed build.
Example Output

<http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> .
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0000673> .
<http://purl.obolibrary.org/obo/CHEBI_154851> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/CHEBI_50699> .
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.obolibrary.org/obo/SO_0001217> .
<http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> .
...
XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle Description This file is a NetworkX MultiDiGraph representation of the same content that is stored in
the XXXX_OWLNETS.nt file. Note that this representation includes keys for nodes and
edges (node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")).
Each edge also has a default weight of 0.0.
Example Output Encoded File format; No preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details.
Or see the example below:

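import pickle
import networkx as nx  # must be installed so the pickled MultiDiGraph can be loaded
from rdflib import URIRef
# read in graph (nx.read_gpickle was removed in NetworkX 3.0; the file is a standard pickle)
f = 'XXXX_OWLNETS_NetworkxMultiDiGraph.gpickle'
with open(f, 'rb') as infile:
    kg = pickle.load(infile)
# look up nodes
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')]
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]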
XXXX_OWLNETS_decoding_dict.pkl Description This dictionary stores details about the OWL-NETS transformation. Specifically, it
contains metadata that can be used to reverse the transformation.
Example Output

{disjointWith:
	(rdflib.term.BNode('N0fb945ed26b14180907e29b5ffa1403e'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.BNode('N4a694a93a05843c3a0492587318538ca')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033946'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033947')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0001628'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0006764'))
filtered_triples:
	(rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000641330'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0001025'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0000178')),
	(rdflib.term.URIRef('http://www.ncbi.nlm.nih.gov/gene/23362'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002511'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000518315')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q6ZVK8'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_63715'))
... }
XXXX_OWLNETS_Triples_Identifiers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files,
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object),
where each identifier is the full resolvable URI.
Example Output

subject   predicate   object
<http://purl.obolibrary.org/obo/MONDO_0014305>   <http://purl.obolibrary.org/obo/RO_0002200>   <http://purl.obolibrary.org/obo/HP_0001336>
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544>   <http://www.w3.org/2000/01/rdf-schema#subClassOf>   <http://purl.obolibrary.org/obo/SO_0000673>
<http://purl.obolibrary.org/obo/CHEBI_154851>   <http://www.w3.org/2000/01/rdf-schema#subClassOf>   <http://purl.obolibrary.org/obo/CHEBI_50699>
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544>   <http://www.w3.org/2000/01/rdf-schema#subClassOf>   <http://purl.obolibrary.org/obo/SO_0001217>
<http://purl.obolibrary.org/obo/CHEBI_50131>   <http://purl.obolibrary.org/obo/RO_0002436>   <http://purl.obolibrary.org/obo/GO_0010604>
...
XXXX_OWLNETS_Triples_Integers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files,
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object).
The primary difference between this file and the XXXX_OWLNETS_Triples_Identifiers.txt file is that
each full URI identifier has been mapped to an integer.
Example Output

subject   predicate   object
1   2   3
4   5   6
7   5   8
9   5   10
11   12   13
...
XXXX_OWLNETS_Triples_Integer_Identifier_Map.json Description This JSON file contains a dictionary where the keys are node identifiers and the values are integers.
It stores the conversion from the XXXX_OWLNETS_Triples_Identifiers.txt file to the
XXXX_OWLNETS_Triples_Integers.txt file.
Example Output

{"<http://purl.obolibrary.org/obo/CHEBI_59626>": 763807,
"<http://purl.obolibrary.org/obo/CHEBI_138446>": 763808,
"<http://purl.obolibrary.org/obo/GO_0039685>": 763809,
"<http://purl.obolibrary.org/obo/CHEBI_37269>": 763810,
"<http://purl.obolibrary.org/obo/HP_0025531>": 763811
...}
XXXX_OWLNETS_NodeLabels.txt Description This tab-delimited .txt file contains metadata on all nodes and relations in the N-Triples, gpickle,
and XXXX_OWLNETS_Triples_Identifiers.txt files. It contains the following columns:
  1. entity_type (e.g., "NODES", "RELATIONS", or "NA" if not an owl:Class, owl:NamedIndividual,
    owl:ObjectProperty, or owl:AnnotationProperty)
  2. integer_id (e.g., 1 - the integer used to represent this URI in the Edge List output -- matches the
    integer assignment from the XXXX_OWLNETS_Triples_Integers.txt file)
  3. entity_uri (e.g., "GO_0048252")
  4. label (e.g., "lauric acid metabolic process")
  5. description/definition (e.g., "The chemical reactions and pathways involving lauric acid, a fatty
    acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
  6. synonym (e.g., "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic
    acid metabolism")
Example Output

entity_type   integer_id   entity_uri   label   description/definition   synonym
NODES   260743   <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000651614>   IL6ST-216   Transcript IL6ST-216 is classified as type 'nonsense_mediated_decay'.   None
NODES   289592   <https://www.ncbi.nlm.nih.gov/snp/rs116659770>   NM_001440.4(EXTL3):c.1324G>C (p.Val442Leu)   This variant is a germline single nucleotide variant located on chromosome 8 (NC_000008.11, start:28717383/stop:28717383 positions, cytogenetic location:8p21.1) and has clinical significance 'Benign/Likely benign'. This entry is for the GRCh38 and was last reviewed on Nov 20, 2020 with review status 'criteria provided, multiple submitters, no conflicts'.   None
NODES   45199   <https://www.ncbi.nlm.nih.gov/snp/rs375573986>   NM_000098.3(CPT2):c.399A>G (p.Pro133=)   This variant is a germline single nucleotide variant located on chromosome 1 (NC_000001.11, start:53210073/stop:53210073 positions, cytogenetic location:1p32.3) and has clinical significance 'Likely benign'. This entry is for the GRCh38 and was last reviewed on Aug 30, 2020 with review status 'criteria provided, multiple submitters, no conflicts'.   None
RELATIONS   107080   <http://purl.obolibrary.org/obo/RO_0002492>   existence ends during   Relation between continuant c and occurrent s, such that every instance of c ceases to exist during some s, if it does not die prematurely.   ceases_to_exist_during
RELATIONS   189912   <http://purl.obolibrary.org/obo/VO_0000529>   has vaccine adjuvant   a type of 'has vaccine component' relation that is specifically for vaccine adjuvant component   None
...
Purified OWL-NETS Builds
The purified version of an OWL-NETS build is designed to convert the base OWL-NETS build into a version
that is completely consistent with a specific construction approach. For example, if the build is
instance-based, then all rdfs:subClassOf relations are converted to rdf:type and for all triples where
an rdfs:subClassOf relation occurred we add rdf:type relations between the object of this triple and all
of its ancestors. For a subclass-based build, we implement the same procedure but replace all occurrences
of rdf:type with rdfs:subClassOf. Please note that these build types are considered experimental as we are
still in the process of fully testing them.
XXXX_OWLNETS_XXXX_purified_OWLNETS.nt Description This N-Triples formatted file contains the purified OWL-NETS transformed build.
Example Output

<http://purl.obolibrary.org/obo/MONDO_0014305> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0001336> .
<http://purl.obolibrary.org/obo/CHEBI_50131> <http://purl.obolibrary.org/obo/RO_0002436> <http://purl.obolibrary.org/obo/GO_0010604> .
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000551077> <http://purl.obolibrary.org/obo/RO_0001025> <http://purl.obolibrary.org/obo/CLO_0000652> .
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000672281> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/SO_0001503> .
<http://purl.obolibrary.org/obo/MONDO_0011070> <http://purl.obolibrary.org/obo/RO_0002200> <http://purl.obolibrary.org/obo/HP_0002714> .
...
XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle Description This file is a NetworkX MultiDiGraph representation of the same content that is stored in the
XXXX_OWLNETS_XXXX_purified_OWLNETS.nt file. Note that this representation includes keys for nodes
and edges (node: key = URI; edge: predicate_key = MD5hash("s_uri" + "p_uri" + "o_uri")). Each edge
also has a default weight of 0.0.
Example Output Encoded File format; No preview. See https://networkx.org/documentation/stable/reference/readwrite/gpickle.html for more details.
Or see the example below:

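import pickle
import networkx as nx  # must be installed so the pickled MultiDiGraph can be loaded
from rdflib import URIRef
# read in graph (nx.read_gpickle was removed in NetworkX 3.0; the file is a standard pickle)
f = 'XXXX_OWLNETS_XXXX_purified_NetworkxMultiDiGraph.gpickle'
with open(f, 'rb') as infile:
    kg = pickle.load(infile)
# look up nodes
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_73558')]
kg[URIRef('http://purl.obolibrary.org/obo/CHEBI_28940')]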
XXXX_OWLNETS_XXXX_purified_decoding_dict.pkl Description This dictionary stores details about the purified OWL-NETS transformation. Specifically, it
contains metadata that can be used to reverse the transformation.
Example Output

{disjointWith:
	(rdflib.term.BNode('N0fb945ed26b14180907e29b5ffa1403e'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.BNode('N4a694a93a05843c3a0492587318538ca')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033946'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/MONDO_0033947')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0001628'), rdflib.term.URIRef('http://www.w3.org/2002/07/owl#disjointWith'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0006764'))
filtered_triples:
	(rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000641330'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0001025'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/UBERON_0000178')),
	(rdflib.term.URIRef('http://www.ncbi.nlm.nih.gov/gene/23362'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002511'), rdflib.term.URIRef('https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000518315')),
	(rdflib.term.URIRef('http://purl.obolibrary.org/obo/PR_Q6ZVK8'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002436'), rdflib.term.URIRef('http://purl.obolibrary.org/obo/CHEBI_63715'))
... }
 
XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files,
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object),
where each identifier is the full resolvable URI.
Example Output

subject   predicate   object
<http://purl.obolibrary.org/obo/MONDO_0014305>   <http://purl.obolibrary.org/obo/RO_0002200>   <http://purl.obolibrary.org/obo/HP_0001336>
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000649544>   <http://www.w3.org/2000/01/rdf#type>   <http://purl.obolibrary.org/obo/SO_0000673>
<http://purl.obolibrary.org/obo/CHEBI_154851>   <http://www.w3.org/2000/01/rdf#type>   <http://purl.obolibrary.org/obo/CHEBI_50699>
<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000422544>   <http://www.w3.org/2000/01/rdf#type>   <http://purl.obolibrary.org/obo/SO_0001217>
<http://purl.obolibrary.org/obo/CHEBI_50131>   <http://purl.obolibrary.org/obo/RO_0002436>   <http://purl.obolibrary.org/obo/GO_0010604>
...
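
As a minimal sketch of how this file might be consumed, the snippet below parses tab-delimited triples with Python's csv module and strips the surrounding angle brackets. It assumes the file starts with the header row shown in the example output. The in-memory `sample` string stands in for the real file, and `read_triples` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the file contents: a header row followed by tab-delimited triples
sample = (
    "subject\tpredicate\tobject\n"
    "<http://purl.obolibrary.org/obo/MONDO_0014305>\t"
    "<http://purl.obolibrary.org/obo/RO_0002200>\t"
    "<http://purl.obolibrary.org/obo/HP_0001336>\n"
    "<http://purl.obolibrary.org/obo/CHEBI_50131>\t"
    "<http://purl.obolibrary.org/obo/RO_0002436>\t"
    "<http://purl.obolibrary.org/obo/GO_0010604>\n"
)


def read_triples(handle):
    """Parse tab-delimited triples and remove the angle brackets around each URI."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        tuple(row[col].strip("<>") for col in ("subject", "predicate", "object"))
        for row in reader
    ]


triples = read_triples(io.StringIO(sample))
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

To read the real file, pass an open file handle (for example, `open(path, newline="")`) instead of the `io.StringIO` object.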
 
XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt Description This tab-delimited text file contains the same information as the .nt and .gpickle files,
but is organized into a common format used by many graph representation learning algorithms.
The file contains three columns, one for each part of a triple (i.e., subject, predicate, object).
The primary difference between this file and the
XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt file is that the identifier URIs have been
mapped to integers.
Example Output

subject   predicate   object
1   2   3
4   5   6
7   5   8
9   5   10
11   12   13
...
 
XXXX_OWLNETS_XXXX_purified_Triples_Integer_Identifier_Map.json Description This JSON file contains a dictionary where the keys are node and relation identifiers (URIs) and the values are integers.
It stores the mapping used to convert the XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt file into the
XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt file.
Example Output

{"<http://purl.obolibrary.org/obo/CHEBI_59626>": 763807,
"<http://purl.obolibrary.org/obo/CHEBI_138446>": 763808,
"<http://purl.obolibrary.org/obo/GO_0039685>": 763809,
"<http://purl.obolibrary.org/obo/CHEBI_37269>": 763810,
"<http://purl.obolibrary.org/obo/HP_0025531>": 763811
...}
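
The relationship between the identifier triples, the integer triples, and the map can be sketched as follows. This illustration assumes integers are assigned in order of first appearance, starting from 1; the actual assignment order used by the build is not specified here. The helper `build_integer_map` is hypothetical.

```python
import json

# Two identifier triples taken from the N-Triples example output
identifier_triples = [
    ("<http://purl.obolibrary.org/obo/MONDO_0014305>",
     "<http://purl.obolibrary.org/obo/RO_0002200>",
     "<http://purl.obolibrary.org/obo/HP_0001336>"),
    ("<http://purl.obolibrary.org/obo/MONDO_0011070>",
     "<http://purl.obolibrary.org/obo/RO_0002200>",
     "<http://purl.obolibrary.org/obo/HP_0002714>"),
]


def build_integer_map(triples):
    """Assign each distinct identifier an integer in order of first appearance (assumed scheme)."""
    id_map = {}
    for triple in triples:
        for term in triple:
            if term not in id_map:
                id_map[term] = len(id_map) + 1
    return id_map


id_map = build_integer_map(identifier_triples)
integer_triples = [tuple(id_map[term] for term in triple) for triple in identifier_triples]

# Round-trip through JSON, as the map file would be stored, then invert it to decode
loaded_map = json.loads(json.dumps(id_map))
int_to_uri = {integer: uri for uri, integer in loaded_map.items()}
decoded_triples = [tuple(int_to_uri[i] for i in triple) for triple in integer_triples]

print(integer_triples)
```

Inverting the map is what allows results computed on the integer triples (for example, embeddings indexed by integer) to be traced back to their URIs.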
 
XXXX_OWLNETS_XXXX_purified_NodeLabels.txt Description This tab-delimited .txt file contains metadata on all nodes and relations in the N-Triples, gpickle,
and XXXX_OWLNETS_XXXX_purified_Triples_Identifiers.txt files. It contains the following columns:
  1. entity_type (e.g., "NODES", "RELATIONS", or "NA" if not an owl:Class, owl:NamedIndividual,
    owl:ObjectProperty, or owl:AnnotationProperty)
  2. integer_id (e.g., 1; the integer used to represent this URI in the edge list output, which matches the
    integer assignment from the XXXX_OWLNETS_XXXX_purified_Triples_Integers.txt file)
  3. entity_uri (e.g., "GO_0048252")
  4. label (e.g., "lauric acid metabolic process")
  5. description/definition (e.g., "The chemical reactions and pathways involving lauric acid, a fatty
    acid with the formula CH3(CH2)10COOH. Derived from vegetable sources.")
  6. synonym (pipe-delimited when there is more than one, e.g., "lauric acid metabolism|n-dodecanoic acid metabolic process|n-dodecanoic
    acid metabolism")
Example Output

entity_type   integer_id   entity_uri   label   description/definition   synonym
NODES   260743   <https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000651614>   IL6ST-216   Transcript IL6ST-216 is classified as type 'nonsense_mediated_decay'.   None
NODES   289592   <https://www.ncbi.nlm.nih.gov/snp/rs116659770>   NM_001440.4(EXTL3):c.1324G>C (p.Val442Leu)   This variant is a germline single nucleotide variant located on chromosome 8 (NC_000008.11, start:28717383/stop:28717383 positions, cytogenetic location:8p21.1) and has clinical significance 'Benign/Likely benign'. This entry is for the GRCh38 and was last reviewed on Nov 20, 2020 with review status 'criteria provided, multiple submitters, no conflicts'.   None
NODES   45199   <https://www.ncbi.nlm.nih.gov/snp/rs375573986>   NM_000098.3(CPT2):c.399A>G (p.Pro133=)   This variant is a germline single nucleotide variant located on chromosome 1 (NC_000001.11, start:53210073/stop:53210073 positions, cytogenetic location:1p32.3) and has clinical significance 'Likely benign'. This entry is for the GRCh38 and was last reviewed on Aug 30, 2020 with review status 'criteria provided, multiple submitters, no conflicts'.   None
RELATIONS   107080   <http://purl.obolibrary.org/obo/RO_0002492>   existence ends during   Relation between continuant c and occurrent s, such that every instance of c ceases to exist during some s, if it does not die prematurely.   ceases_to_exist_during
RELATIONS   189912   <http://purl.obolibrary.org/obo/VO_0000529>   has vaccine adjuvant   a type of 'has vaccine component' relation that is specifically for vaccine adjuvant component   None
...
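
A minimal sketch of loading this metadata follows, assuming the header row shown in the example output, pipe-delimited synonyms, and the literal string "None" for missing values. The in-memory `sample` reuses two rows from the example output in place of the real file.

```python
import csv
import io

header = ["entity_type", "integer_id", "entity_uri", "label", "description/definition", "synonym"]
rows = [
    ["NODES", "260743",
     "<https://uswest.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000651614>",
     "IL6ST-216",
     "Transcript IL6ST-216 is classified as type 'nonsense_mediated_decay'.",
     "None"],
    ["RELATIONS", "107080",
     "<http://purl.obolibrary.org/obo/RO_0002492>",
     "existence ends during",
     "Relation between continuant c and occurrent s, such that every instance of c "
     "ceases to exist during some s, if it does not die prematurely.",
     "ceases_to_exist_during"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + rows)

# QUOTE_NONE keeps any quote characters inside descriptions as literal text
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)

labels = {}
synonyms = {}
for row in reader:
    integer_id = int(row["integer_id"])
    labels[integer_id] = row["label"]
    # Treat the literal string "None" as "no synonyms"; otherwise split on the pipe delimiter
    synonyms[integer_id] = [] if row["synonym"] == "None" else row["synonym"].split("|")

print(labels)
```

Keying the lookup on integer_id makes it straightforward to attach human-readable labels to rows of the integer triples file.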
 

