Dependencies
Running the code in this repository requires preparing the following items:
- Resource Information
- Ontology Data
- Edge Data
- Construction Approach
- Mapping and Filtering Data
- Relations Data
- Metadata
Programmatic Assistance
Users who would like assistance with assembling the required input documents should run the `generates_dependency_documents.py` script from the command line:
python3 generates_dependency_documents.py
The figure below provides an overview of how the `resources/resource_info.txt`, `resources/ontology_source_list.txt`, and `resources/edge_source_list.txt` data sources are connected and how they work together.
GitHub Repository Location: resources/resource_info.txt
Purpose: This file is used as the master organizer for all project resources.
File Format: The program expects the information to be stored as a "|"-delimited file with the following fields:
- EdgeType: A string label for an edge (node1-node2; e.g., 'gene-disease'). The label matches what is used in the `edge_source_list.txt` and `ontology_source_list.txt` files.
- IdentifierPrefixInformation: Three ";"-separated items used to update a prefix-identifier pair (e.g., `:;GO_;GO_`). The first item is the character that separates existing prefixes and identifiers (e.g., ":" in GO:1283834). The second item is the current prefix and the third item is the new prefix (i.e., 'GO_' and 'GO_'). If there is no existing prefix (i.e., the data source provides only an identifier), leave the second item empty and provide the prefix as the third item. If the existing prefix is correct, type ";;".
- NodeDataTypes: A label of "class" or "entity" for each node in an edge, separated by `-` (e.g., "class-class"). The "class" label represents nodes from ontologies and "entity" represents nodes from other data sources.
- Relation: A Relation Ontology CURIE (e.g., `RO_0000056`).
- Delimiter: A character used to split input text rows into columns (e.g., `t` for tab-delimited data or `,` for comma-delimited data).
- ColumnIndexes: Two column indices separated by `;` (e.g., `0;4` for the first and fifth columns in the input data source).
- IdentifierMaps: A string of mapping information for each node in an edge. For example, the string `"2:mapping_file_1.txt;4:mapping_file_2.txt"` means that the first node requires data contained in the 2nd column of the `mapping_file_1.txt` file and the second node requires data from the 4th column in the `mapping_file_2.txt` file.
- EvidenceCriteria: Evidence criteria that can be used to filter an input data source (e.g., scores above a certain cut-off). An evidence set is composed of 3 pieces of ";"-separated information, and multiple evidence sets can be passed by separating them with "::". Consider the following example: `"4;!=;IEA::8;<;0.0001"`:
  - The index of the column to apply the evidence criteria to (e.g., "4" and "8" in the example above).
  - The operator (i.e., `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, `.startswith()`, `.endswith()`) to use when filtering (e.g., `!=` and `<` in the example above).
  - The value (i.e., `int`, `float`, `str`, `list`) to filter on (e.g., "IEA" and "0.0001" in the example above).
- FilterCriteria: Filtering criteria that can be used to filter an input data source (e.g., human proteins). A filtering set is composed of 3 pieces of ";"-separated information, and multiple filtering sets can be passed by separating them with "::". Consider the following example: `"5;==;P::7;==;9606"`:
  - The index of the column to apply the filtering criteria to (e.g., "5" and "7" in the example above).
  - The operator (i.e., `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, `.startswith()`, `.endswith()`) to use when filtering (e.g., `==` and `==` in the example above).
  - The value (i.e., `int`, `float`, `str`, `list`) to filter on (e.g., "P" and "9606" in the example above).

NOTE. You can also pass `dedup` as a filtering criterion (e.g., `2-0;dedup;desc`):
- The column index should be `col1-col2`:
  - `col1` is the column you want to filter on
  - `col2` is the primary identifier to deduplicate
- The value should be `asc` or `desc` to indicate the direction to sort the `pandas.DataFrame` prior to deduplicating
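The criteria strings described above can be unpacked with a few lines of Python. The following is a minimal sketch rather than the project's own parser; the function name `parse_criteria` and the tuple layout it returns are assumptions made for illustration.

```python
from typing import List, Tuple


def parse_criteria(criteria: str) -> List[Tuple[str, str, str]]:
    """Split a criteria string into (column, operator, value) triples.

    Hypothetical helper: sets are separated by "::" and the three items in
    each set by ";", as described for EvidenceCriteria and FilterCriteria.
    """
    if criteria in ("", "None"):
        return []  # the example table uses "None" when no criteria apply
    parsed = []
    for criteria_set in criteria.split("::"):
        # maxsplit=2 keeps any ";" that appears inside the value intact
        column, operator, value = criteria_set.split(";", 2)
        parsed.append((column, operator, value))
    return parsed


evidence = parse_criteria("4;!=;IEA::8;<;0.0001")
dedup = parse_criteria("2-0;dedup;desc")
print(evidence)
print(dedup)
```

Values are left as strings here; converting them to `int`, `float`, or `list` before filtering is a separate step that this sketch does not attempt.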
TABLE: An example `resource_info.txt` file is provided in the table below.
Edge Type | Source labels | Data Type | Edge Relation | Subject URI | Object URI | Delimiter | Column Indices | Identifier Maps | Evidence Criteria | Filter Criteria |
---|---|---|---|---|---|---|---|---|---|---|
chemical-gene | ;MESH_; | class-entity | RO_0002434 | http://purl.obolibrary.org/obo/ | http://purl.uniprot.org/geneid/ | t | 1;4 | 0:./resources/data_maps/MESH_CHEBI_MAP.txt | None | 7;==;9606 |
gene-gene | .;; | entity-entity | RO_0002434 | http://purl.uniprot.org/geneid/ | http://purl.uniprot.org/geneid/ | ' ' | 0;1 | 0:./resources/data_maps/STRING_ENTREZ_MAP.txt;1:./resources/data_maps/STRING_ENTREZ_MAP.txt | 2;>=;700 | None |
gene-gobp | ;; | entity-class | BFO_0000056 | http://purl.uniprot.org/geneid/ | http://purl.obolibrary.org/obo/ | t | 1;4 | 0:./resources/edge_data/gene-go_goa_class_data.txt | 8;==;P | 12;==;taxon:9606 |
pathway-disease | ;; | entity-class | RO_0003302 | https://reactome.org/content/detail/ | http://purl.obolibrary.org/obo/ | t | 1;0 | 1:disease-dbxref-map | None | 1;.startswith('R-HSA-'); |
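As a rough illustration of how one row of this file could be read, the sketch below splits a "|"-delimited line into named fields. The column names are taken from the header of the example table; their order in a real `resource_info.txt` file, and the helper name `parse_resource_row`, are assumptions that should be checked against your own file.

```python
from typing import Dict

# Column names follow the header of the example table (assumed order).
COLUMNS = [
    "edge_type", "source_labels", "data_type", "edge_relation",
    "subject_uri", "object_uri", "delimiter", "column_indices",
    "identifier_maps", "evidence_criteria", "filter_criteria",
]


def parse_resource_row(line: str) -> Dict[str, str]:
    """Turn one "|"-delimited row into a dictionary keyed by column name."""
    # Only the newline is removed so that a whitespace delimiter survives.
    values = line.rstrip("\n").split("|")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, found {len(values)}")
    return dict(zip(COLUMNS, values))


line = (
    "chemical-gene|;MESH_;|class-entity|RO_0002434|"
    "http://purl.obolibrary.org/obo/|http://purl.uniprot.org/geneid/|"
    "t|1;4|0:./resources/data_maps/MESH_CHEBI_MAP.txt|None|7;==;9606\n"
)
row = parse_resource_row(line)

# The two column indices are ";"-separated integers.
subject_col, object_col = (int(index) for index in row["column_indices"].split(";"))
print(row["edge_type"], subject_col, object_col)
```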
GitHub Repository Location: resources/ontology_source_list.txt
Purpose: This file is used to identify and download specific ontologies.
File Format: The program expects this information to be stored as a ","-delimited file.
TABLE: An example `ontology_source_list.txt` file is provided in the table below.
Ontology | URL |
---|---|
disease | http://purl.obolibrary.org/obo/doid.owl |
go | http://purl.obolibrary.org/obo/go.owl |
chemical | ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi_lite.owl |
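A comma-delimited source list like this one can be loaded into a dictionary with the standard library. This is a minimal sketch that assumes the file has no header row and exactly one label and one URL per line; `read_source_list` is a hypothetical helper name.

```python
import csv
import io
from typing import Dict, TextIO


def read_source_list(handle: TextIO) -> Dict[str, str]:
    """Map each label in the first column to the URL in the second column."""
    sources = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sources[row[0].strip()] = row[1].strip()
    return sources


# An in-memory file stands in for resources/ontology_source_list.txt.
example = io.StringIO(
    "disease, http://purl.obolibrary.org/obo/doid.owl\n"
    "go, http://purl.obolibrary.org/obo/go.owl\n"
)
ontologies = read_source_list(example)
print(ontologies["go"])
```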
GitHub Repository Location: resources/edge_source_list.txt
Purpose: This file is used to identify and download specific publicly available data sources that will be used to derive edges between ontology classes and instances of ontology classes.
File Format: The program expects this information to be stored as a ","-delimited file.
TABLE: An example `edge_source_list.txt` file is provided in the table below.
Wiki: KG-Construction
GitHub Repository Location: resources/construction_approach
Purpose: New data can be added to the knowledge graph using 2 different construction approaches: (1) instance-based or (2) subclass-based. Each of these approaches is described further below. For more details, please see `resources/construction_approach/README.md`.
🛑 CONSTRAINTS 🛑
The algorithm makes the following assumptions:
- The dictionary that maps non-ontology node data to ontology classes has been created and added to the `./resources/construction_approach/` directory.
Construction Approach: Instance-Based
In this approach, each new edge is added as an instance of an existing class (via `rdf:type`) in the knowledge graph.
EXAMPLE: Adding the edge: Morphine ➞ `isSubstanceThatTreats` ➞ Migraine
Would require adding:
- `isSubstanceThatTreats`(Morphine, `x1`)
- `Type`(`x1`, Migraine)
In this example, Morphine is a non-ontology data node and Migraine is an HPO ontology term.
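The two statements in the example can be sketched as plain triples. This is an illustration under stated assumptions, not the project's implementation: the namespace used for the anonymous node mirrors the UUID-based URLs in this section's example output file, and the function name `instance_based_edge` is hypothetical.

```python
import uuid
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumed namespace for anonymous instance nodes (see lead-in).
INSTANCE_NS = "https://github.com/callahantiff/PheKnowLator/obo/ext/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def instance_based_edge(subject: str, relation: str, ontology_class: str) -> List[Triple]:
    """Return the two triples that link `subject` to an instance of a class."""
    instance = INSTANCE_NS + str(uuid.uuid4())  # plays the role of x1
    return [
        (subject, relation, instance),         # relation(subject, x1)
        (instance, RDF_TYPE, ontology_class),  # Type(x1, class)
    ]


triples = instance_based_edge("Morphine", "isSubstanceThatTreats", "Migraine")
for triple in triples:
    print(triple)
```

Plain labels are used in place of full URIs to keep the example close to the text.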
Outputs: A universally unique identifier (UUID) is created for each anonymous node representing an instance of a class. In order to fully utilize the knowledge graph, a `.json` file containing the mapping from each UUID instance to its ontology class is output to the `./resources/construction_approach/instance` directory. For example,
{
"http://purl.obolibrary.org/obo/CHEBI_24505": "https://github.com/callahantiff/PheKnowLator/obo/ext/c2591241-8952-44ea-a313-e4b3c5fb6d35",
"http://purl.obolibrary.org/obo/PR_000013648": "https://github.com/callahantiff/PheKnowLator/obo/ext/0ea74deb-0002-4f48-b7e4-81a8fd947312",
"http://purl.obolibrary.org/obo/GO_0050031": "https://github.com/callahantiff/PheKnowLator/obo/ext/8f5c81d4-92dd-426e-a2d9-2be87edb1520"
}
Construction Approach: Subclass-Based
In this approach, each new edge is added as a subclass of an existing ontology class (via `rdfs:subClassOf`) in the knowledge graph.
EXAMPLE: Adding the edge: TGFB1 ➞ `participatesIn` ➞ Influenza Virus Induced Apoptosis
Would require adding:
- `participatesIn`(TGFB1, Influenza Virus Induced Apoptosis)
- `subClassOf`(Influenza Virus Induced Apoptosis, Influenza A pathway)
- `Type`(Influenza Virus Induced Apoptosis, `owl:Class`)
Where TGFB1 is a PR ontology term and Influenza Virus Induced Apoptosis is a non-ontology data node. In this example, Influenza A pathway is an existing ontology class.
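The three statements in the example can be sketched as triples in the same way. This is a hedged illustration only; `subclass_based_edge` is a hypothetical helper, and plain labels stand in for full URIs.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def subclass_based_edge(
    subject: str, relation: str, new_node: str, parent_classes: Iterable[str]
) -> List[Triple]:
    """Return the triples that add `new_node` as a subclass and link it to `subject`."""
    triples = [
        (subject, relation, new_node),    # participatesIn(TGFB1, new node)
        (new_node, RDF_TYPE, OWL_CLASS),  # Type(new node, owl:Class)
    ]
    # One subClassOf triple per existing ontology class the node maps to.
    triples += [(new_node, RDFS_SUBCLASS_OF, parent) for parent in parent_classes]
    return triples


triples = subclass_based_edge(
    "TGFB1",
    "participatesIn",
    "Influenza Virus Induced Apoptosis",
    ["Influenza A pathway"],
)
for triple in triples:
    print(triple)
```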
Outputs: There are no approach-specific output files generated.
Input Requirements for both Approaches: A pickled dictionary, where the keys are node identifiers (non-ontology node data) and the values are lists of ontology class identifiers to subclass, must be added to the `./resources/construction_approach/` directory. An example of this dictionary is shown below:
{
'R-HSA-168277' : ['http://purl.obolibrary.org/obo/PW_0001054',
'http://purl.obolibrary.org/obo/GO_0046730'],
'R-HSA-9026286' : ['http://purl.obolibrary.org/obo/PW_000000001',
'http://purl.obolibrary.org/obo/GO_0019372'],
'100129357' : ['SO_0000043'],
'100129358' : ['SO_0000336'],
}
Please see the Reactome Pathways - Pathway Ontology and Genomic Identifiers - Sequence Ontology sections of the `Data_Preparation.ipynb` Jupyter Notebook for examples of how to construct this document.
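Once such a dictionary exists in memory, writing it out as a pickle file takes only a few lines. The sketch below is an assumption-laden example: the file name `subclass_construction_map.pkl` is hypothetical, and a temporary directory stands in for `./resources/construction_approach/`.

```python
import os
import pickle
import tempfile

# A shortened copy of the example dictionary shown above.
subclass_map = {
    "R-HSA-168277": [
        "http://purl.obolibrary.org/obo/PW_0001054",
        "http://purl.obolibrary.org/obo/GO_0046730",
    ],
    "100129357": ["SO_0000043"],
}

with tempfile.TemporaryDirectory() as directory:
    # Hypothetical file name; use whatever name your build expects.
    path = os.path.join(directory, "subclass_construction_map.pkl")
    with open(path, "wb") as handle:
        pickle.dump(subclass_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Reading the file back confirms that the round trip preserves the data.
    with open(path, "rb") as handle:
        reloaded = pickle.load(handle)

print(reloaded == subclass_map)
```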
Wiki: v2-Data-Sources
Purpose: Several other files are needed to create the data used for filtering and mapping during the creation of knowledge graph edges. For more details on what these data sources are and how they are created, please see the `Data_Preparation.ipynb` Jupyter Notebook.
GitHub Repository Location: resources/relations_data
Purpose: PheKnowLator can be built using a single set of provided relations (i.e., the `owl:ObjectProperty` or edge which is used to connect the nodes in the graph) with or without the inclusion of each relation's inverse.
🛑 CONSTRAINTS 🛑
If you would like the knowledge graph to include relations and their inverse relations, you must add the following to the `./resources/relations_data` directory (an example of what should be included in each file is shown below):
- A `.txt` file of all relations and their labels
- A `.txt` file of the relations and their inverse relations
Filename: INVERSE_RELATIONS.txt
The `owl:inverseOf` property is used to identify each relation's inverse. To make it easier to look up the inverse relations when building the knowledge graph, each relation/inverse relation pair is listed twice, for example:
- location of `owl:inverseOf` located in
- located in `owl:inverseOf` location of
The data in this file should look like:
RO_0003000 RO_0003001
RO_0003001 RO_0003000
RO_0002233 RO_0002352
RO_0002352 RO_0002233
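Because every pair is listed in both directions, the file can be loaded into a simple dictionary for constant-time lookups. The sketch below assumes the two identifiers on each line are separated by whitespace (a tab or spaces); `load_inverse_relations` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def load_inverse_relations(lines: Iterable[str]) -> Dict[str, str]:
    """Map each relation identifier to the identifier of its inverse."""
    inverses = {}
    for line in lines:
        parts = line.split()  # splits on any run of whitespace
        if len(parts) == 2:
            relation, inverse = parts
            inverses[relation] = inverse
    return inverses


# The lines below are copied from the example file contents above.
example_lines = [
    "RO_0003000 RO_0003001",
    "RO_0003001 RO_0003000",
    "RO_0002233 RO_0002352",
    "RO_0002352 RO_0002233",
]
inverse_of = load_inverse_relations(example_lines)
print(inverse_of["RO_0003000"])
print(inverse_of.get("RO_0002434"))  # None when no inverse is listed
```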
Filename: RELATIONS_LABELS.txt
Not all relations have an inverse (e.g., interactions). Even though an inverse relation might not exist, we still want to ensure that all interaction relations are symmetrically represented in the graph. To do this, we need to be able to quickly look up an edge and determine whether it is an interaction. To make this process more efficient, the algorithm expects a list of all relations and their labels in a `.txt` file.
The data in this file should look like:
RO_0002285 developmentally replaces
RO_0002287 part of developmental precursor of
RO_0002490 existence overlaps
RO_0002214 has prototype
Please see the `Data_Preparation.ipynb` Jupyter Notebook for code on how to create these files.
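One way the labels file could support the interaction check is sketched below. Treating any label that contains "interact" as an interaction is an assumption made for this example, not a rule taken from the project; the helper names are hypothetical, and the `RO_0002434` line is added here for illustration on the understanding that the Relation Ontology labels it "interacts with".

```python
from typing import Dict, Iterable


def load_relation_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Map each relation identifier to its (possibly multi-word) label."""
    labels = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # identifier, then the full label
        if len(parts) == 2:
            labels[parts[0]] = parts[1]
    return labels


def is_interaction(relation: str, labels: Dict[str, str]) -> bool:
    """Assumed heuristic: a relation is an interaction if its label says so."""
    return "interact" in labels.get(relation, "").lower()


example_lines = [
    "RO_0002285 developmentally replaces",
    "RO_0002214 has prototype",
    "RO_0002434 interacts with",  # added for illustration (see lead-in)
]
labels = load_relation_labels(example_lines)
print(is_interaction("RO_0002434", labels))
print(is_interaction("RO_0002285", labels))
```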
GitHub Repository Location: resources/node_data
Purpose: The knowledge graph can be built with or without the inclusion of node and relation metadata (i.e., labels, descriptions or definitions, and synonyms). If you'd like to create and use node metadata, please open the `Data_Preparation.ipynb` Jupyter Notebook and run the code chunks listed under the INSTANCE AND/OR SUBCLASS (NON-ONTOLOGY CLASS) METADATA section. These code chunks should be run before the knowledge graph is constructed. For more details on what these data sources are and how they are created, please see the `node_data` `README.md`.
Example structure of the metadata dictionary is shown below:
{
'nodes': {
'http://www.ncbi.nlm.nih.gov/gene/1': {
'Label': 'A1BG',
'Description': "A1BG has locus group 'protein-coding' and is located on chromosome 19 (19q13.43).",
'Synonym': 'HYST2477alpha-1B-glycoprotein|HEL-S-163pA|ABG|A1B|GAB'} ... },
'relations': {
'http://purl.obolibrary.org/obo/RO_0002533': {
'Label': 'sequence atomic unit',
'Description': 'Any individual unit of a collection of like units arranged in a linear order',
'Synonym': 'None'} ... }
}
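A dictionary with this shape can be queried with ordinary nested lookups. The sketch below uses a shortened copy of the example; the helper `get_synonyms`, and the choice to split synonyms on "|" and to treat the string 'None' as "no synonyms", are assumptions based on how the example values look.

```python
from typing import Dict, List

# Shortened copy of the example structure shown above.
metadata = {
    "nodes": {
        "http://www.ncbi.nlm.nih.gov/gene/1": {
            "Label": "A1BG",
            "Description": "A1BG is located on chromosome 19 (19q13.43).",
            "Synonym": "HEL-S-163pA|ABG|A1B|GAB",
        },
    },
    "relations": {
        "http://purl.obolibrary.org/obo/RO_0002533": {
            "Label": "sequence atomic unit",
            "Description": "Any individual unit of a collection of like units "
                           "arranged in a linear order",
            "Synonym": "None",
        },
    },
}


def get_synonyms(data: Dict, kind: str, identifier: str) -> List[str]:
    """Return the synonyms for a node or relation as a list (empty if none)."""
    entry = data.get(kind, {}).get(identifier, {})
    synonym = entry.get("Synonym", "None")
    return [] if synonym == "None" else synonym.split("|")


gene_synonyms = get_synonyms(metadata, "nodes", "http://www.ncbi.nlm.nih.gov/gene/1")
relation_synonyms = get_synonyms(
    metadata, "relations", "http://purl.obolibrary.org/obo/RO_0002533"
)
print(gene_synonyms)
print(relation_synonyms)
```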
🛑 CONSTRAINTS 🛑
The algorithm makes the following assumptions:
- If metadata is provided, only those edges with nodes that have metadata will be created; valid edges without metadata will be discarded.
- Metadata for all non-ontology nodes and all relations for edges added to the core set of ontologies will be saved as a dictionary to `./resources/node_data/node_metadata_dict.pkl`.
- For each identifier we try to obtain the following metadata: `Label`, `Description`, and `Synonym`. An example of these data types is shown below for a `gene` identifier `5620`:
Metadata Type | Definition | Metadata |
---|---|---|
ID | Node identifiers for instance data sources | 5620 |
Label | The primary label or name for the node | LANCL2 |
Description | A definition or other useful details about the node | Lanc Like 2 is a protein-coding gene that is located on chromosome 7 (map_location: 7p11.2) |
Synonym | Alternative terms used for a node | GPR69B, TASP, lanC-like protein 2, G protein-coupled receptor 69B, LanC (bacterial lantibiotic synthetase component C)-like 2, LanC lantibiotic synthetase component C-like 2, testis-specific adriamycin sensitivity protein |