
Dependencies

Tiffany J. Callahan edited this page Apr 21, 2020 · 75 revisions

Project Dependencies

Running the code in this repository requires preparing the following input documents:

  1. Master Resources
  2. Ontology Data
  3. Edge Data
  4. Construction Approach
  5. Mapping and Filtering Data
  6. Relations Data
  7. Node Metadata
  8. OWL Properties



Programmatic Assistance
Users who would like assistance with assembling the required input documents should run the generates_dependency_documents.py script from the command line:

python3 pkt/generates_dependency_documents.py




Master Resources


GitHub Repository Location: resources/resource_info.txt

Purpose: This file is used as the master organizer for all project resources.

File Format: The program expects the information to be stored as a "|" delimited file with the following fields:

  • Edge Type: An edge label (node1-node2) - should match Data Sources in the ontology, class edge, and non-class edge input files.
  • Source labels: Three ;-separated items (e.g. :;GO_;GO_): the character used to split existing labels (e.g. : in GO:1283834), followed by the labels for the subject and object nodes. If the existing label is correct, type ;;.
  • Data Type: A label of class or instance provided for each node and separated by - (e.g. class-class).
  • Edge Relation: A Relation Ontology identifier to be used as an edge between the nodes (e.g. RO_0000056).
  • Subject URI: A Universal Resource Identifier that will be connected to the subject node in the Edge Type (e.g. http://purl.obolibrary.org/obo/ or http://purl.uniprot.org/geneid/).
  • Object URI: A Universal Resource Identifier that will be connected to the object node in the Edge Type (e.g. http://purl.obolibrary.org/obo/ or http://purl.uniprot.org/geneid/).
  • Delimiter: A character used to split input text rows into columns (e.g. t for tab-delimited data or , for comma-delimited data).
  • Column Indices: Two column indices separated by ; (e.g. 0;4 for the first and fifth columns in the input data source).
  • Identifier Maps: A string which is used to indicate the index of the edge node column in the input data source that needs to be mapped to a different identifier and a file containing a column that contains the data to map to (e.g. 2:mapping_file_1.txt;4:mapping_file_2.txt - means mapping data from the first node in the edge to the 0th column in mapping_file_1.txt and data from the second node in the edge to the 4th column in mapping_file_2.txt).
  • Evidence Criteria: One or more ::-separated sets, where each set is composed of three ;-separated items (e.g. 4;!=;IEA::8;<;0.0001):
    1. The index of the column to apply the evidence criteria to
      • 4 and 8 from the example above
    2. The operator (i.e. ==, !=, <, >, <=, >=, in, .startswith(), .endswith()) to use when filtering
      • != and < from the example above
    3. The value to filter on (i.e. int, float, str, [])
      • IEA and 0.0001 from the example above
  • Filter Criteria: One or more ::-separated sets, where each set is composed of three ;-separated items (e.g. 5;==;P::7;==;9606):
    1. The index of the column to apply the filter criteria to
      • 5 and 7 from the example above
    2. The operator to use when filtering (i.e. ==, !=, <, >, <=, >=, in, .startswith(), .endswith())
      • == and == from the example above
    3. The value to filter on (i.e. int, float, str, [])
      • P and 9606 from the example above
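
The Evidence Criteria and Filter Criteria strings described above can be unpacked with a few lines of Python. The following is a minimal sketch, not the project's implementation: parse_criteria and row_passes are hypothetical helper names, only the comparison operators are covered, and it assumes values are compared numerically when both sides parse as numbers and as strings otherwise.

```python
import operator

# Comparison operators named in the criteria description (a subset, for illustration).
OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def parse_criteria(spec):
    """Split 'idx;op;value::idx;op;value' into (int index, op, value) tuples."""
    if spec in ("", "None"):
        return []
    criteria = []
    for item in spec.split("::"):
        index, op, value = item.split(";", 2)
        criteria.append((int(index), op, value))
    return criteria


def _coerce(text):
    """Assumption: compare numerically when possible, otherwise as strings."""
    try:
        return float(text)
    except ValueError:
        return text


def row_passes(row, criteria):
    """Return True if a row (list of strings) satisfies every criterion."""
    for index, op, value in criteria:
        left, right = _coerce(row[index]), _coerce(value)
        if type(left) is not type(right):
            # Mixed number/text: fall back to a plain string comparison.
            left, right = row[index], value
        if not OPERATORS[op](left, right):
            return False
    return True


evidence = parse_criteria("4;!=;IEA::8;<;0.0001")
row = ["a", "b", "c", "d", "EXP", "e", "f", "g", "0.00001"]
print(evidence)
print(row_passes(row, evidence))
```

Keeping the parsing step separate from the row test means the same parsed criteria can be reused for every row of an input file.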

NOTE. You can also pass dedup as the operator in a Filter Criteria set (e.g. 2-0;dedup;desc):

  • The column index should be col1-col2:
    • col1 is the column you want to filter on
    • col2 is the primary identifier to deduplicate
  • The value should be asc or desc to indicate the direction to sort the pandas.DataFrame prior to deduplicating
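
One plausible reading of the dedup criterion is sketched below using plain Python lists rather than a pandas.DataFrame. It assumes that rows are sorted on col1 in the requested direction and that the first row for each distinct col2 value is kept; dedup_rows is a hypothetical helper name and the project's actual behavior may differ.

```python
def dedup_rows(rows, spec):
    """Apply a 'col1-col2;dedup;direction' criterion to a list of rows.

    Assumed reading: sort on col1 in the given direction, then keep the
    first row seen for each distinct value in col2.
    """
    columns, keyword, direction = spec.split(";")
    if keyword != "dedup":
        raise ValueError("not a dedup criterion: " + spec)
    sort_col, id_col = (int(part) for part in columns.split("-"))
    ordered = sorted(rows, key=lambda r: r[sort_col], reverse=(direction == "desc"))
    seen, kept = set(), []
    for r in ordered:
        if r[id_col] not in seen:
            seen.add(r[id_col])
            kept.append(r)
    return kept


# Made-up rows: (identifier, other column, score).
rows = [("geneA", "x", 0.2), ("geneA", "y", 0.9), ("geneB", "z", 0.5)]
print(dedup_rows(rows, "2-0;dedup;desc"))
```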

TABLE: An example resource_info.txt file is provided in the table below.

| Edge Type | Source labels | Data Type | Edge Relation | Subject URI | Object URI | Delimiter | Column Indices | Identifier Maps | Evidence Criteria | Filter Criteria |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chemical-gene | ;MESH_; | class-entity | RO_0002434 | http://purl.obolibrary.org/obo/ | http://purl.uniprot.org/geneid/ | t | 1;4 | 0:./resources/data_maps/MESH_CHEBI_MAP.txt | None | 7;==;9606 |
| gene-gene | .;; | entity-entity | RO_0002434 | http://purl.uniprot.org/geneid/ | http://purl.uniprot.org/geneid/ | ' ' | 0;1 | 0:./resources/data_maps/STRING_ENTREZ_MAP.txt;1:./resources/data_maps/STRING_ENTREZ_MAP.txt | 2;>=;700 | None |
| gene-gobp | ;; | entity-class | BFO_0000056 | http://purl.uniprot.org/geneid/ | http://purl.obolibrary.org/obo/ | t | 1;4 | 0:./resources/edge_data/gene-go_goa_class_data.txt | 8;==;P | 12;==;taxon:9606 |
| pathway-disease | ;; | entity-class | RO_0003302 | https://reactome.org/content/detail/ | http://purl.obolibrary.org/obo/ | t | 1;0 | 1:disease-dbxref-map | None | 1;.startswith('R-HSA-'); |
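
To illustrate how one line of the "|" delimited file maps onto the fields listed above, here is a minimal sketch. The dictionary keys in FIELDS and the function parse_resource_line are hypothetical names chosen for this example, and the sketch assumes the file has exactly eleven fields per line in the order shown in the table.

```python
# Hypothetical key names, one per column of resource_info.txt.
FIELDS = [
    "edge_type", "source_labels", "data_type", "edge_relation",
    "subject_uri", "object_uri", "delimiter", "column_indices",
    "identifier_maps", "evidence_criteria", "filter_criteria",
]


def parse_resource_line(line):
    """Turn one '|'-delimited resource_info.txt line into a dict."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Column Indices are two integers separated by ';'.
    record["column_indices"] = [int(i) for i in record["column_indices"].split(";")]
    return record


line = ("chemical-gene|;MESH_;|class-entity|RO_0002434|"
        "http://purl.obolibrary.org/obo/|http://purl.uniprot.org/geneid/|"
        "t|1;4|0:./resources/data_maps/MESH_CHEBI_MAP.txt|None|7;==;9606")
record = parse_resource_line(line)
print(record["edge_type"], record["column_indices"])
```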




Ontology Data


GitHub Repository Location: resources/ontology_source_list.txt

Purpose: This file is used to identify and download specific ontologies.

File Format: The program expects this information to be stored as a "," delimited file.


TABLE: An example ontology_source_list.txt file is provided in the table below.

| Ontology | URL |
| --- | --- |
| disease | http://purl.obolibrary.org/obo/doid.owl |
| go | http://purl.obolibrary.org/obo/go.owl |
| chemical | ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi_lite.owl |
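
A source list in this "," delimited format can be read with the standard csv module. The sketch below is illustrative only: read_source_list is a hypothetical helper, and it assumes the file has no header row and that each line holds a name followed by a URL. The same approach would apply to edge_source_list.txt.

```python
import csv
import io


def read_source_list(handle):
    """Read 'name, url' lines into a dict keyed by name."""
    sources = {}
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sources[row[0].strip()] = row[1].strip()
    return sources


# In-memory stand-in for resources/ontology_source_list.txt.
example = io.StringIO(
    "disease, http://purl.obolibrary.org/obo/doid.owl\n"
    "go, http://purl.obolibrary.org/obo/go.owl\n"
)
ontologies = read_source_list(example)
print(ontologies["go"])
```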




Edge Data


GitHub Repository Location: resources/edge_source_list.txt

Purpose: This file is used to identify and download specific publicly available data sources that will be used to derive edges between ontology classes and instances of ontology classes.

File Format: The program expects this information to be stored as a "," delimited file.


TABLE: An example edge_source_list.txt file is provided in the table below.

| Data Source | URL |
| --- | --- |
| chemical-gene | http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz |
| gene-gobp | http://geneontology.org/gene-associations/goa_human.gaf.gz |
| gene-disease | https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz |
| gene-gene | https://stringdb-static.org/download/protein.links.v11.0/9606.protein.links.v11.0.txt.gz |



Construction Approach


Wiki: KG-Construction

GitHub Repository Location: resources/construction_approach

Purpose: New data can be added to the knowledge graph using two different construction approaches: (1) instance-based or (2) subclass-based. Each approach is described below. For more details, please see resources/construction_approach/README.md.


🛑 CONSTRAINTS 🛑
The algorithm makes the following assumption:

  • The dictionary that maps non-ontology node data to ontology classes has been created and added to the ./resources/construction_approach/ directory.

Construction Approach: Instance-Based
In this approach, each new edge is added as an instance of an existing class (via rdf:type) in the knowledge graph.

EXAMPLE: Adding the edge: Morphine ➞ isSubstanceThatTreats ➞ Migraine

Would require adding:

  • isSubstanceThatTreats(Morphine, x1)
  • Type(x1, Migraine)

In this example, Morphine is a non-ontology data node and Migraine is an HPO ontology term.
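
The two statements above can be sketched as plain (subject, predicate, object) tuples. This is an illustration under stated assumptions rather than the project's code: instance_based_edge is a hypothetical helper, the node names are placeholders instead of real IRIs, and the instance namespace is the one that appears in the example output on this page.

```python
import uuid

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Namespace taken from the example output file shown on this page.
INSTANCE_NS = "https://github.com/callahantiff/PheKnowLator/obo/ext/"


def instance_based_edge(subject, relation, object_class):
    """Return the two triples that link subject to a new instance of object_class."""
    instance = INSTANCE_NS + str(uuid.uuid4())  # anonymous node x1
    return [
        (subject, relation, instance),       # isSubstanceThatTreats(Morphine, x1)
        (instance, RDF_TYPE, object_class),  # Type(x1, Migraine)
    ]


# Placeholder identifiers; real IRIs would come from the input data.
triples = instance_based_edge("Morphine", "isSubstanceThatTreats", "Migraine")
for triple in triples:
    print(triple)
```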

Outputs: A UUID is created for each anonymous node that represents an instance of a class. In order to fully utilize the knowledge graph, a .json file containing the mapping between each UUID instance and its ontology class is output to the ./resources/construction_approach/instance directory. For example:

{
  "http://purl.obolibrary.org/obo/CHEBI_24505": "https://github.com/callahantiff/PheKnowLator/obo/ext/c2591241-8952-44ea-a313-e4b3c5fb6d35",
  "http://purl.obolibrary.org/obo/PR_000013648": "https://github.com/callahantiff/PheKnowLator/obo/ext/0ea74deb-0002-4f48-b7e4-81a8fd947312",
  "http://purl.obolibrary.org/obo/GO_0050031": "https://github.com/callahantiff/PheKnowLator/obo/ext/8f5c81d4-92dd-426e-a2d9-2be87edb1520"
}
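
A file shaped like the example above can be loaded with the standard json module and inverted, so that an instance IRI found in the graph can be resolved back to its ontology class. This is a minimal sketch: the JSON text is embedded as a string so the example runs on its own, whereas a real file would be opened and read with json.load.

```python
import json

# Two entries copied from the example output above.
raw = """{
  "http://purl.obolibrary.org/obo/CHEBI_24505": "https://github.com/callahantiff/PheKnowLator/obo/ext/c2591241-8952-44ea-a313-e4b3c5fb6d35",
  "http://purl.obolibrary.org/obo/GO_0050031": "https://github.com/callahantiff/PheKnowLator/obo/ext/8f5c81d4-92dd-426e-a2d9-2be87edb1520"
}"""

class_to_instance = json.loads(raw)
# Invert the mapping so an instance IRI can be resolved back to its class.
instance_to_class = {inst: cls for cls, inst in class_to_instance.items()}

instance = "https://github.com/callahantiff/PheKnowLator/obo/ext/8f5c81d4-92dd-426e-a2d9-2be87edb1520"
print(instance_to_class[instance])
```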

Construction Approach: Subclass-Based
In this approach, each new edge is added as a subclass of an existing ontology class (via rdfs:subClassOf) in the knowledge graph.

EXAMPLE: Adding the edge: TGFB1 ➞ participatesIn ➞ Influenza Virus Induced Apoptosis

Would require adding:

  • participatesIn(TGFB1, Influenza Virus Induced Apoptosis)
  • subClassOf(Influenza Virus Induced Apoptosis, Influenza A pathway)
  • Type(Influenza Virus Induced Apoptosis, owl:Class)

Here, TGFB1 is a PR ontology term, Influenza Virus Induced Apoptosis is a non-ontology data node, and Influenza A pathway is an existing ontology class.
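
The three statements in this example can be sketched as plain (subject, predicate, object) tuples. As with any illustration here, this is not the project's implementation: subclass_based_edge is a hypothetical helper and the node names are placeholders rather than real IRIs.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def subclass_based_edge(subject, relation, new_node, parent_classes):
    """Return the triples that add new_node as a subclass of each parent class."""
    triples = [
        (subject, relation, new_node),     # participatesIn(TGFB1, new node)
        (new_node, RDF_TYPE, OWL_CLASS),   # Type(new node, owl:Class)
    ]
    # One subClassOf statement per ontology class the node is mapped to.
    triples += [(new_node, RDFS_SUBCLASSOF, parent) for parent in parent_classes]
    return triples


# Placeholder identifiers; real IRIs would come from the input data.
triples = subclass_based_edge(
    "TGFB1", "participatesIn",
    "Influenza Virus Induced Apoptosis", ["Influenza A pathway"],
)
for triple in triples:
    print(triple)
```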

Outputs: There are no approach-specific output files generated.


Input Requirements for both Approaches: A pickled dictionary, where the keys are node identifiers (non-ontology node data) and the values are lists of the ontology class identifiers to subclass, must be added to the ./resources/construction_approach/ directory. An example of this dictionary is shown below:

{
  'R-HSA-168277'  : ['http://purl.obolibrary.org/obo/PW_0001054', 'http://purl.obolibrary.org/obo/GO_0046730'],
  'R-HSA-9026286' : ['http://purl.obolibrary.org/obo/PW_000000001', 'http://purl.obolibrary.org/obo/GO_0019372'],
  '100129357'     : ['SO_0000043'],
  '100129358'     : ['SO_0000336'],
}

Please see the Reactome Pathways - Pathway Ontology and Genomic Identifiers - Sequence Ontology sections of the Data_Preparation.ipynb Jupyter Notebook for examples of how to construct this document.
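
For orientation, a dictionary of this shape can be written and read back with the standard pickle module. The sketch below is illustrative: the file name subclass_map.pkl is hypothetical, and the file is written to a temporary directory so the example runs anywhere. In the project, the file would be placed in ./resources/construction_approach/.

```python
import os
import pickle
import tempfile

# Two entries copied from the example dictionary above.
subclass_map = {
    "R-HSA-168277": ["http://purl.obolibrary.org/obo/PW_0001054",
                     "http://purl.obolibrary.org/obo/GO_0046730"],
    "100129357": ["SO_0000043"],
}

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "subclass_map.pkl")
with open(path, "wb") as handle:
    pickle.dump(subclass_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read the file back to confirm the round trip preserves the mapping.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == subclass_map)
```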




Mapping and Filtering Data


Wiki: v2-Data-Sources

Purpose: Several additional files are needed to filter and map data during the creation of knowledge graph edges. For details on what these data sources are and how they were created, please see the Data_Preparation.ipynb Jupyter Notebook.



Relations Data


GitHub Repository Location: resources/relations_data

Purpose: PheKnowLator can be built using a single set of provided relations (i.e. the owl:ObjectProperty or edge which is used to connect the nodes in the graph) with or without the inclusion of each relation's inverse.



🛑 CONSTRAINTS 🛑
If you would like the knowledge graph to include relations and their inverse relations, you must add the following files to the ./resources/relations_data directory (an example of each is shown below):

  • A .txt file of all relations and their labels
  • A .txt file of the relations and their inverse relations

Filename: INVERSE_RELATIONS.txt

The owl:inverseOf property is used to identify each relation's inverse. To make it easier to look up the inverse relations when building the knowledge graph, each relation/inverse relation pair is listed twice, once in each direction.

The data in this file should look like:

  RO_0003000  RO_0003001
  RO_0003001  RO_0003000
  RO_0002233  RO_0002352
  RO_0002352  RO_0002233
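
Because each pair is listed in both directions, the file can be loaded into a single dictionary that answers lookups from either side. A minimal sketch follows, assuming the two columns are separated by whitespace (a tab or spaces); load_pairs is a hypothetical helper name.

```python
def load_pairs(lines):
    """Map the first whitespace-separated token on each line to the rest of the line."""
    pairs = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            pairs[parts[0]] = parts[1].strip()
    return pairs


# In-memory stand-in for the lines of INVERSE_RELATIONS.txt.
inverse_of = load_pairs([
    "RO_0003000\tRO_0003001",
    "RO_0003001\tRO_0003000",
    "RO_0002233  RO_0002352",
    "RO_0002352  RO_0002233",
])
print(inverse_of["RO_0003000"])
print(inverse_of[inverse_of["RO_0003000"]])
```

Splitting on the first run of whitespace only means the same helper also works for a file whose second column contains spaces, such as a list of relation labels.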

Filename: RELATIONS_LABELS.txt

Not all relations have an inverse (e.g. interactions). Even when a relation has no inverse, we still want to ensure that all interaction relations are symmetrically represented in the graph. To do this, we need to be able to quickly look up an edge and determine whether it is an interaction. To make this process more efficient, the algorithm expects a list of all relations and their labels as a .txt file.

The data in this file should look like:

  RO_0002285  developmentally replaces
  RO_0002287  part of developmental precursor of
  RO_0002490  existence overlaps
  RO_0002214  has prototype

Please see the Data_Preparation.ipynb Jupyter Notebook for code on how to create these files.
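
To show how the two relations files might work together, the sketch below derives the extra edge implied by an inverse or a symmetric interaction relation. It rests on assumptions that are not confirmed by this page: reverse_edge is a hypothetical helper, and an interaction is recognized simply by the word "interact" appearing in the relation's label.

```python
def reverse_edge(edge, inverse_of, labels):
    """Return the extra edge implied by an inverse or interaction relation, if any."""
    subject, relation, obj = edge
    if relation in inverse_of:
        return (obj, inverse_of[relation], subject)
    # Assumption: interaction relations are recognized by their label text.
    if "interact" in labels.get(relation, "").lower():
        return (obj, relation, subject)
    return None


# Small in-memory stand-ins for INVERSE_RELATIONS.txt and RELATIONS_LABELS.txt.
inverse_of = {"RO_0002233": "RO_0002352", "RO_0002352": "RO_0002233"}
labels = {"RO_0002434": "interacts with", "RO_0002214": "has prototype"}

print(reverse_edge(("a", "RO_0002233", "b"), inverse_of, labels))
print(reverse_edge(("a", "RO_0002434", "b"), inverse_of, labels))
print(reverse_edge(("a", "RO_0002214", "b"), inverse_of, labels))
```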




Node Metadata


GitHub Repository Location: resources/node_data

Purpose: The knowledge graph can be built with or without the inclusion of instance node metadata (i.e. labels, descriptions or definitions, and synonyms). If you'd like to create and use node metadata, please see the Data_Preparation.ipynb Jupyter Notebook and run the code chunks listed under the Gather Node Metadata Data section. These code chunks should only be run once the edge lists have been created, but before the knowledge graph is constructed. For more details on what these data sources are and how they were created, please see the node_directory README.md.


🛑 CONSTRAINTS 🛑
The algorithm makes the following assumptions:

  • If metadata is provided, only those edges with nodes that have metadata will be created. All valid edges without metadata will be discarded.
  • All data files with node metadata are in the ./resources/node_data directory.
  • Each metadata file, in addition to the primary node identifier (labeled as ID), will contain 1 to 3 columns labeled: Label, Description, and Synonym. An example of these data types is shown below.

At this time, the knowledge graph will include the following metadata types for a gene identifier 5620:

| Metadata Type | Definition | Metadata |
| --- | --- | --- |
| ID | Node identifiers for instance data sources | 5620 |
| Label | The primary label or name for the node | LANCL2 |
| Description | A definition or other useful details about the node | Lanc Like 2 is a protein-coding gene that is located on chromosome 7 (map_location: 7p11.2) |
| Synonym | Alternative terms used for a node | GPR69B, TASP, lanC-like protein 2, G protein-coupled receptor 69B, LanC (bacterial lantibiotic synthetase component C)-like 2, LanC lantibiotic synthetase component C-like 2, testis-specific adriamycin sensitivity protein |

The metadata will be used to create the following edges in the knowledge graph:

  • Label ➞ node rdfs:label
  • Description ➞ node obo:IAO_0000115 description
  • Synonyms ➞ node oboInOwl:hasExactSynonym synonym
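
The mapping from metadata columns to edges listed above can be sketched as plain (subject, predicate, object) tuples. This is an illustration, not the project's code: metadata_triples is a hypothetical helper, the node IRI combines the gene identifier with the geneid namespace used elsewhere on this page, and synonyms are passed as a list because the delimiter used inside the Synonym column is not specified here.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
IAO_DEFINITION = "http://purl.obolibrary.org/obo/IAO_0000115"
HAS_EXACT_SYNONYM = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"


def metadata_triples(node_iri, label=None, description=None, synonyms=()):
    """Build one triple per piece of metadata that is present for a node."""
    triples = []
    if label:
        triples.append((node_iri, RDFS_LABEL, label))
    if description:
        triples.append((node_iri, IAO_DEFINITION, description))
    for synonym in synonyms:
        triples.append((node_iri, HAS_EXACT_SYNONYM, synonym))
    return triples


triples = metadata_triples(
    "http://purl.uniprot.org/geneid/5620",
    label="LANCL2",
    synonyms=["GPR69B", "TASP"],
)
for triple in triples:
    print(triple)
```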




OWL Properties


GitHub Repository Location: resources/owl_decoding

Purpose: The PheKnowLator program includes functionality to remove OWL semantics from a knowledge graph using an updated implementation of OWL-NETS. If the user chooses to run OWL-NETS, they will need to provide a list of all owl:ObjectProperty types they would like to be included. An example list can be found here.




This project is licensed under Apache License 2.0 - see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}