This repository contains scripts and code snippets that serve as utilities for various tasks. The goal is to provide reusable code to avoid rewriting and reinventing the wheel. All content in this repository is open and free to use under a Creative Commons license. Feel free to use and contribute!
1. Parse Hipathia Pathways to CSV for Neo4J: hipath2CSV folder
This script extracts pathway information using the Hipathia package for a specified species and pathway ID. It generates two output files:
- A CSV file containing node attributes.
- A CSV file containing interactions/relations between nodes.
The CSV files can be used for further implementations, such as loading into Neo4J.
- path_id: A number that must be a valid pathway identifier from the KEGG pathway database.
  - Example: 04210 for the Apoptosis pathway.
- Path_ID1 Path_ID2 ... Path_IDN: A space-separated list of pathway IDs to process when using the multiple pathways option.
  - Example: "04210 04150 04010" for processing the Apoptosis pathway, the mTOR signaling pathway, and the MAPK signaling pathway simultaneously.
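The list format described above can be checked before any pathway is processed. The following is a minimal sketch, assuming that each entry is the five-digit numeric part of a KEGG pathway identifier; `validate_path_ids` is a hypothetical helper and is not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this repository): checks that every
# entry in a space-separated list is exactly five digits, e.g. 04210.
validate_path_ids() {
  if [ -z "$1" ]; then
    echo "no pathway IDs given" >&2
    return 1
  fi
  # $1 is left unquoted on purpose so the shell splits it on spaces.
  for vp_id in $1; do
    case $vp_id in
      [0-9][0-9][0-9][0-9][0-9]) ;;  # exactly five digits: accepted
      *) echo "invalid pathway ID: $vp_id" >&2; return 1 ;;
    esac
  done
  return 0
}

validate_path_ids "04210 04150 04010" && echo "all IDs look valid"
```

This only checks the shape of each ID. Whether an ID actually exists for the chosen species is still decided when Hipathia loads the pathway.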
For more information, read this paper about Hipathia.
For a single pathway, you can use the get_path.R script. Run the following command:
Rscript get_path.R --species "hsa" --path_id "04210" --output_folder "pathways"
- -s, --species: Species code (e.g., 'hsa' for Homo sapiens) [default: "hsa"]
- -p, --path_id: Pathway ID (e.g., '04210' for Apoptosis pathway) [default: "04210"]
- -o, --output_folder: Output folder name where the files will be saved [default: "pathways"]
- -q, --quiet: Suppress output messages
Parsing the Apoptosis KEGG pathway for humans:
Rscript get_path.R --species "hsa" --path_id "04210" --output_folder "pathways"
Parsing the mTOR signaling KEGG pathway (04150) for mouse in quiet mode:
Rscript get_path.R --species "mmu" --path_id "04150" --output_folder "mouse_pathways" -q
This script runs get_path.R for a list of pathway IDs in parallel using GNU Parallel. It allows you to process multiple pathways simultaneously, improving efficiency.
Note: If you are using a conda environment, activate it and install GNU Parallel first:
conda activate <your_env>
conda install -c conda-forge parallel
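Before running the wrapper, it may help to confirm that GNU Parallel is actually reachable from the active environment. A small sketch, where `have_command` is a hypothetical helper name introduced only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command parallel; then
  echo "parallel found: $(command -v parallel)"
else
  echo "parallel not found; install it, e.g. with: conda install -c conda-forge parallel"
fi
```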
To use this script, run the following command:
chmod +x get_paths_parallel.sh
./get_paths_parallel.sh "Path_ID1 Path_ID2 ... Path_IDN" "output_folder_name" [-q]
-q : Suppress output messages
Running get_path.R for multiple pathways:
./get_paths_parallel.sh "04210 04150 04010" "ThreePathways"
Running get_path.R for multiple pathways with quiet mode:
./get_paths_parallel.sh "04210 04150 04010" "ThreePathways" -q
Note: -j is hardcoded to 0 inside the get_paths_parallel.sh script, which tells GNU Parallel to run as many jobs in parallel as possible. Adjust this value in the script as needed.
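To illustrate how the job count could be made adjustable without editing the script each time, here is a hedged sketch. It does not reproduce the actual contents of get_paths_parallel.sh, which are not shown in this README. `build_parallel_cmd` is a hypothetical function that only prints the GNU Parallel command line it would run, taking the job count as an optional third argument (default 0, matching the note above).

```shell
#!/bin/sh
# Hypothetical sketch, not the repository's script: prints a GNU Parallel
# command that would call get_path.R once per pathway ID.
# Arguments: "<space-separated IDs>" <output_folder> [jobs]
build_parallel_cmd() {
  bp_ids=$1
  bp_out=$2
  bp_jobs=${3:-0}  # 0 is the value the README says is hardcoded
  case $bp_jobs in
    *[!0-9]*)
      echo "job count must be a non-negative integer: $bp_jobs" >&2
      return 1
      ;;
  esac
  # {} is GNU Parallel's placeholder for each input item after :::
  printf 'parallel -j %s Rscript get_path.R --path_id {} --output_folder %s ::: %s\n' \
    "$bp_jobs" "$bp_out" "$bp_ids"
}

# Print the command for three pathways, limited to 4 simultaneous jobs.
build_parallel_cmd "04210 04150 04010" "ThreePathways" 4
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without R or GNU Parallel. The species option is omitted here, so get_path.R would fall back to its documented default of "hsa".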
This script requires the following R packages: hipathia, igraph, dplyr, optparse.
This repository is licensed under a Creative Commons license. You are free to use, share, and adapt the content for any purpose, provided you give appropriate feedback :).
Kinza Rian
To get started, clone this repository and explore the scripts available. Additional utilities and documentation will be added over time.