
Commit e22474b

Erik_L committed
Revert "Merge pull request Reed-CompBio#79 from agitter/refactoring"
This reverts commit ea5b5fd, reversing changes made to 5334cd4.
1 parent 429add6 commit e22474b

37 files changed: +626 -144 lines changed

CONTRIBUTING.md

Lines changed: 3 additions & 3 deletions
@@ -76,7 +76,7 @@ These entries are used to tell Snakemake what input files should be present befo
 Implement the `generate_inputs` function, following the `omicsintegrator1.py` example.
 The nodes should be any node in the dataset that has a prize set, any node that is a source, or any node that is a target.
 The network should be all of the edges written in the format `<vertex1>|<vertex2>`.
-`src/dataset.py` provides functions that provide access to node information and the interactome (edge list).
+`Dataset.py` provides functions that provide access to node information and the interactome (edge list).
 
 Implement the `run` function, following the Path Linker example.
 The `prepare_volume` utility function is needed to prepare the network and nodes input files to be mounted and used inside the container.
@@ -89,7 +89,7 @@ Use the `run_container` utility function to run the command in the container `<u
 
 Implement the `parse_output` function.
 The edges in the Local Neighborhood output have the same format as the input, `<vertex1>|<vertex2>`.
-Convert these to be tab-separated vertex pairs followed by a tab and a `1` at the end of every line, which indicates all edges have the same rank.
+Convert these to be space-separated vertex pairs followed by a space and a `1` at the end of every line, which indicates all edges have the same rank.
 The output should have the format `<vertex1> <vertex2> 1`.
 
 ### Step 4: Make the Local Neighborhood wrapper accessible through SPRAS
@@ -130,7 +130,7 @@ The pull request will be closed so that future contributors can practice with th
 1. Add a new subdirectory to `docker-wrappers` with the name `<algorithm>`, write a `Dockerfile` to build an image for `<algorithm>`, and include any other files required to build that image in the subdirectory
 1. Build and push the Docker image to the [reedcompbio](https://hub.docker.com/orgs/reedcompbio) Docker organization (SPRAS maintainer required)
 1. Add a new Python file `src/<algorithm>.py` to implement the wrapper functions for `<algorithm>`: specify the list of `required_input` files and the `generate_inputs`, `run`, and `parse_output` functions
-1. Import the new class in `runner.py` so the wrapper functions can be accessed
+1. Import the new class in `PRRunner.py` so the wrapper functions can be accessed
 1. Document the usage of the Docker wrapper and the assumptions made when implementing the wrapper
 1. Add example usage for the new algorithm and its parameters to the template config file
 1. Write test functions and provide example input data in a new test subdirectory `test/<algorithm>`
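
For reference, a minimal sketch of the `parse_output` conversion described in the hunk above. The function signature and file handling here are assumptions for illustration, not the SPRAS wrapper itself.

```python
# Hypothetical sketch (not the SPRAS implementation) of the conversion described
# above: each 'vertex1|vertex2' line in the raw Local Neighborhood output
# becomes 'vertex1 vertex2 1'.
def parse_output(raw_pathway_file, standardized_pathway_file):
    with open(raw_pathway_file) as raw, open(standardized_pathway_file, 'w') as out:
        for line in raw:
            line = line.strip()
            if not line:
                continue
            vertex1, vertex2 = line.split('|')
            # The trailing 1 indicates all edges share the same rank
            out.write(f'{vertex1} {vertex2} 1\n')
```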
File renamed without changes.

src/runner.py renamed to PRRunner.py

Lines changed: 3 additions & 4 deletions
@@ -1,4 +1,4 @@
-from src.dataset import Dataset
+import Dataset
 
 # supported algorithm imports
 from src.local_neighborhood import LocalNeighborhood as local_neighborhood
@@ -8,7 +8,6 @@
 from src.pathlinker import PathLinker as pathlinker
 from src.mincostflow import MinCostFlow as mincostflow
 
-
 def run(algorithm, params):
     """
     A generic interface to the algorithm-specific run functions
@@ -39,7 +38,7 @@ def merge_input(dataset_dict, dataset_file):
     @param dataset_dict: dataset to process
     @param dataset_file: output filename
     """
-    dataset = Dataset(dataset_dict)
+    dataset = Dataset.Dataset(dataset_dict)
     dataset.to_file(dataset_file)
 
 
@@ -51,7 +50,7 @@ def prepare_inputs(algorithm, data_file, filename_map):
     @param filename_map: a dict mapping file types in the required_inputs to the filename for that type
     @return:
     """
-    dataset = Dataset.from_file(data_file)
+    dataset = Dataset.Dataset.from_file(data_file)
     try:
         algorithm_runner = globals()[algorithm.lower()]
     except KeyError:
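
The aliased imports at the top of this file are what make the `globals()` lookup in `prepare_inputs` work: each wrapper class is bound to a lowercase module-level name matching the algorithm string. A small standalone sketch of that dispatch pattern, using a hypothetical class rather than the real SPRAS wrappers:

```python
# Hypothetical class standing in for a SPRAS wrapper class
class PathLinker:
    @staticmethod
    def run(**params):
        print('running PathLinker with', params)

# Bind the class to a lowercase alias, mirroring the aliased imports above
pathlinker = PathLinker

def run(algorithm, params):
    # Look up the wrapper class by its lowercase name, as PRRunner does
    try:
        algorithm_runner = globals()[algorithm.lower()]
    except KeyError:
        raise NotImplementedError(f'{algorithm} is not a supported algorithm')
    algorithm_runner.run(**params)

run('PathLinker', {'k': 100})
```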

README.md

Lines changed: 5 additions & 1 deletion
@@ -62,7 +62,7 @@ The Docker images are available on [DockerHub](https://hub.docker.com/orgs/reedc
 **Python wrapper for calling algorithms**: Wrapper functions provide an interface between the common file formats for input and output data and the algorithm-specific file formats and reconstruction commands.
 These wrappers are in the `src/` subdirectory.
 
-**Test code**: Tests for the Docker wrappers and SPRAS code.
+**Test code**: Tests for the Docker wrappers.
 The tests require the conda environment in `environment.yml` and Docker.
 Run the tests with `pytest -s`.
 
@@ -71,6 +71,10 @@ Some computing environments are unable to run Docker and prefer Singularity as t
 SPRAS has limited experimental support for Singularity instead of Docker, and only for some pathway reconstruction algorithms.
 SPRAS uses the spython package to interface with Singularity, which only supports Linux.
 
+## Docker demo
+The `docker-demo` subdirectory is not used by the main pathway reconstruction framework.
+It serves as a reference for how to set up Dockerfiles and make Docker run calls.
+
 ## Attribution
 SPRAS builds on public datasets and algorithms.
 If you use SPRAS in a research project, please cite the original datasets and algorithms in addition to SPRAS.

Snakefile

Lines changed: 13 additions & 11 deletions
@@ -1,10 +1,12 @@
 import os
-from src import runner
+import PRRunner
 import shutil
 import yaml
-from src.dataset import Dataset
+from Dataset import Dataset
 from src.util import process_config
-from src.analysis import ml, summary, graphspace
+from src.analysis.summary import summary
+from src.analysis.viz import graphspace
+from src.analysis.ml import ml
 
 # Snakemake updated the behavior in the 6.5.0 release https://github.com/snakemake/snakemake/pull/1037
 # and using the wrong separator prevents Snakemake from matching filenames to the rules that can produce them
@@ -125,7 +127,7 @@ rule merge_input:
     run:
         # Pass the dataset to PRRunner where the files will be merged and written to disk (i.e. pickled)
         dataset_dict = get_dataset(datasets, wildcards.dataset)
-        runner.merge_input(dataset_dict, output.dataset_file)
+        PRRunner.merge_input(dataset_dict, output.dataset_file)
 
 # The checkpoint is like a rule but can be used in dynamic workflows
 # The workflow directed acyclic graph is re-evaluated after the checkpoint job runs
@@ -144,8 +146,8 @@ checkpoint prepare_input:
         # Use the algorithm's generate_inputs function to load the merged dataset, extract the relevant columns,
         # and write the output files specified by required_inputs
         # The filename_map provides the output file path for each required input file type
-        filename_map = {input_type: SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs', f'{input_type}.txt']) for input_type in runner.get_required_inputs(wildcards.algorithm)}
-        runner.prepare_inputs(wildcards.algorithm, input.dataset_file, filename_map)
+        filename_map = {input_type: SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs', f'{input_type}.txt']) for input_type in PRRunner.get_required_inputs(wildcards.algorithm)}
+        PRRunner.prepare_inputs(wildcards.algorithm, input.dataset_file, filename_map)
 
 # Collect the prepared input files from the specified directory
 # If the directory does not exist for this dataset-algorithm pair, the checkpoint will detect that
@@ -162,7 +164,7 @@ def collect_prepared_input(wildcards):
     prepared_dir = SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs'])
 
     # Construct the list of expected prepared input files for the reconstruction algorithm
-    prepared_inputs = expand(f'{prepared_dir}{SEP}{{type}}.txt', type=runner.get_required_inputs(algorithm=wildcards.algorithm))
+    prepared_inputs = expand(f'{prepared_dir}{SEP}{{type}}.txt', type=PRRunner.get_required_inputs(algorithm=wildcards.algorithm))
     # If the directory is missing, do nothing because the missing output triggers running prepare_input
     if os.path.isdir(prepared_dir):
         # If the directory exists, confirm all prepared input files exist as well (as opposed to some or none)
@@ -193,7 +195,7 @@ rule reconstruct:
         # Create a copy so that the updates are not written to the parameters logfile
        params = reconstruction_params(wildcards.algorithm, wildcards.params).copy()
         # Add the input files
-        params.update(dict(zip(runner.get_required_inputs(wildcards.algorithm), *{input})))
+        params.update(dict(zip(PRRunner.get_required_inputs(wildcards.algorithm), *{input})))
         # Add the output file
         # All run functions can accept a relative path to the output file that should be written that is called 'output_file'
         params['output_file'] = output.pathway_file
@@ -203,15 +205,15 @@ rule reconstruct:
         # TODO consider the best way to pass global configuration information to the run functions
         # This approach requires that all run functions support a singularity option
         params['singularity'] = SINGULARITY
-        runner.run(wildcards.algorithm, params)
+        PRRunner.run(wildcards.algorithm, params)
 
 # Original pathway reconstruction output to universal output
 # Use PRRunner as a wrapper to call the algorithm-specific parse_output
 rule parse_output:
     input: raw_file = SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'raw-pathway.txt'])
     output: standardized_file = SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'pathway.txt'])
     run:
-        runner.parse_output(wildcards.algorithm, input.raw_file, output.standardized_file)
+        PRRunner.parse_output(wildcards.algorithm, input.raw_file, output.standardized_file)
 
 # Collect summary statistics for a single pathway
 rule summarize_pathway:
@@ -245,7 +247,7 @@ rule summary_table:
         summary_df.to_csv(output.summary_table, sep='\t', index=False)
 
 # Cluster the output pathways for each dataset
-rule ml_analysis:
+rule ml:
     input:
         pathways = expand('{out_dir}{sep}{{dataset}}-{algorithm_params}{sep}pathway.txt', out_dir=out_dir, sep=SEP, algorithm_params=algorithms_with_params)
     output:

config/config_prototype.yaml

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
#
# This list of algorithms should be generated by a script which checks the filesystem for installs.
# It shouldn't be changed by mere mortals. (alternatively, we could add a path to executable for each algorithm
# in the list to reduce the number of assumptions of the program at the cost of making the config a little more involved)
# Each algorithm has an 'include' parameter. By toggling 'include' to true/false the user can change
# which algorithms are run in a given experiment.
#
# algorithm-specific parameters are embedded in lists so that users can specify multiple. If multiple
# parameters are specified then the algorithm will be run as many times as needed to cover all parameter
# combinations. For instance if we have the following:
# - name: "myAlg"
#   params:
#     include: true
#     a: [1,2]
#     b: [0.5,0.75]
#
# then myAlg will be run on (a=1,b=0.5), (a=1,b=0.75), (a=2,b=0.5), and (a=2,b=0.75). Pretty neat, but be
# careful: too many parameters might make your runs take a long time.

algorithms:
  - name: "bowtiebuilder"
    params:
      include: false
      run1:

  - name: "pathlinker"
    params:
      include: true
      run1:
        k: range(100,201,100)

  - name: "responsenet"
    params:
      include: false
      run1:
        y: [20]

  - name: "omicsintegrator1"
    params:
      include: true
      run1:
        r: [5]
        b: [5,6]
        w: np.linspace(0,5,2)
        g: [3]
        d: [10]

  - name: "omicsintegrator2"
    params:
      include: true
      run1:
        b: [4]
        g: [0]
      run2:
        b: [2]
        g: [3]

  - name: "shortestpaths"
    params:
      include: false
      run1:

  - name: "rwr"
    params:
      include: false
      run1:
        a: [0.85]
        t: [0.3]

# Here we specify which pathways to run and other file location information.
# DataLoader.py can currently only load a single dataset
datasets:
  -
    label: data0
    node_files: ["node-prizes.txt", "sources.txt", "targets.txt"]
    # DataLoader.py can currently only load a single edge file, which is the primary network
    edge_files: ["network.txt"]
    # Placeholder
    other_files: []
    # Relative path from the spras directory
    data_dir: "input"
  -
    label: data1
    # Reuse some of the same sources file as 'data0' but different network and targets
    node_files: ["sources.txt", "alternative-targets.txt"]
    edge_files: ["alternative-network.txt"]
    other_files: []
    # Relative path from the spras directory
    data_dir: "input"

# If we want to reconstruct then we should set run to true.
reconstruction_settings:

  # set where everything is saved
  locations:

    # place the path to your pathway annotation files here
    pathway_dir: "/path/to/pathways"

    # place the save path here
    reconstruction_dir: "output"

  run: true

# Do we want to augment our reconstructions
augmentation_settings:

  # save locations
  locations:
    reconstruction_dir: "/path/to/save"

  PRAUG:
    run: true

  ensemble:
    run: false
    rule: "Intersection"

evaluation_settings:

  locations:
    reconstruction_dir: "/path/to/reconstructions"

  PR:
    compute: true

    # the open-world assumption means that being unannotated does not imply being absent from the pathway.
    # it allows for sub-sampling of negatives.
    open_word: true

    # in the case that we use the open-world assumption, we need to tell the program how to choose negatives.
    # The options here are "fixed" or "random". In the case that "fixed" is selected the program will use
    # negative values picked by the authors of this software. If "random" is chosen, the program will sample
    # from unannotated interactions/proteins at a rate of 50 negatives to every positive.
    negative_set: "fixed"

# Plotting
plot_settings:

  locations:
    reconstruction_dir: "/path/to/reconstructions"
    plot_dir: "/path/to/plots"

  plots:
    - name: "example_plot_1"
      type: "PR"
      # where panel determines whether many subfigures are used or just 1.
      # fmax determines whether to display fmax in the legend
      # edge determines whether to plot edge PR or node PR
      params: ["panel=True","fmax=True","edge=True"]
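
The comment at the top of this config describes how giving a parameter multiple values produces one run per combination. A minimal sketch of that expansion using `itertools.product`; the `expand_params` helper is hypothetical and may differ from SPRAS's actual expansion logic:

```python
from itertools import product

# Hypothetical illustration of the parameter expansion described in the config
# comments; SPRAS's real implementation may differ.
def expand_params(param_lists):
    keys = list(param_lists)
    for values in product(*(param_lists[k] for k in keys)):
        yield dict(zip(keys, values))

# Example from the config comment: a: [1,2], b: [0.5,0.75] -> four runs
for combo in expand_params({'a': [1, 2], 'b': [0.5, 0.75]}):
    print(combo)
# {'a': 1, 'b': 0.5}, {'a': 1, 'b': 0.75}, {'a': 2, 'b': 0.5}, {'a': 2, 'b': 0.75}
```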

docker-demo/Dockerfile

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Test activating conda environment before running command inside container
# Uses the strategy from https://pythonspeed.com/articles/activate-conda-dockerfile/
# by Itamar Turner-Trauring
FROM continuumio/miniconda3

COPY env.yml .
RUN conda env create -f env.yml

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "test"]

docker-demo/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Docker tests

This subdirectory contains examples of using Docker's Python API https://github.com/docker/docker-py.
It uses the SINGE [example data](https://github.com/gitter-lab/SINGE/tree/master/data1) and [Docker image](https://hub.docker.com/r/agitter/singe) with a reduced set of hyperparameters.
The docker-py API is more readable than the similar [BEELINE Docker command](https://github.com/Murali-group/Beeline/blob/7f6e07a3cb784227bf3fa889fe0c36e731c22c5c/BLRun/singeRunner.py#L110-L116) and most likely also more robust across different operating systems.

## Installation

Install docker-py with the command `pip install docker`.

The Docker client must be installed separately.

## Usage

Before running `docker-demo.py`, start the Docker client and install the `docker` Python package.
Then, from this `docker-demo` directory run the command:
```
python docker-demo.py
```

SINGE will run inside Docker, which takes a few minutes.
The output files will be written to the `output` subdirectory.

If the Docker image `agitter/singe:0.4.1` is not already available locally, the script will automatically pull it from [DockerHub](https://hub.docker.com/r/agitter/singe).

## Activating conda inside a Docker container

By default, an installed conda environment will not be activated inside the Docker container.
Docker does not invoke Bash as a login shell.
[This blog post](https://pythonspeed.com/articles/activate-conda-dockerfile/) provides a workaround demonstrated here in `Dockerfile` and `env.yml`.
It defines a custom ENTRYPOINT that uses `conda run` to run the command inside the conda environment.

To create the Docker image run:
```
docker build -t conda-test/conda-test -f Dockerfile .
```

To confirm that commands are run inside the conda environment run:
```
winpty docker run conda-test/conda-test conda list
winpty docker run conda-test/conda-test python -c "import networkx; print(networkx.__version__)"
```
The `winpty` prefix is only needed on Windows.
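
For reference, a minimal docker-py call along the lines of what a script like `docker-demo.py` does. The image tag comes from this README, but the command, volume mount, and paths below are illustrative assumptions rather than the exact demo script:

```python
import docker

# Minimal docker-py sketch; the image tag is from this README, but the command,
# host path, and bind path are placeholder assumptions.
client = docker.from_env()
logs = client.containers.run(
    image='agitter/singe:0.4.1',
    command=['ls', '/data'],  # placeholder command
    volumes={'/absolute/host/path': {'bind': '/data', 'mode': 'rw'}},
    remove=True,
)
print(logs.decode())
```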
