
Commit e22474b

Erik_L committed
Revert "Merge pull request Reed-CompBio#79 from agitter/refactoring"
This reverts commit ea5b5fd, reversing changes made to 5334cd4.
1 parent 429add6 commit e22474b

37 files changed: +626 -144 lines changed

CONTRIBUTING.md

Lines changed: 3 additions & 3 deletions
@@ -76,7 +76,7 @@ These entries are used to tell Snakemake what input files should be present befo
 Implement the `generate_inputs` function, following the `omicsintegrator1.py` example.
 The nodes should be any node in the dataset that has a prize set, any node that is a source, or any node that is a target.
 The network should be all of the edges written in the format `<vertex1>|<vertex2>`.
-`src/dataset.py` provides functions that provide access to node information and the interactome (edge list).
+`Dataset.py` provides functions that provide access to node information and the interactome (edge list).
 
 Implement the `run` function, following the Path Linker example.
 The `prepare_volume` utility function is needed to prepare the network and nodes input files to be mounted and used inside the container.
@@ -89,7 +89,7 @@ Use the `run_container` utility function to run the command in the container `<u
 
 Implement the `parse_output` function.
 The edges in the Local Neighborhood output have the same format as the input, `<vertex1>|<vertex2>`.
-Convert these to be tab-separated vertex pairs followed by a tab and a `1` at the end of every line, which indicates all edges have the same rank.
+Convert these to be space-separated vertex pairs followed by a space and a `1` at the end of every line, which indicates all edges have the same rank.
 The output should have the format `<vertex1> <vertex2> 1`.
 
 ### Step 4: Make the Local Neighborhood wrapper accessible through SPRAS
@@ -130,7 +130,7 @@ The pull request will be closed so that future contributors can practice with th
 1. Add a new subdirectory to `docker-wrappers` with the name `<algorithm>`, write a `Dockerfile` to build an image for `<algorithm>`, and include any other files required to build that image in the subdirectory
 1. Build and push the Docker image to the [reedcompbio](https://hub.docker.com/orgs/reedcompbio) Docker organization (SPRAS maintainer required)
 1. Add a new Python file `src/<algorithm>.py` to implement the wrapper functions for `<algorithm>`: specify the list of `required_input` files and the `generate_inputs`, `run`, and `parse_output` functions
-1. Import the new class in `runner.py` so the wrapper functions can be accessed
+1. Import the new class in `PRRunner.py` so the wrapper functions can be accessed
 1. Document the usage of the Docker wrapper and the assumptions made when implementing the wrapper
 1. Add example usage for the new algorithm and its parameters to the template config file
 1. Write test functions and provide example input data in a new test subdirectory `test/<algorithm>`
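
For reference, a minimal sketch of the `parse_output` conversion described in the hunk above. The function signature and file handling here are assumptions for illustration, not the SPRAS wrapper itself.

```python
# Hypothetical sketch (not the SPRAS implementation) of the conversion described
# above: each 'vertex1|vertex2' line in the raw Local Neighborhood output
# becomes 'vertex1 vertex2 1'.
def parse_output(raw_pathway_file, standardized_pathway_file):
    with open(raw_pathway_file) as raw, open(standardized_pathway_file, 'w') as out:
        for line in raw:
            line = line.strip()
            if not line:
                continue
            vertex1, vertex2 = line.split('|')
            # The trailing 1 indicates all edges share the same rank
            out.write(f'{vertex1} {vertex2} 1\n')
```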
File renamed without changes.

src/runner.py renamed to PRRunner.py

Lines changed: 3 additions & 4 deletions
@@ -1,4 +1,4 @@
-from src.dataset import Dataset
+import Dataset
 
 # supported algorithm imports
 from src.local_neighborhood import LocalNeighborhood as local_neighborhood
@@ -8,7 +8,6 @@
 from src.pathlinker import PathLinker as pathlinker
 from src.mincostflow import MinCostFlow as mincostflow
 
-
 def run(algorithm, params):
     """
     A generic interface to the algorithm-specific run functions
@@ -39,7 +38,7 @@ def merge_input(dataset_dict, dataset_file):
     @param dataset_dict: dataset to process
     @param dataset_file: output filename
     """
-    dataset = Dataset(dataset_dict)
+    dataset = Dataset.Dataset(dataset_dict)
     dataset.to_file(dataset_file)
 
 
@@ -51,7 +50,7 @@ def prepare_inputs(algorithm, data_file, filename_map):
     @param filename_map: a dict mapping file types in the required_inputs to the filename for that type
     @return:
     """
-    dataset = Dataset.from_file(data_file)
+    dataset = Dataset.Dataset.from_file(data_file)
     try:
         algorithm_runner = globals()[algorithm.lower()]
     except KeyError:
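
The aliased imports at the top of this file are what make the `globals()` lookup in `prepare_inputs` work: each wrapper class is bound to a lowercase module-level name matching the algorithm string. A small standalone sketch of that dispatch pattern, using a hypothetical class rather than the real SPRAS wrappers:

```python
# Hypothetical class standing in for a SPRAS wrapper class
class PathLinker:
    @staticmethod
    def run(**params):
        print('running PathLinker with', params)

# Bind the class to a lowercase alias, mirroring the aliased imports above
pathlinker = PathLinker

def run(algorithm, params):
    # Look up the wrapper class by its lowercase name, as PRRunner does
    try:
        algorithm_runner = globals()[algorithm.lower()]
    except KeyError:
        raise NotImplementedError(f'{algorithm} is not a supported algorithm')
    algorithm_runner.run(**params)

run('PathLinker', {'k': 100})
```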

README.md

Lines changed: 5 additions & 1 deletion
@@ -62,7 +62,7 @@ The Docker images are available on [DockerHub](https://hub.docker.com/orgs/reedc
 **Python wrapper for calling algorithms**: Wrapper functions provide an interface between the common file formats for input and output data and the algorithm-specific file formats and reconstruction commands.
 These wrappers are in the `src/` subdirectory.
 
-**Test code**: Tests for the Docker wrappers and SPRAS code.
+**Test code**: Tests for the Docker wrappers.
 The tests require the conda environment in `environment.yml` and Docker.
 Run the tests with `pytest -s`.
 
@@ -71,6 +71,10 @@ Some computing environments are unable to run Docker and prefer Singularity as t
 SPRAS has limited experimental support for Singularity instead of Docker, and only for some pathway reconstruction algorithms.
 SPRAS uses the spython package to interface with Singularity, which only supports Linux.
 
+## Docker demo
+The `docker-demo` subdirectory is not used by the main pathway reconstruction framework.
+It serves as a reference for how to set up Dockerfiles and make Docker run calls.
+
 ## Attribution
 SPRAS builds on public datasets and algorithms.
 If you use SPRAS in a research project, please cite the original datasets and algorithms in addition to SPRAS.

Snakefile

Lines changed: 13 additions & 11 deletions
@@ -1,10 +1,12 @@
 import os
-from src import runner
+import PRRunner
 import shutil
 import yaml
-from src.dataset import Dataset
+from Dataset import Dataset
 from src.util import process_config
-from src.analysis import ml, summary, graphspace
+from src.analysis.summary import summary
+from src.analysis.viz import graphspace
+from src.analysis.ml import ml
 
 # Snakemake updated the behavior in the 6.5.0 release https://github.com/snakemake/snakemake/pull/1037
 # and using the wrong separator prevents Snakemake from matching filenames to the rules that can produce them
@@ -125,7 +127,7 @@ rule merge_input:
     run:
         # Pass the dataset to PRRunner where the files will be merged and written to disk (i.e. pickled)
         dataset_dict = get_dataset(datasets, wildcards.dataset)
-        runner.merge_input(dataset_dict, output.dataset_file)
+        PRRunner.merge_input(dataset_dict, output.dataset_file)
 
 # The checkpoint is like a rule but can be used in dynamic workflows
 # The workflow directed acyclic graph is re-evaluated after the checkpoint job runs
@@ -144,8 +146,8 @@ checkpoint prepare_input:
         # Use the algorithm's generate_inputs function to load the merged dataset, extract the relevant columns,
         # and write the output files specified by required_inputs
         # The filename_map provides the output file path for each required input file type
-        filename_map = {input_type: SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs', f'{input_type}.txt']) for input_type in runner.get_required_inputs(wildcards.algorithm)}
-        runner.prepare_inputs(wildcards.algorithm, input.dataset_file, filename_map)
+        filename_map = {input_type: SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs', f'{input_type}.txt']) for input_type in PRRunner.get_required_inputs(wildcards.algorithm)}
+        PRRunner.prepare_inputs(wildcards.algorithm, input.dataset_file, filename_map)
 
 # Collect the prepared input files from the specified directory
 # If the directory does not exist for this dataset-algorithm pair, the checkpoint will detect that
@@ -162,7 +164,7 @@ def collect_prepared_input(wildcards):
     prepared_dir = SEP.join([out_dir, 'prepared', f'{wildcards.dataset}-{wildcards.algorithm}-inputs'])
 
     # Construct the list of expected prepared input files for the reconstruction algorithm
-    prepared_inputs = expand(f'{prepared_dir}{SEP}{{type}}.txt', type=runner.get_required_inputs(algorithm=wildcards.algorithm))
+    prepared_inputs = expand(f'{prepared_dir}{SEP}{{type}}.txt', type=PRRunner.get_required_inputs(algorithm=wildcards.algorithm))
     # If the directory is missing, do nothing because the missing output triggers running prepare_input
     if os.path.isdir(prepared_dir):
         # If the directory exists, confirm all prepared input files exist as well (as opposed to some or none)
@@ -193,7 +195,7 @@ rule reconstruct:
         # Create a copy so that the updates are not written to the parameters logfile
        params = reconstruction_params(wildcards.algorithm, wildcards.params).copy()
         # Add the input files
-        params.update(dict(zip(runner.get_required_inputs(wildcards.algorithm), *{input})))
+        params.update(dict(zip(PRRunner.get_required_inputs(wildcards.algorithm), *{input})))
         # Add the output file
         # All run functions can accept a relative path to the output file that should be written that is called 'output_file'
         params['output_file'] = output.pathway_file
@@ -203,15 +205,15 @@ rule reconstruct:
         # TODO consider the best way to pass global configuration information to the run functions
         # This approach requires that all run functions support a singularity option
         params['singularity'] = SINGULARITY
-        runner.run(wildcards.algorithm, params)
+        PRRunner.run(wildcards.algorithm, params)
 
 # Original pathway reconstruction output to universal output
 # Use PRRunner as a wrapper to call the algorithm-specific parse_output
 rule parse_output:
     input: raw_file = SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'raw-pathway.txt'])
     output: standardized_file = SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'pathway.txt'])
     run:
-        runner.parse_output(wildcards.algorithm, input.raw_file, output.standardized_file)
+        PRRunner.parse_output(wildcards.algorithm, input.raw_file, output.standardized_file)
 
 # Collect summary statistics for a single pathway
 rule summarize_pathway:
@@ -245,7 +247,7 @@ rule summary_table:
         summary_df.to_csv(output.summary_table, sep='\t', index=False)
 
 # Cluster the output pathways for each dataset
-rule ml_analysis:
+rule ml:
     input:
         pathways = expand('{out_dir}{sep}{{dataset}}-{algorithm_params}{sep}pathway.txt', out_dir=out_dir, sep=SEP, algorithm_params=algorithms_with_params)
     output:

config/config_prototype.yaml

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
#
# This list of algorithms should be generated by a script which checks the filesystem for installs.
# It shouldn't be changed by mere mortals. (alternatively, we could add a path to executable for each algorithm
# in the list to reduce the number of assumptions of the program at the cost of making the config a little more involved)
# Each algorithm has an 'include' parameter. By toggling 'include' to true/false the user can change
# which algorithms are run in a given experiment.
#
# algorithm-specific parameters are embedded in lists so that users can specify multiple. If multiple
# parameters are specified then the algorithm will be run as many times as needed to cover all parameter
# combinations. For instance if we have the following:
# - name: "myAlg"
#   params:
#     include: true
#     a: [1,2]
#     b: [0.5,0.75]
#
# then myAlg will be run on (a=1,b=0.5), (a=1,b=0.75), (a=2,b=0.5), and (a=2,b=0.75). Pretty neat, but be
# careful: too many parameters might make your runs take a long time.

algorithms:
  - name: "bowtiebuilder"
    params:
      include: false
      run1:

  - name: "pathlinker"
    params:
      include: true
      run1:
        k: range(100,201,100)

  - name: "responsenet"
    params:
      include: false
      run1:
        y: [20]

  - name: "omicsintegrator1"
    params:
      include: true
      run1:
        r: [5]
        b: [5,6]
        w: np.linspace(0,5,2)
        g: [3]
        d: [10]

  - name: "omicsintegrator2"
    params:
      include: true
      run1:
        b: [4]
        g: [0]
      run2:
        b: [2]
        g: [3]

  - name: "shortestpaths"
    params:
      include: false
      run1:

  - name: "rwr"
    params:
      include: false
      run1:
        a: [0.85]
        t: [0.3]

# Here we specify which pathways to run and other file location information.
# DataLoader.py can currently only load a single dataset
datasets:
  -
    label: data0
    node_files: ["node-prizes.txt", "sources.txt", "targets.txt"]
    # DataLoader.py can currently only load a single edge file, which is the primary network
    edge_files: ["network.txt"]
    # Placeholder
    other_files: []
    # Relative path from the spras directory
    data_dir: "input"
  -
    label: data1
    # Reuse some of the same sources file as 'data0' but different network and targets
    node_files: ["sources.txt", "alternative-targets.txt"]
    edge_files: ["alternative-network.txt"]
    other_files: []
    # Relative path from the spras directory
    data_dir: "input"

# If we want to reconstruct then we should set run to true.
reconstruction_settings:

  # set where everything is saved
  locations:

    # place the path to your pathway annotation files here
    pathway_dir: "/path/to/pathways"

    # place the save path here
    reconstruction_dir: "output"

  run: true

# Do we want to augment our reconstructions
augmentation_settings:

  # save locations
  locations:
    reconstruction_dir: "/path/to/save"

  PRAUG:
    run: true

  ensemble:
    run: false
    rule: "Intersection"

evaluation_settings:

  locations:
    reconstruction_dir: "/path/to/reconstructions"

  PR:
    compute: true

    # the open-world assumption means that being unannotated does not imply being absent from the pathway.
    # it allows for sub-sampling of negatives.
    open_word: true

    # in the case that we use the open-world assumption, we need to tell the program how to choose negatives.
    # The options here are "fixed" or "random". In the case that "fixed" is selected the program will use
    # negative values picked by the authors of this software. If "random" is chosen, the program will sample
    # from unannotated interactions/proteins at a rate of 50 negatives to every positive.
    negative_set: "fixed"

# Plotting
plot_settings:

  locations:
    reconstruction_dir: "/path/to/reconstructions"
    plot_dir: "/path/to/plots"

  plots:
    - name: "example_plot_1"
      type: "PR"
      # where panel determines whether many subfigures are used or just 1.
      # fmax determines whether to display fmax in the legend
      # edge determines whether to plot edge PR or node PR
      params: ["panel=True","fmax=True","edge=True"]
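
The comment at the top of this config describes how giving a parameter multiple values produces one run per combination. A minimal sketch of that expansion using `itertools.product`; the `expand_params` helper is hypothetical and may differ from SPRAS's actual expansion logic:

```python
from itertools import product

# Hypothetical illustration of the parameter expansion described in the config
# comments; SPRAS's real implementation may differ.
def expand_params(param_lists):
    keys = list(param_lists)
    for values in product(*(param_lists[k] for k in keys)):
        yield dict(zip(keys, values))

# Example from the config comment: a: [1,2], b: [0.5,0.75] -> four runs
for combo in expand_params({'a': [1, 2], 'b': [0.5, 0.75]}):
    print(combo)
# {'a': 1, 'b': 0.5}, {'a': 1, 'b': 0.75}, {'a': 2, 'b': 0.5}, {'a': 2, 'b': 0.75}
```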

docker-demo/Dockerfile

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Test activating conda environment before running command inside container
# Uses the strategy from https://pythonspeed.com/articles/activate-conda-dockerfile/
# by Itamar Turner-Trauring
FROM continuumio/miniconda3

COPY env.yml .
RUN conda env create -f env.yml

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "test"]

docker-demo/README.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Docker tests

This subdirectory contains examples of using Docker's Python API https://github.com/docker/docker-py.
It uses the SINGE [example data](https://github.com/gitter-lab/SINGE/tree/master/data1) and [Docker image](https://hub.docker.com/r/agitter/singe) with a reduced set of hyperparameters.
The docker-py API is more readable than the similar [BEELINE Docker command](https://github.com/Murali-group/Beeline/blob/7f6e07a3cb784227bf3fa889fe0c36e731c22c5c/BLRun/singeRunner.py#L110-L116) and most likely also more robust across different operating systems.

## Installation

Install docker-py with the command `pip install docker`.

The Docker client must be installed separately.

## Usage

Before running `docker-demo.py`, start the Docker client and install the `docker` Python package.
Then, from this `docker-demo` directory run the command:
```
python docker-demo.py
```

SINGE will run inside Docker, which takes a few minutes.
The output files will be written to the `output` subdirectory.

If the Docker image `agitter/singe:0.4.1` is not already available locally, the script will automatically pull it from [DockerHub](https://hub.docker.com/r/agitter/singe).

## Activating conda inside a Docker container

By default, an installed conda environment will not be activated inside the Docker container.
Docker does not invoke Bash as a login shell.
[This blog post](https://pythonspeed.com/articles/activate-conda-dockerfile/) provides a workaround demonstrated here in `Dockerfile` and `env.yml`.
It defines a custom ENTRYPOINT that uses `conda run` to run the command inside the conda environment.

To create the Docker image run:
```
docker build -t conda-test/conda-test -f Dockerfile .
```

To confirm that commands are run inside the conda environment run:
```
winpty docker run conda-test/conda-test conda list
winpty docker run conda-test/conda-test python -c "import networkx; print(networkx.__version__)"
```
The `winpty` prefix is only needed on Windows.
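
For reference, a minimal docker-py call along the lines of what a script like `docker-demo.py` does. The image tag comes from this README, but the command, volume mount, and paths below are illustrative assumptions rather than the exact demo script:

```python
import docker

# Minimal docker-py sketch; the image tag is from this README, but the command,
# host path, and bind path are placeholder assumptions.
client = docker.from_env()
logs = client.containers.run(
    image='agitter/singe:0.4.1',
    command=['ls', '/data'],  # placeholder command
    volumes={'/absolute/host/path': {'bind': '/data', 'mode': 'rw'}},
    remove=True,
)
print(logs.decode())
```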
