
Commit 544ca29
Merge pull request #421 from ntalluri/tutorial
docs: updating tutorial for COMBINE25
2 parents 9073efc + 3b08ebd

12 files changed: +885 −475 lines

docs/_static/config/beginner.yaml
Lines changed: 4 additions & 3 deletions

@@ -26,8 +26,9 @@ algorithms:
     include: true
     run1:
       k: 1
-      # run2: # uncomment for step 3.2
-      # k: [10, 100] # uncomment for step 3.2
+
+      # run2: # uncomment for step 3.2
+      # k: [10, 100] # uncomment for step 3.2

 # Here we specify which pathways to run and other file location information.
 # Assume that if a dataset label does not change, the lists of associated input files do not change
@@ -45,7 +46,7 @@ reconstruction_settings:

 # Set where everything is saved
 locations:
-  reconstruction_dir: "output/basic"
+  reconstruction_dir: "output/beginner"

 analysis:
   # Create one summary per pathway file and a single summary table for all pathways for each dataset
docs/_static/images/pca-kde.png and other binary image files changed
docs/tutorial/advanced.rst
Lines changed: 169 additions & 18 deletions
@@ -1,31 +1,182 @@
+###################################
 Advanced Capabilities and Features
-======================================
+###################################

-More like these are all the things we can do with this, but will not be showing
+Parameter tuning
+================
+Parameter tuning is the process of determining which parameter combinations should be explored for each algorithm on a given dataset.
+It focuses on defining and refining the parameter search space.

-- mention parameter tuning
-- say that parameters are not preset and need to be tuned for each dataset
+Each dataset has unique characteristics, so there are no preset parameter combinations to use.
+Instead, we recommend tuning parameters individually for each new dataset.
+SPRAS provides a flexible framework for defining parameter grids for any algorithm on a given dataset.

-CHTC integration
+Grid Search
+-----------

-Anything not included in the config file
+A grid search systematically checks different combinations of parameter values to see how each affects network reconstruction results.

-1. Global Workflow Control
+In SPRAS, users can define parameter grids for each algorithm directly in the configuration file.
+When executed, SPRAS automatically runs each algorithm across all parameter combinations and collects the resulting subnetworks.
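As an illustration, a grid can be defined per algorithm in the configuration file; this sketch reuses the ``k`` pattern from the tutorial's beginner.yaml, while the algorithm name and comments are assumptions added here:

```yaml
algorithms:
  - name: "pathlinker"   # illustrative algorithm entry
    include: true
    run1:
      k: 1               # a single value: one parameter setting
    run2:
      k: [10, 100]       # a list: one run per value, expanding the grid
```

Under this reading, each ``runN`` block contributes its combinations to the grid, so the sketch above would yield three parameter settings for this algorithm.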

-Sets options that apply to the entire workflow.
+SPRAS will also support parameter refinement using graph topological heuristics.
+These topological metrics help identify parameter regions that produce biologically plausible output networks.
+Based on these heuristics, SPRAS will generate new configuration files with refined parameter grids for each algorithm per dataset.

-- Examples: the container framework (docker, singularity, dsub) and where to pull container images from
+Users can further refine these grids by rerunning the updated configuration and adjusting the parameter ranges around the newly identified regions to fine-tune the most promising algorithm-specific outputs for a given dataset.

-running spras with multiple parameter combinations with multiple algorithms on multiple Datasets
-- for the tutorial we are only doing one dataset
+.. note::

-4. Gold Standards
+   Some grid search features are still under development and will be added in future SPRAS releases.

-Defines the input files SPRAS will use to evaluate output subnetworks
+Parameter selection
+-------------------

-A gold standard dataset is comprised of:
+Parameter selection refers to the process of determining which parameter combinations should be used for evaluation against a gold standard dataset.

-- a label: defines the name of the gold standard dataset
-- node_file or edge_file: a list of either node files or edge files. Only one or the other can exist in a single dataset. At the moment only one edge or one node file can exist in one dataset
-- data_dir: the path to where the input gold standard files live
-- dataset_labels: a list of dataset labels that link each gold standard links to one or more datasets via the dataset labels
+Parameter selection is handled in the evaluation code, which supports multiple parameter selection strategies.
+Once the grid search is complete for each dataset, the user can enable evaluation (by setting evaluation ``include: true``), and it will run all of the parameter selection code.
+
+PCA-based parameter selection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The PCA-based approach identifies a representative parameter setting for each pathway reconstruction algorithm on a given dataset.
+It selects the single parameter combination that best captures the central trend of an algorithm's reconstruction behavior.
+
+.. image:: ../_static/images/pca-kde.png
+   :alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top
+   :width: 600
+   :align: center
+
+.. raw:: html
+
+   <div style="margin:20px 0;"></div>
+
+For each algorithm, all reconstructed subnetworks are projected into an algorithm-specific 2D PCA space based on the set of edges produced by the respective parameter combinations for that algorithm.
+This projection summarizes how the algorithm's outputs vary across different parameter combinations, allowing patterns in the outputs to be visualized in a lower-dimensional space.
+
+Within each PCA space, a kernel density estimate (KDE) is computed over the projected points to identify regions of high density.
+The output closest to the highest KDE peak is selected as the most representative parameter setting, as it corresponds to the region where the algorithm most consistently produces similar subnetworks.
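The projection-and-peak idea can be sketched with plain NumPy. This is an illustrative re-implementation, not SPRAS's actual code: the binary edge matrix representation and the fixed Gaussian KDE bandwidth are assumptions made for the sketch.

```python
import numpy as np

def most_representative(edge_matrix, bandwidth=1.0):
    """Return the index of the output closest to the KDE peak in 2D PCA space.

    edge_matrix: (n_outputs, n_edges) 0/1 matrix; row i marks which candidate
    edges appear in the subnetwork produced by parameter setting i.
    """
    X = edge_matrix - edge_matrix.mean(axis=0)        # center before PCA
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal axes
    pts = X @ vt[:2].T                                # project onto first 2 PCs

    # Gaussian KDE over the projected points, evaluated at each point
    sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-sq_dists / (2 * bandwidth ** 2)).sum(axis=1)

    # The densest projected point approximates the highest KDE peak
    return int(np.argmax(density))
```

With three near-identical outputs and one outlier, the sketch picks one of the clustered outputs, matching the intuition described above.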
+
+Ensemble network-based parameter selection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ensemble-based approach combines results from all parameter settings for each pathway reconstruction algorithm on a given dataset.
+Instead of focusing on a single "best" parameter combination, it summarizes the algorithm's overall reconstruction behavior across parameters.
+
+All reconstructed subnetworks are merged into algorithm-specific ensemble networks, where each edge weight reflects how frequently that interaction appears across the outputs.
+Edges that occur more often are assigned higher weights, highlighting interactions that are most consistently recovered by the algorithm.
+
+These consensus networks help identify the core patterns and overall stability of an algorithm's outputs without needing to choose a single parameter setting, since a clear optimal parameter combination may not exist.
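The frequency-based edge weighting can be illustrated in a few lines of Python. This is a sketch of the idea, not SPRAS's ensemble code; edges are modeled as node-pair tuples for the example.

```python
from collections import Counter

def ensemble_network(subnetworks):
    """Merge per-parameter-setting subnetworks into one weighted network.

    subnetworks: list of edge collections, one per parameter setting.
    Returns {edge: weight}, where weight is the fraction of outputs
    containing that edge.
    """
    counts = Counter(edge for edges in subnetworks for edge in set(edges))
    total = len(subnetworks)
    return {edge: count / total for edge, count in counts.items()}
```

An edge present in every output gets weight 1.0; rarer edges get proportionally lower weights, which is what the consensus network visualizes.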
+
+Ground truth-based evaluation without parameter selection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The no-parameter-selection approach keeps all parameter combinations for each pathway reconstruction algorithm on a given dataset.
+This approach can be useful for identifying patterns in algorithm performance without favoring any specific parameter setting.
+
+Evaluation
+============
+
+In some cases, users may have a gold standard file that allows them to evaluate the quality of the reconstructed subnetworks generated by pathway reconstruction algorithms.
+
+However, gold standards may not exist for certain types of experimental data where validated ground truth interactions or molecules are unavailable or incomplete.
+For example, in emerging research areas or poorly characterized biological systems, interactions may not yet be experimentally verified or fully known, making it difficult to define a reliable reference network for evaluation.
+
+Adding gold standard datasets and evaluation post-analysis to a configuration
+-----------------------------------------------------------------------------
+
+In the configuration file, users can specify one or more gold standard datasets to evaluate the subnetworks reconstructed from each dataset.
+When gold standards are provided and evaluation is enabled (``include: true``), SPRAS will automatically compare the reconstructed subnetworks for a specific dataset against the corresponding gold standards.
+
+.. code-block:: yaml
+
+   gold_standards:
+     -
+       label: gs1
+       node_files: ["gs_nodes0.txt", "gs_nodes1.txt"]
+       data_dir: "input"
+       dataset_labels: ["data0"]
+     -
+       label: gs2
+       edge_files: ["gs_edges0.txt"]
+       data_dir: "input"
+       dataset_labels: ["data0", "data1"]
+
+   analysis:
+     evaluation:
+       include: true
+
+A gold standard dataset must include the following keys and files:
+
+- ``label``: a name that uniquely identifies a gold standard dataset throughout the SPRAS workflow and outputs.
+- ``node_files`` or ``edge_files``: a list of node or edge files. Only one of these can be defined per gold standard dataset.
+- ``data_dir``: the path of the directory where the input gold standard dataset files are located.
+- ``dataset_labels``: a list of dataset labels indicating which datasets this gold standard dataset should be evaluated against.
+
+When evaluation is enabled, SPRAS will automatically run its built-in evaluation analysis on each defined dataset-gold standard pair.
+This evaluation computes metrics such as precision, recall, and precision-recall curves, depending on the parameter selection method used.
+
+For each pathway, evaluation can be run independently of any parameter selection method (the ground truth-based evaluation without parameter selection approach) to directly inspect precision and recall for each reconstructed network from a given dataset.
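Per-pathway precision and recall reduce to a set comparison. A minimal sketch against a gold standard node set (illustrative, not the SPRAS evaluation implementation):

```python
def precision_recall(predicted_nodes, gold_nodes):
    """Precision and recall of a reconstructed network's nodes vs. a gold standard."""
    predicted, gold = set(predicted_nodes), set(gold_nodes)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Each reconstructed network then contributes one (precision, recall) point to the scatter plot below.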
+
+.. image:: ../_static/images/pr-per-pathway-nodes.png
+   :alt: Precision and recall computed for each pathway and visualized on a scatter plot
+   :width: 600
+   :align: center
+
+.. raw:: html
+
+   <div style="margin:20px 0;"></div>
+
+Ensemble-based parameter selection generates precision-recall curves by thresholding on the frequency of edges across an ensemble of reconstructed networks for an algorithm on a given dataset.
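The thresholding step can be sketched as follows, under the assumption that the ensemble is stored as a dict of edge frequencies (an illustration, not SPRAS's actual code):

```python
def pr_curve(edge_frequencies, gold_edges):
    """Precision-recall points from thresholding ensemble edge frequencies.

    edge_frequencies: {edge: fraction of outputs containing the edge}
    gold_edges: the ground-truth edge set
    """
    gold = set(gold_edges)
    points = []
    # Sweep each observed frequency as a threshold, from strict to lenient
    for threshold in sorted(set(edge_frequencies.values()), reverse=True):
        kept = {e for e, f in edge_frequencies.items() if f >= threshold}
        true_positives = len(kept & gold)
        points.append((true_positives / len(kept), true_positives / len(gold)))
    return points
```

Lower thresholds keep more edges, typically trading precision for recall, which traces out the curve shown below.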
+
+.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png
+   :alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve
+   :width: 600
+   :align: center
+
+.. raw:: html
+
+   <div style="margin:20px 0;"></div>
+
+PCA-based parameter selection computes a precision and recall for the single reconstructed network selected using PCA from all reconstructed networks for an algorithm on a given dataset.
+
+.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png
+   :alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot
+   :width: 600
+   :align: center
+
+.. raw:: html
+
+   <div style="margin:20px 0;"></div>
+
+.. note::
+   Evaluation will only execute if the ``ml`` analysis has ``include: true``, because the PCA parameter selection step depends on the PCA ML analysis.
+
+.. note::
+   To see evaluation in action, run SPRAS using the config.yaml or egfr.yaml configuration files.
+
+HTCondor integration
+=====================
+
+Running SPRAS locally can become slow and resource intensive, especially when running many algorithms, parameter combinations, or datasets simultaneously.
+
+To address this, SPRAS integrates with `HTCondor <https://htcondor.org/>`__, a high throughput computing system, allowing Snakemake jobs to be distributed and executed in parallel across the available compute resources.
+
+See :doc:`Running with HTCondor <../htcondor>` for more information on SPRAS's integration with HTCondor.
+
+Ability to run with different container frameworks
+---------------------------------------------------
+
+CHTC uses Apptainer to run containerized software in secure, high-performance environments.
+
+SPRAS accommodates this by allowing users to specify which container framework to use globally within their workflow configuration.
+
+The global workflow control section in the configuration file lets a user set which SPRAS-supported container framework to use:
+
+.. code-block:: yaml
+
+   container_framework: docker
+
+The supported frameworks are Docker, Apptainer/Singularity, and dsub.