###################################
Advanced Capabilities and Features
###################################

Parameter tuning
================

Parameter tuning is the process of determining which parameter combinations should be explored for each algorithm on a given dataset.
It focuses on defining and refining the parameter search space.

Each dataset has unique characteristics, so there are no preset parameter combinations to use.
Instead, we recommend tuning parameters individually for each new dataset.
SPRAS provides a flexible framework for defining parameter grids for any algorithm on a given dataset.

Grid Search
-----------

A grid search systematically checks different combinations of parameter values to see how each affects network reconstruction results.

In SPRAS, users can define parameter grids for each algorithm directly in the configuration file.
When executed, SPRAS automatically runs each algorithm across all parameter combinations and collects the resulting subnetworks.

SPRAS will also support parameter refinement using graph topological heuristics.
These topological metrics help identify parameter regions that produce biologically plausible output networks.
Based on these heuristics, SPRAS will generate new configuration files with refined parameter grids for each algorithm per dataset.

Users can further refine these grids by rerunning the updated configuration and adjusting the parameter ranges around the newly identified regions to find and fine-tune the most promising algorithm-specific outputs for a given dataset.
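
As a sketch of the idea, a parameter grid can be declared per algorithm in the configuration file. The algorithm name and parameter values below are illustrative placeholders; consult the SPRAS configuration reference for the exact schema:

.. code-block:: yaml

    algorithms:
      -
        name: "pathlinker"
        params:
          include: true
          run1:
            k: [10, 50, 100]   # illustrative values; tune per dataset

Listing several values for a parameter defines the grid; SPRAS runs the algorithm once per combination.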

.. note::

   Some grid search features are still under development and will be added in future SPRAS releases.

Parameter selection
-------------------

Parameter selection refers to the process of determining which parameter combinations should be used for evaluation against a gold standard dataset.

Parameter selection is handled in the evaluation code, which supports multiple parameter selection strategies.
Once the grid search is complete for each dataset, the user can enable evaluation (by setting ``include: true`` for the evaluation analysis), and SPRAS will run all of the parameter selection code.

PCA-based parameter selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The PCA-based approach identifies a representative parameter setting for each pathway reconstruction algorithm on a given dataset.
It selects the single parameter combination that best captures the central trend of an algorithm's reconstruction behavior.

.. image:: ../_static/images/pca-kde.png
   :alt: Principal component analysis visualization across pathway outputs with a kernel density estimate computed on top
   :width: 600
   :align: center

.. raw:: html

   <div style="margin:20px 0;"></div>

For each algorithm, all reconstructed subnetworks are projected into an algorithm-specific 2D PCA space based on the set of edges produced by the respective parameter combinations for that algorithm.
This projection summarizes how the algorithm's outputs vary across different parameter combinations, allowing patterns in the outputs to be visualized in a lower-dimensional space.

Within each PCA space, a kernel density estimate (KDE) is computed over the projected points to identify regions of high density.
The output closest to the highest KDE peak is selected as the most representative parameter setting, as it corresponds to the region where the algorithm most consistently produces similar subnetworks.
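
The selection step described above can be sketched in a few lines, assuming each output is encoded as a binary edge-membership vector. This is an illustrative re-implementation of the idea, not SPRAS's actual evaluation code:

```python
# Illustrative sketch of PCA + KDE representative-output selection,
# not SPRAS's actual implementation.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

def most_representative(edge_matrix):
    """Index of the output closest to the highest-density PCA region.

    edge_matrix: one row per parameter combination, one binary column per edge.
    """
    coords = PCA(n_components=2).fit_transform(edge_matrix)  # 2D projection
    density = gaussian_kde(coords.T)(coords.T)  # KDE evaluated at each point
    peak = coords[np.argmax(density)]           # approximate the KDE peak
    return int(np.argmin(np.linalg.norm(coords - peak, axis=1)))

rng = np.random.default_rng(0)
outputs = rng.integers(0, 2, size=(12, 40))  # 12 parameter combos, 40 edges
print(most_representative(outputs))
```

Here the KDE peak is approximated by the densest projected point, so the returned index corresponds to the parameter combination whose output sits in the most crowded region of the PCA space.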

Ensemble network-based parameter selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ensemble-based approach combines results from all parameter settings for each pathway reconstruction algorithm on a given dataset.
Instead of focusing on a single "best" parameter combination, it summarizes the algorithm's overall reconstruction behavior across parameters.

All reconstructed subnetworks are merged into algorithm-specific ensemble networks, where each edge weight reflects how frequently that interaction appears across the outputs.
Edges that occur more often are assigned higher weights, highlighting interactions that are most consistently recovered by the algorithm.

These consensus networks help identify the core patterns and overall stability of an algorithm's outputs without needing to choose a single parameter setting, which is useful when no clear optimal parameter combination exists.
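
A minimal sketch of the edge-frequency idea (illustrative only; SPRAS's own ensemble code may differ in its details):

```python
# Build an ensemble network by counting how often each edge appears
# across the subnetworks produced by different parameter combinations.
# Illustrative sketch, not SPRAS's actual ensemble implementation.
from collections import Counter

def ensemble_network(subnetworks):
    """Map each edge to the fraction of subnetworks that contain it."""
    counts = Counter(edge for edges in subnetworks for edge in set(edges))
    return {edge: n / len(subnetworks) for edge, n in counts.items()}

runs = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("B", "C"), ("C", "D")},
]
weights = ensemble_network(runs)
print(weights[("A", "B")])  # appears in all 3 runs -> weight 1.0
```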

Ground truth-based evaluation without parameter selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The no-parameter-selection approach uses all parameter combinations for each pathway reconstruction algorithm on a given dataset.
This approach can be useful for identifying patterns in algorithm performance without favoring any specific parameter setting.
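
Under this approach, each reconstructed network is scored directly against the gold standard. The standard precision and recall computation can be sketched as follows (illustrative; SPRAS's evaluation code may differ):

```python
# Precision and recall of a reconstructed subnetwork against a gold
# standard, computed over sets of nodes (or edges). Illustrative sketch.
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of nodes or edges."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({"A", "B", "C", "D"}, {"B", "C", "E"})
print(p, r)  # 2 of 4 predictions correct; 2 of 3 gold nodes recovered
```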

Evaluation
==========

In some cases, users may have a gold standard file that allows them to evaluate the quality of the reconstructed subnetworks generated by pathway reconstruction algorithms.

However, gold standards may not exist for certain types of experimental data where validated ground truth interactions or molecules are unavailable or incomplete.
For example, in emerging research areas or poorly characterized biological systems, interactions may not yet be experimentally verified or fully known, making it difficult to define a reliable reference network for evaluation.

Adding gold standard datasets and evaluation post-analysis to a configuration
-----------------------------------------------------------------------------

In the configuration file, users can specify one or more gold standard datasets to evaluate the subnetworks reconstructed from each dataset.
When gold standards are provided and evaluation is enabled (``include: true``), SPRAS will automatically compare the reconstructed subnetworks for a specific dataset against the corresponding gold standards.

.. code-block:: yaml

    gold_standards:
      -
        label: gs1
        node_files: ["gs_nodes0.txt", "gs_nodes1.txt"]
        data_dir: "input"
        dataset_labels: ["data0"]
      -
        label: gs2
        edge_files: ["gs_edges0.txt"]
        data_dir: "input"
        dataset_labels: ["data0", "data1"]

    analysis:
      evaluation:
        include: true

A gold standard dataset must include the following keys and files:

- ``label``: a name that uniquely identifies the gold standard dataset throughout the SPRAS workflow and outputs.
- ``node_files`` or ``edge_files``: a list of node or edge files. Only one of these two keys can be defined per gold standard dataset.
- ``data_dir``: the path to the directory where the input gold standard files are located.
- ``dataset_labels``: a list of dataset labels indicating which datasets this gold standard should be evaluated against.

When evaluation is enabled, SPRAS will automatically run its built-in evaluation analysis on each defined dataset-gold standard pair.
This evaluation computes metrics such as precision, recall, and precision-recall curves, depending on the parameter selection method used.

Evaluation can also be run independently of any parameter selection method (the ground truth-based evaluation without parameter selection described above) to directly inspect precision and recall for each reconstructed network from a given dataset.

.. image:: ../_static/images/pr-per-pathway-nodes.png
   :alt: Precision and recall computed for each pathway and visualized on a scatter plot
   :width: 600
   :align: center

.. raw:: html

   <div style="margin:20px 0;"></div>

Ensemble-based parameter selection generates precision-recall curves by thresholding on the frequency of edges across an ensemble of reconstructed networks for an algorithm on a given dataset.
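
The thresholding step can be sketched as follows, assuming ensemble edge frequencies in [0, 1] and a gold standard edge set (illustrative, not SPRAS's exact evaluation code):

```python
# Sweep a frequency threshold over an ensemble network to produce
# precision-recall points. Illustrative sketch only.
def pr_curve(edge_freq, gold_edges):
    """edge_freq: {edge: ensemble frequency}; returns [(precision, recall)]."""
    points = []
    for threshold in sorted(set(edge_freq.values()), reverse=True):
        kept = {e for e, f in edge_freq.items() if f >= threshold}
        tp = len(kept & gold_edges)
        points.append((tp / len(kept), tp / len(gold_edges)))
    return points

freqs = {("A", "B"): 1.0, ("B", "C"): 0.5, ("C", "D"): 0.25}
gold = {("A", "B"), ("C", "D")}
print(pr_curve(freqs, gold))
```

Lowering the threshold admits more edges, which tends to raise recall while lowering precision, tracing out the curve.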

.. image:: ../_static/images/pr-curve-ensemble-nodes-per-algorithm-nodes.png
   :alt: Precision-recall curve computed for a single ensemble file / pathway and visualized as a curve
   :width: 600
   :align: center

.. raw:: html

   <div style="margin:20px 0;"></div>

PCA-based parameter selection computes precision and recall for the single reconstructed network selected using PCA from all reconstructed networks for an algorithm on a given dataset.

.. image:: ../_static/images/pr-pca-chosen-pathway-per-algorithm-nodes.png
   :alt: Precision and recall computed for each pathway chosen by the PCA-selection method and visualized on a scatter plot
   :width: 600
   :align: center

.. raw:: html

   <div style="margin:20px 0;"></div>

.. note::

   Evaluation will only execute if the ``ml`` analysis also has ``include: true``, because the PCA parameter selection step depends on the PCA ML analysis.

.. note::

   To see evaluation in action, run SPRAS using the ``config.yaml`` or ``egfr.yaml`` configuration files.

HTCondor integration
====================

Running SPRAS locally can become slow and resource intensive, especially when running many algorithms, parameter combinations, or datasets simultaneously.

To address this, SPRAS integrates with `HTCondor <https://htcondor.org/>`__, a high throughput computing system, allowing Snakemake jobs to be distributed and executed in parallel across available compute resources.

See :doc:`Running with HTCondor <../htcondor>` for more information on SPRAS's HTCondor integration.

Ability to run with different container frameworks
--------------------------------------------------

CHTC uses Apptainer to run containerized software in secure, high-performance environments.

SPRAS accommodates this by allowing users to specify which container framework to use globally within their workflow configuration.

The global workflow control section of the configuration file sets which SPRAS-supported container framework to use:

.. code-block:: yaml

    container_framework: docker

Supported frameworks include Docker, Apptainer/Singularity, and dsub.