Module Usage in Projects
As a concrete example, we apply the unsupervised_analysis module to the UCI ML hand-written digits dataset, digits, imported from sklearn. The minimal example consists of:
- Configuration
  - configuration: config/digits/digits_unsupervised_analysis_config.yaml
  - annotation: config/digits/digits_unsupervised_analysis_annotation.csv
- Data (automatically generated within the example)
  - dataset (1797 observations, 64 features): data/digits/digits_data.csv
  - metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
- Results will be generated in the configured results folder: results/digits/
- Performance: On an HPC, a full run split into 92 jobs with up to 32GB of memory per job completed in less than 7 minutes, excluding conda environment installations.
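The example generates the data files itself, so nothing has to be prepared by hand. For readers who want to see what such input files might look like, here is a minimal sketch that exports the digits dataset from sklearn to two CSV files. The function name `export_digits`, the `sample_name` index column, and the `pixel_{j}` feature names are assumptions made for illustration and may differ from the files the example actually produces.

```python
import csv
import os


def export_digits(data_dir="data/digits"):
    """Write digits features and labels to CSV (file layout is an assumption)."""
    # Imported here so the sketch only needs scikit-learn when it is called.
    from sklearn.datasets import load_digits

    digits = load_digits()
    os.makedirs(data_dir, exist_ok=True)
    n_obs, n_feat = digits.data.shape

    # Feature matrix: one row per observation, one column per pixel.
    with open(os.path.join(data_dir, "digits_data.csv"), "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name"] + [f"pixel_{j}" for j in range(n_feat)])
        for i, row in enumerate(digits.data):
            writer.writerow([f"sample_{i}"] + row.tolist())

    # Metadata: only the ground truth label "target".
    with open(os.path.join(data_dir, "digits_labels.csv"), "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name", "target"])
        for i, label in enumerate(digits.target):
            writer.writerow([f"sample_{i}", int(label)])

    return n_obs, n_feat
```

Calling `export_digits()` writes both files below data/digits/ and returns the shape of the feature matrix.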
First, we specify the configuration file for applying the unsupervised_analysis module to digits, using the following predefined structure within your project's config/config.yaml.
#### Datasets and Workflows to include ###
workflows:
    digits:
        unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"
Tip
Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
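The recommended scheme can be checked mechanically. The following is a minimal sketch, assuming the `workflows` mapping has the dataset-to-module-to-path structure shown above; the helper names `expected_config_path` and `off_scheme_entries` are hypothetical and not part of the module.

```python
def expected_config_path(dataset_name, module):
    """Build the recommended config path for a dataset/module pair."""
    return f"config/{dataset_name}/{dataset_name}_{module}_config.yaml"


def off_scheme_entries(workflows):
    """List (dataset, module, path) entries that deviate from the recommended scheme."""
    return [
        (ds, wf, path)
        for ds, modules in workflows.items()
        for wf, path in modules.items()
        if path != expected_config_path(ds, wf)
    ]


# Same structure as the `workflows` entry in config/config.yaml.
workflows = {
    "digits": {
        "unsupervised_analysis": "config/digits/digits_unsupervised_analysis_config.yaml"
    }
}

print(off_scheme_entries(workflows))  # [] because the path follows the scheme
```

The scheme is a recommendation only, so a non-empty result points at entries worth a second look rather than at errors.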
Second, within the main Snakefile (workflow/Snakefile), we have to do three things:
- load and parse all configurations into a structured dictionary.
    import yaml  # needed for yaml.safe_load below

    # load configs for all workflows and datasets
    config_wf = dict()

    for ds in config["workflows"]:
        for wf in config["workflows"][ds]:
            with open(config["workflows"][ds][wf], 'r') as stream:
                try:
                    # key format: {dataset_name}_{module}
                    config_wf[ds+'_'+wf] = yaml.safe_load(stream)
                except yaml.YAMLError as exc:
                    print(exc)
- include the workflow/rules/digits.smk analysis snakefile from the rule subfolder (see last step).

    ##### load rules (one per dataset) #####
    include: os.path.join("rules", "digits.smk")
- require all outputs from the used module as inputs to the target rule all.

    #### Target Rule ####
    rule all:
        input:
            #### digits Analysis
            rules.digits_unsupervised_analysis_all.input,
            ...
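To make the result of the first step concrete, here is a small sketch of the dictionary that the configuration loading produces: one entry per dataset and module, keyed as {dataset_name}_{module}. It is illustrative only. The function `load_workflow_configs`, the JSON stand-in file, and the `example_key` entry are assumptions; in the Snakefile the parser would be `yaml.safe_load` and the file would be the module's YAML configuration.

```python
import json
import os
import tempfile


def load_workflow_configs(workflows, parse):
    """Return {f"{dataset}_{module}": parsed config} for every configured workflow."""
    config_wf = {}
    for ds, modules in workflows.items():
        for wf, path in modules.items():
            with open(path, "r") as stream:
                config_wf[f"{ds}_{wf}"] = parse(stream)
    return config_wf


# Demo with a throwaway JSON file standing in for the YAML config.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "digits_unsupervised_analysis_config.json")
    with open(path, "w") as fh:
        json.dump({"example_key": "example_value"}, fh)  # hypothetical content

    workflows = {"digits": {"unsupervised_analysis": path}}
    config_wf = load_workflow_configs(workflows, json.load)

print(list(config_wf))  # ['digits_unsupervised_analysis']
```

The key `digits_unsupervised_analysis` is the one the dataset-specific rule file later uses to hand the configuration to the module.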
Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk), we load the specified version of the unsupervised_analysis module from your local copy or directly from GitHub, provide it with the previously loaded configuration, and use a prefix for all (*) loaded rules.
# digits Analysis
### digits - Unsupervised Analysis ####
module digits_unsupervised_analysis:
    snakefile:
        #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.0")
    config:
        config_wf["digits_unsupervised_analysis"]

use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*
Tip
Recommended naming scheme:
- Datasets/projects always in camelCase (no _ recommended), e.g. ATACtreated.
- Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
- Module name: {dataset_name}_{module}.
- Prefix for the loaded rules: {dataset_name}_{module}_.
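The naming scheme above also explains why the target rule refers to `rules.digits_unsupervised_analysis_all`: the module's own rule `all` receives the prefix when it is loaded. A minimal sketch that derives all names from a dataset and module name follows; the helper `snakemake_names` is hypothetical, and the underscore warning only mirrors the recommendation rather than a hard requirement.

```python
import warnings


def snakemake_names(dataset_name, module):
    """Derive rule file, module name, rule prefix, and prefixed target rule."""
    if "_" in dataset_name:
        # The scheme recommends camelCase dataset names without underscores.
        warnings.warn(f"dataset name {dataset_name!r} contains '_'")
    module_name = f"{dataset_name}_{module}"
    return {
        "rule_file": f"./workflow/rules/{dataset_name}.smk",
        "module_name": module_name,
        "rule_prefix": f"{module_name}_",
        # The module's rule `all` after the prefix has been applied.
        "target_rule": f"{module_name}_all",
    }


names = snakemake_names("digits", "unsupervised_analysis")
print(names["target_rule"])  # digits_unsupervised_analysis_all
```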
====================== UNDER CONSTRUCTION ======================
Below we show selected results to illustrate an unsupervised analysis, mirroring the module's features.
To visualize high-dimensional data we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and Heatmaps.
For clustering, i.e., grouping data points by similarity in their features, we support Leiden, a graph-based clustering algorithm, applied directly to the UMAP kNN graph.