Module Usage in Projects

As a concrete example, we will apply the unsupervised_analysis module to the UCI ML hand-written digits dataset, digits, imported from sklearn.

Data

We provide a minimal example of an unsupervised analysis of the UCI ML hand-written digits dataset imported from sklearn:

- Configuration
    - configuration: config/digits/digits_unsupervised_analysis_config.yaml
    - annotation: config/digits/digits_unsupervised_analysis_annotation.csv
- Data (automatically generated within the example)
    - dataset (1797 observations, 64 features): data/digits/digits_data.csv
    - metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
- Results: generated in the configured results folder results/digits/
- Performance: on an HPC, a full run split into 92 jobs with up to 32GB of memory per job completed in less than 7 minutes, excluding conda environment installations.

Code & Configuration

First, we register the configuration file of the unsupervised_analysis module for digits in your project's config/config.yaml, using this predefined structure:

#### Datasets and Workflows to include ###
workflows:
    digits:
        unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.

Second, within the main Snakefile (workflow/Snakefile) we have to do three things:

load and parse all configurations into a structured dictionary.

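# load configs for all workflows and datasets
import yaml  # needed for yaml.safe_load below

config_wf = dict()

for ds in config["workflows"]:
    for wf in config["workflows"][ds]:
        # each entry maps a module name to the path of its config file
        with open(config["workflows"][ds][wf], 'r') as stream:
            try:
                # key format: {dataset_name}_{module}, e.g. digits_unsupervised_analysis
                config_wf[ds+'_'+wf]=yaml.safe_load(stream)
            except yaml.YAMLError as exc:
                print(exc)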

include the analysis snakefile workflow/rules/digits.smk from the rules subfolder (see last step).
```
##### load rules (one per dataset) #####
include: os.path.join("rules", "digits.smk")
```

require all outputs from the used module as inputs to the target rule all.

#### Target Rule ####
rule all:
    input:
        #### digits Analysis
        rules.digits_unsupervised_analysis_all.input,
        ...
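
The first of the three steps above (collecting one parsed configuration per dataset and module) can be sketched in plain Python as follows. This is only an illustration under assumptions: the function name `load_workflow_configs` and the stand-in `fake_loader` are hypothetical and not part of the module; in a real Snakefile the loader would open the file and call `yaml.safe_load`. Unlike the snippet above, this sketch lets loader errors propagate instead of printing them.

```python
def load_workflow_configs(workflows, loader):
    """Return {"<dataset>_<module>": parsed_config} for every configured module.

    `workflows` mirrors the `workflows:` section of config/config.yaml.
    `loader` is any callable that turns a config path into a parsed object.
    """
    config_wf = {}
    for ds, modules in workflows.items():
        for wf, path in modules.items():
            # Errors raised by the loader are not caught here, so a broken
            # config file stops the run before any rule is evaluated.
            config_wf[f"{ds}_{wf}"] = loader(path)
    return config_wf


def fake_loader(path):
    # Hypothetical stand-in for reading and parsing a YAML file.
    return {"loaded_from": path}


workflows = {
    "digits": {
        "unsupervised_analysis": "config/digits/digits_unsupervised_analysis_config.yaml"
    }
}
config_wf = load_workflow_configs(workflows, fake_loader)
print(sorted(config_wf))
```

The resulting key, `digits_unsupervised_analysis`, is the one used to look up the configuration when the module is loaded in the dataset-specific rule file.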

Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk) we load the specified version of the unsupervised_analysis module from your local copy or directly from GitHub, provide it with the previously loaded configuration and use a prefix for all (*) loaded rules.

# digits Analysis

### digits - Unsupervised Analysis ####
module digits_unsupervised_analysis:
    snakefile:
        #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.0")
    config:
        config_wf["digits_unsupervised_analysis"]

use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*
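
To make the effect of the `use rule * ... as digits_unsupervised_analysis_*` statement more tangible, here is a small plain-Python sketch of the renaming it implies: the `*` in the pattern stands for the original rule name. The helper `apply_rule_prefix` is hypothetical, and `example_rule` is a made-up rule name; only `all` is confirmed by the target rule shown earlier.

```python
def apply_rule_prefix(rule_names, pattern):
    """Map each original rule name to its name after `use rule * ... as <pattern>`."""
    if pattern.count("*") != 1:
        raise ValueError("pattern must contain exactly one '*'")
    # The wildcard is replaced by the original rule name.
    return {name: pattern.replace("*", name) for name in rule_names}


# "example_rule" is a placeholder, not an actual rule of the module.
renamed = apply_rule_prefix(["all", "example_rule"], "digits_unsupervised_analysis_*")
print(renamed["all"])
```

This is why the target rule in workflow/Snakefile refers to `rules.digits_unsupervised_analysis_all.input`: the module's own `all` rule is available under the prefixed name.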

Tip

Recommended naming scheme:

- Dataset/project names: always in camelCase (underscores are not recommended), e.g. ATACtreated.
- Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
- Module name: {dataset_name}_{module}.
- Prefix for the loaded rules: {dataset_name}_{module}_.
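
The naming recommendations on this page can be collected in one small helper, sketched below. The function `naming_scheme` and its dictionary keys are hypothetical. Rejecting underscores outright is an assumption of this sketch that is stricter than the recommendation above.

```python
def naming_scheme(dataset_name, module):
    """Derive the recommended file and rule names for one dataset/module pair."""
    if "_" in dataset_name:
        # Assumption: treat the "no underscore" recommendation as a hard rule,
        # since "_" also separates dataset and module in the derived names.
        raise ValueError(f"dataset name should not contain '_': {dataset_name!r}")
    return {
        "config_file": f"config/{dataset_name}/{dataset_name}_{module}_config.yaml",
        "rule_file": f"workflow/rules/{dataset_name}.smk",
        "module_name": f"{dataset_name}_{module}",
        "rule_prefix": f"{dataset_name}_{module}_",
    }


names = naming_scheme("digits", "unsupervised_analysis")
for key, value in names.items():
    print(f"{key}: {value}")
```

For digits this reproduces the paths and names used throughout the example, so the same call can serve as a quick consistency check when adding another dataset.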

Results

====================== UNDER CONSTRUCTION ======================

Below we show selected results to illustrate an unsupervised analysis, mirroring the module's features.

Dimensionality Reduction

To visualize high-dimensional data, we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and heatmaps.

Cluster Analysis

For clustering, i.e., grouping data points by similarity in their features, we support Leiden, a graph-based clustering algorithm, applied directly to the UMAP knn-graph.
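
To give an intuition for what a knn-graph is, the following toy sketch builds an unweighted k-nearest-neighbour graph with the Python standard library. It is not the module's implementation: UMAP constructs a weighted graph with its own procedure, and Leiden itself is not shown here. The function `knn_graph` and the six 2D points are made up for illustration.

```python
import math


def knn_graph(points, k):
    """Return {i: [indices of the k nearest other points]} by Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, closest first
        # (ties are broken by the smaller index).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Two well-separated groups of three points each (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = knn_graph(points, k=2)
for node, neighbours in graph.items():
    print(node, neighbours)
```

In this toy graph every point is connected only to points of its own group. A graph-based clustering algorithm looks for such densely connected groups of nodes.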
