
Module Usage in Projects

Stephan Reichl edited this page Dec 16, 2024 · 23 revisions

As a concrete example, we will apply the unsupervised_analysis module to the UCI ML hand-written digits dataset, digits, imported from sklearn.

Data

We provide a minimal example of an unsupervised analysis of the UCI ML hand-written digits dataset imported from sklearn:

  • Configuration
    • configuration: config/digits/digits_unsupervised_analysis_config.yaml
    • annotation: config/digits/digits_unsupervised_analysis_annotation.csv
  • Data (automatically generated within the example)
    • dataset (1797 observations, 64 features): data/digits/digits_data.csv
    • metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
  • Results will be generated in the configured results folder results/digits/
  • Performance: on an HPC, a full run split into 92 jobs with up to 32 GB of memory per job completed in less than 7 minutes, excluding conda environment installations.
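
The two data files are generated automatically within the example; how this is done is not shown on this page. Purely as an illustration, the sketch below writes files with the same names and dimensions using `sklearn.datasets.load_digits`. The helper name `write_digits`, its `out_dir` parameter, and the CSV layout (one observation per row, row index as the first column) are assumptions, not the module's actual code.

```python
import os

from sklearn.datasets import load_digits


def write_digits(out_dir="data/digits"):
    """Write the digits features and labels as two CSV files (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)

    # as_frame=True returns pandas objects: a (1797, 64) DataFrame of features
    # and a Series named "target" holding the ground truth labels
    digits = load_digits(as_frame=True)

    data_path = os.path.join(out_dir, "digits_data.csv")
    labels_path = os.path.join(out_dir, "digits_labels.csv")

    # assumed layout: one observation per row, row index as the first column
    digits.data.to_csv(data_path)
    digits.target.to_frame().to_csv(labels_path)

    return data_path, labels_path
```

Calling `write_digits()` from the project root would place both files under data/digits/, matching the paths listed above.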

Code & Configuration

First, we register the configuration file of the unsupervised_analysis module for digits in the project's config/config.yaml, using this specific, predefined structure.

#### Datasets and Workflows to include ###
workflows:
    digits:
        unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.

Second, within the main Snakefile (workflow/Snakefile), we have to do three things:

  • load and parse all configurations into a structured dictionary.
    # load configs for all workflows and datasets
    import yaml

    config_wf = dict()

    for ds in config["workflows"]:
        for wf in config["workflows"][ds]:
            # key format: {dataset_name}_{module}, e.g., digits_unsupervised_analysis
            with open(config["workflows"][ds][wf], 'r') as stream:
                try:
                    config_wf[ds+'_'+wf]=yaml.safe_load(stream)
                except yaml.YAMLError as exc:
                    print(exc)
  • include the analysis Snakefile workflow/rules/digits.smk from the rules subfolder (see the final step).
    ##### load rules (one per dataset) #####
    include: os.path.join("rules", "digits.smk")
  • require all outputs from the used module as inputs to the target rule all.
    #### Target Rule ####
    rule all:
        input:
            #### digits Analysis
            rules.digits_unsupervised_analysis_all.input,
            ...
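
To make the structure of the resulting dictionary concrete, the loading step above can be sketched as a plain Python function outside of Snakemake. This is a minimal sketch, not part of the module: the function name `load_workflow_configs`, the `parse` parameter, and the placeholder file content are assumptions introduced for illustration. The demo parses JSON only to remain free of third-party dependencies; without a `parse` argument the function falls back to `yaml.safe_load`, as in the Snakefile.

```python
import json
import os
import tempfile


def load_workflow_configs(config, parse=None):
    """Map "{dataset}_{module}" to the parsed content of each listed config file."""
    if parse is None:
        # default to PyYAML, as in the Snakefile; imported lazily so that another
        # parser can be passed in without PyYAML being installed
        import yaml
        parse = yaml.safe_load

    config_wf = {}
    for ds, modules in config["workflows"].items():
        for wf, path in modules.items():
            with open(path, "r") as stream:
                config_wf[f"{ds}_{wf}"] = parse(stream)
    return config_wf


# demo with a throwaway file; file name and content are placeholders
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "digits_unsupervised_analysis_config.json")
    with open(path, "w") as fh:
        json.dump({"placeholder_key": "placeholder_value"}, fh)

    project_config = {"workflows": {"digits": {"unsupervised_analysis": path}}}
    loaded = load_workflow_configs(project_config, parse=json.load)

print(sorted(loaded))  # ['digits_unsupervised_analysis']
```

Unlike the Snakefile snippet, this sketch lets parsing errors propagate instead of printing them, so a malformed config file fails at load time rather than later as a missing dictionary key.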

Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk), we load the specified version of the unsupervised_analysis module from a local copy or directly from GitHub, provide it with the previously loaded configuration, and use a prefix for all (*) loaded rules.

# digits Analysis

### digits - Unsupervised Analysis ####
module digits_unsupervised_analysis:
    snakefile:
        #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.0")
    config:
        config_wf["digits_unsupervised_analysis"]

use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*

Tip

Recommended naming scheme:

  • Dataset/project names: always in camelCase (underscores are not recommended), e.g., ATACtreated.
  • Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
  • Module name: {dataset_name}_{module}
  • Prefix for the loaded rules: {dataset_name}_{module}_.
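
The naming scheme, together with the config file pattern recommended earlier, can be summarized in a few lines of Python. This is a minimal sketch for illustration: the helper `module_names` and its dictionary keys are hypothetical, and rejecting underscores outright is stricter than the recommendation above.

```python
def module_names(dataset_name, module):
    """Derive the recommended names for one dataset/module pair (hypothetical helper)."""
    if "_" in dataset_name:
        # assumption: treat the "no _" recommendation as a hard rule in this sketch
        raise ValueError("dataset names should not contain '_' (camelCase recommended)")

    module_name = f"{dataset_name}_{module}"
    prefix = f"{module_name}_"
    return {
        "config_file": f"config/{dataset_name}/{dataset_name}_{module}_config.yaml",
        "rule_file": f"workflow/rules/{dataset_name}.smk",
        "module_name": module_name,
        "rule_prefix": prefix,
        # the module's rule "all" is exposed as "<prefix>all" once all rules
        # are loaded with the prefix
        "target_input": f"rules.{prefix}all.input",
    }


names = module_names("digits", "unsupervised_analysis")
for key, value in names.items():
    print(f"{key}: {value}")
```

For digits, this reproduces the names used throughout this page, including the `rules.digits_unsupervised_analysis_all.input` entry required by the target rule all.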

Results

====================== UNDER CONSTRUCTION ======================

Below we show selected results to illustrate an unsupervised analysis, mirroring the module's features.

Dimensionality Reduction

To visualize high-dimensional data, we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and heatmaps.

Cluster Analysis

For clustering, i.e., grouping data points by similarity in their features, we support Leiden, a graph-based clustering algorithm, applied directly to the UMAP kNN graph.
