cansrmapp

CanSRMaPP is a modeling tool for identifying a minimal feature set describing the metagenome of a cancer cohort.

Dependencies

Compatibility

Note

CUDA is required only when running on NVIDIA GPUs; otherwise it can be ignored.

The root CanSRMaPP module automatically detects whether CUDA is set up; cmbuilder and in particular cmsolver will configure themselves to use the GPU if available.
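As an illustration of this kind of detection, the sketch below picks a device name using PyTorch's `torch.cuda.is_available()`. It is not taken from the CanSRMaPP source; the helper name `pick_device` is hypothetical, and the package's own logic may differ.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu".

    Hypothetical helper for illustration; not part of CanSRMaPP.
    """
    # Fall back to the CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

Running this before a long solve is a quick way to confirm which device your environment exposes.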

Installation

Anaconda environment

conda create -n cansrmapp python=3.11 -y
conda activate cansrmapp

Building and installing cansrmapp package

git clone https://github.com/idekerlab/cansrmapp
cd cansrmapp
pip install -r requirements_dev.txt
make dist
pip install dist/cansrmapp*whl

Usage

Basic usage / code test

Scripts for fitting CanSRMaPP models are provided in demo/. A simple test invocation (under 5 minutes) is:

cd demo
./build.sh
./test-solve.sh
./polish.sh

build.sh
Creates the CanSRMaPP input matrices in demo/nest (where nest is the model name).
test-solve.sh
Finds the maximum-posterior solution for the input matrices. To keep runtime low and simplify debugging, some parameters in test-solve.sh are set such that the solver may not converge on an optimal solution; those in full-solve.sh are set to produce an optimal solution.
polish.sh
Puts the results in a more interpretable format; work on improving the presentation is ongoing. The key files are stored in demo/summary:

feature_summary.csv
contains the Maximum a Posteriori (MAP) estimate of each input feature along with that feature's type (gene, signature, or genomic background), and its name.
selected_events_boolean.csv
contains true/false values for a simple selection test on each alteration type (column) and each gene (row).
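A minimal sketch of how a gene-by-alteration-type boolean table could be read follows. The layout is assumed: the column names (`gene`, `mutation`, `amplification`, `deletion`), the gene labels, and the `True`/`False` spelling are all made up for illustration, so check the real header of selected_events_boolean.csv before adapting it.

```python
import csv
import io

# Made-up stand-in for selected_events_boolean.csv (assumed layout:
# first column is the gene, remaining columns are alteration types).
toy_csv = """gene,mutation,amplification,deletion
GENE_A,True,False,False
GENE_B,False,False,False
GENE_C,False,True,True
"""


def selected_events(handle):
    """Map each gene to the alteration types flagged true for it."""
    reader = csv.reader(handle)
    header = next(reader)
    alteration_types = header[1:]
    selected = {}
    for row in reader:
        gene, flags = row[0], row[1:]
        hits = [
            alteration
            for alteration, flag in zip(alteration_types, flags)
            if flag.strip().lower() == "true"
        ]
        if hits:  # skip genes with no selected event
            selected[gene] = hits
    return selected


print(selected_events(io.StringIO(toy_csv)))
```

With the real file, pass an open file handle instead of the `io.StringIO` wrapper.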

To reproduce the core CanSRMaPP workflow (~30 minutes):

cd demo
./build.sh
./full-solve.sh
./polish.sh
./validate.sh

Output from the final command should resemble:

Feature weight agreement with publication (pearson)
PearsonRResult(statistic=0.9999972289807557, pvalue=0.0)
Feature identification agreement with publication (jaccard,differences)
          Local run           |         Publication
-------------------------------------------------------------
       only        |       common       |       only
         0         |         90         |         0

============================================================
Detected GPU.
TCGA-LUAD [training] frequency agreement (pearson) :
PearsonRResult(statistic=0.9750266, pvalue=0.0)
TCGA-CPTAC [evaluation] frequency agreement (pearson) :
PearsonRResult(statistic=0.89953285, pvalue=0.0)

This indicates that the 90 CanSRMaPP features are those recovered by the authors, and that their deviation from the authors' values is less than one part in 10^5.
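The two agreement figures in the sample output can be checked by hand. The sketch below takes the counts and the Pearson statistic from that output; reading "deviation" as the shortfall of the correlation from 1 is an assumption made here for illustration.

```python
# Counts from the "only | common | only" table in the sample output.
only_local, common, only_publication = 0, 90, 0

# Jaccard index: shared features divided by the union of both feature sets.
jaccard = common / (only_local + common + only_publication)

# Pearson statistic for feature weights, copied from the sample output.
pearson_r = 0.9999972289807557
shortfall = 1.0 - pearson_r  # assumed meaning of "deviation"

print(f"Jaccard index: {jaccard:.1f}")
print(f"1 - r: {shortfall:.2e}")
print("below one part in 10^5:", shortfall < 1e-5)
```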

Note

Anecdotally, you can expect a single cycle of cmsolver to take about 70 seconds on a GPU and up to 20 minutes when parallelized over multiple CPUs; GPU runtime may be slower on WSL. test-solve.sh runs for one cycle, while full-solve.sh runs for twenty.

Parallelization is largely handled by the backends of NumPy, SciPy, and PyTorch. To limit it, set the environment variables documented for those libraries.
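As a sketch of what limiting parallelization can look like, the snippet below caps the thread counts read by common numerical backends. The cap of 4 is an arbitrary example value, and which variable takes effect depends on how your NumPy, SciPy, and PyTorch builds were compiled (OpenMP, MKL, or OpenBLAS), so treat the list as a starting point rather than a complete recipe.

```python
import os

MAX_THREADS = "4"  # arbitrary example cap, not a recommended value

# These must be set before numpy, scipy, or torch are first imported,
# because the backends read them when they initialize.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = MAX_THREADS

print({var: os.environ[var] for var in
       ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")})
```

Because the demo scripts are shell scripts, exporting the same variables in the shell before running them should have the equivalent effect. PyTorch also offers `torch.set_num_threads` for its own intra-op thread pool.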

Redistributed data sources

CanSRMaPP relies on a number of third-party files for reference and for reconciling multiple data sources. This document describes the provenance of all such files and hosts frozen copies, since some may be updated in place by their maintainers.

Cancer Genomic Data

Cancer genomic data was downloaded from the Genomic Data Commons on February 2, 2024. Because this data is subject to varying degrees of controlled access, it cannot be redistributed here in its original form. Binarized alteration states and signature activities, which constitute a de-identified data derivative under the NIH universal Data Use Certification, are hosted here and on Zenodo. Gene-level alteration states for the TCGA LUAD cohort are located in data/omics_tcga_luad.csv.gz; those for the CPTAC LUAD cohort are in data/omics_cptac_luad.csv.gz. Signature activities for the TCGA LUAD cohort are in data/signatures_tcga_luad.csv.gz.

Gene Info

Homo_sapiens.gene_info was downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz on November 3, 2024. This file is unrestricted as described here.

GRCh38 genomic annotation

GCF_000001405.40_GRCh38.p14_genomic.gff.gz was downloaded from this FTP directory on November 12, 2024. This file is unrestricted according to these terms. The reduced file gff_reduced.gff.gz, derived from it, is the result of running the command:

gunzip -c GCF_000001405.40_GRCh38.p14_genomic.gff.gz | awk -F'\t' '$0 !~ /^#/ && $3 == "gene" && $9 ~ /GeneID/' | gzip -c > gff_reduced.gff.gz
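For readers without awk, the same filter can be sketched in Python: drop comment lines, then keep tab-separated rows whose third column is `gene` and whose ninth column mentions `GeneID`. The function names are hypothetical and the sample lines are made up to exercise the filter; this is an illustrative equivalent, not the script used to build the frozen copy.

```python
import gzip


def keep_gene_row(line: str) -> bool:
    """Mirror the awk filter: non-comment, type 'gene', GeneID in attributes."""
    if line.startswith("#"):
        return False
    fields = line.rstrip("\n").split("\t")
    return len(fields) >= 9 and fields[2] == "gene" and "GeneID" in fields[8]


def reduce_gff(src: str, dst: str) -> None:
    """Stream a gzipped GFF file and write only the rows that pass the filter."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if keep_gene_row(line):
                fout.write(line)


# Made-up sample lines (not real annotation records).
samples = [
    "##gff-version 3\n",
    "chrA\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene-X;Dbxref=GeneID:1\n",
    "chrA\tsrc\texon\t1\t50\t.\t+\t.\tID=exon-X;Dbxref=GeneID:1\n",
    "chrA\tsrc\tgene\t200\t300\t.\t-\t.\tID=gene-Y\n",
]
print([keep_gene_row(s) for s in samples])
```

With the downloaded file in the working directory, `reduce_gff("GCF_000001405.40_GRCh38.p14_genomic.gff.gz", "gff_reduced.gff.gz")` would apply the filter end to end.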

NeSTv0

"NeSTv0" is a precursor of the interaction map found in Zheng, Kelly, et al., 2021, prior to filtering for mutation-enriched systems. It is distributed here as nest.pickle with permission from the authors, and is subject to the license governing this repository. The file contains a dict object mapping each system to a set of member gene Entrez IDs. Because systems in this file are named Clusterx-y, an additional file, NeST_map_1.5_default_node_Nov20.csv, is incorporated to map these to their NEST IDs as published.

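A minimal sketch of working with a dict of this shape follows. It uses a made-up in-memory stand-in rather than the real nest.pickle: the system names and gene IDs are invented, and whether the real file stores Entrez IDs as integers or strings is an assumption to verify.

```python
import pickle

# Made-up stand-in with the described shape: system name -> set of gene IDs.
toy_nest = {
    "Cluster1-1": {7157, 4193},
    "Cluster2-3": {672, 675, 7157},
}
payload = pickle.dumps(toy_nest)

# With the real file this would be:
#   with open("nest.pickle", "rb") as fh:
#       systems = pickle.load(fh)
systems = pickle.loads(payload)

# Invert the mapping to look up which systems contain a given gene.
gene_to_systems = {}
for system, genes in systems.items():
    for gene in genes:
        gene_to_systems.setdefault(gene, set()).add(system)

print(sorted(gene_to_systems[7157]))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.
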
Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
