CanSRMaPP is a modeling tool for identifying a minimal feature set describing the metagenome of a cancer cohort.
- Free software: BSD license
- Source code: https://github.com/idekerlab/cansrmapp
- PyTorch 2.5+ with torchaudio, torchvision (tested on 2.5.0)
- tables
- matplotlib
- numpy
- pandas
- scikit-learn
- scikit-image
- scipy
- Python 3.11+
- CUDA 12.1 if using a GPU. Download the appropriate CUDA toolkit for your system here: <https://developer.nvidia.com/cuda-12-1-1-download-archive>
- Note
CUDA is only required for implementations using NVIDIA GPUs; feel free to ignore otherwise.
The root CanSRMaPP module automatically detects whether CUDA is set up; cmbuilder and in particular cmsolver will configure themselves to use the GPU if available.
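The detection described above can be sketched as follows. This is a minimal illustration, assuming the check relies on PyTorch's `torch.cuda.is_available()`; the function name `pick_device` is hypothetical and is not part of the CanSRMaPP API.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch still runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()  # hypothetical helper, not the package's own check
print(f"Using device: {device}")
```

Running this inside the cansrmapp environment is a quick way to confirm whether the GPU is visible before starting a long solve.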
conda create -n cansrmapp python=3.11 -y
conda activate cansrmapp
Building and installing cansrmapp package
git clone https://github.com/idekerlab/cansrmapp
cd cansrmapp
pip install -r requirements_dev.txt
make dist
pip install dist/cansrmapp*whl
Scripts for fitting CanSRMaPP models are provided in demo/. A simple test invocation (<5 minutes) is:
cd demo
./build.sh
./test-solve.sh
./polish.sh
- build.sh: creates the CanSRMaPP input matrices in demo/nest (where nest is the model name).
- test-solve.sh: finds the maximum-posterior solution for the input matrices. To keep runtime low and ease debugging, some parameters in test-solve.sh are set such that the solver may not converge on an optimal solution; those in full-solve.sh are set to produce an optimal solution.
- polish.sh: puts the results in a more interpretable format; work will continue on improving presentation. The key files are stored in demo/summary:
  - feature_summary.csv: contains the maximum a posteriori (MAP) estimate of each input feature along with that feature's type (gene, signature, or genomic background) and its name.
  - selected_events_boolean.csv: contains true/false values for a simple selection test on each alteration type (column) and each gene (row).
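As a rough illustration of working with selected_events_boolean.csv, the sketch below counts how many genes pass the selection test for each alteration type. It assumes that the first column holds the gene identifier and that values are written as the text "True"/"False"; the column names and rows in the example are made up, and `count_selected` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_selected(handle) -> Counter:
    """Count, per alteration-type column, how many gene rows are flagged true."""
    reader = csv.reader(handle)
    header = next(reader)
    alteration_types = header[1:]  # assumes column 0 is the gene identifier
    counts = Counter({name: 0 for name in alteration_types})
    for row in reader:
        for name, value in zip(alteration_types, row[1:]):
            if value.strip().lower() == "true":
                counts[name] += 1
    return counts


# Made-up rows for illustration only; the real file lives in demo/summary.
example = io.StringIO(
    "gene,mutation,amplification,deletion\n"
    "GENE_A,True,False,False\n"
    "GENE_B,False,True,False\n"
    "GENE_C,True,True,False\n"
)
counts = count_selected(example)
print(dict(counts))
```

To apply it to a real run, pass a file opened with `open("demo/summary/selected_events_boolean.csv", newline="")` instead of the in-memory example, after checking that the assumptions about the layout hold.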
To reproduce the core CanSRMaPP workflow (~30 minutes):
cd demo
./build.sh
./full-solve.sh
./polish.sh
./validate.sh
Output for the final command should resemble:
Feature weight agreement with publication (pearson)
PearsonRResult(statistic=0.9999972289807557, pvalue=0.0)
Feature identification agreement with publication (jaccard,differences)
Local run | Publication
-------------------------------------------------------------
only | common | only
0 | 90 | 0
============================================================
Detected GPU.
TCGA-LUAD [training] frequency agreement (pearson) :
PearsonRResult(statistic=0.9750266, pvalue=0.0)
TCGA-CPTAC [evaluation] frequency agreement (pearson) :
PearsonRResult(statistic=0.89953285, pvalue=0.0)
This indicates that the 90 CanSRMaPP features are those recovered by the authors, and that their deviation from the authors' values is less than one part in 10^5.
- Note
Anecdotally, you can expect a single cycle of cmsolver to take about 70 seconds on a GPU and up to 20 minutes when parallelized over multiple CPUs; GPU runtime may be slower on WSL.
test-solve.sh runs for one cycle, while full-solve.sh runs for twenty.
Parallelization is handled largely by the numpy, scipy, and pytorch backends, so if you wish to limit it, follow the procedures for setting environment variables that are relevant to those modules.
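One common way to cap CPU threads is sketched below. Which variables take effect depends on how numpy, scipy, and PyTorch were built, so treat the list as an assumption to check against your installation; the value of `MAX_THREADS` is an arbitrary placeholder.

```python
import os

MAX_THREADS = "4"  # hypothetical cap; choose a value suited to your machine

# These variables are read by the OpenMP, MKL, and OpenBLAS backends that
# numpy, scipy, and PyTorch may use; set them before importing those packages.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = MAX_THREADS

try:
    import torch

    torch.set_num_threads(int(MAX_THREADS))  # caps PyTorch intra-op CPU threads
except ImportError:
    pass  # PyTorch not installed; the environment variables still apply
```

Exporting the same variables in the shell before launching the demo scripts is the usual alternative when you do not want to edit any Python code.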
CanSRMaPP relies on a number of third-party files for reference and reconciling multiple data sources. This document describes the provenance of all such files, and hosts frozen copies since some may be updated in-place by the maintainers.
Cancer genomic data was downloaded from the Genomic Data Commons on
February 2, 2024. Because this data is subject to varying degrees of
controlled access, it cannot be redistributed here in its original form.
Binarized alteration states and signature activities, which constitute
a de-identified data derivative under the NIH universal Data Use Certification,
are hosted here and on Zenodo. Gene-level alteration states for the
TCGA LUAD cohort are located in data/omics_tcga_luad.csv.gz;
for the CPTAC LUAD cohort, data/omics_cptac_luad.csv.gz.
Signature activities for the TCGA LUAD cohort are in data/signatures_tcga_luad.csv.gz.
Homo_sapiens.gene_info was downloaded from
https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz on
November 3, 2024. This file is unrestricted as described here.
GCF_000001405.40_GRCh38.p14_genomic.gff.gz was downloaded from this FTP directory on November 12, 2024.
This file is unrestricted according to these terms.
The reduced file gff_reduced.gff.gz was derived from it by running the command:
gunzip -c GCF_000001405.40_GRCh38.p14_genomic.gff.gz | awk -F'\t' '$0 !~ /^#/ && $3 == "gene" && $9 ~ /GeneID/' | gzip -c > gff_reduced.gff.gz
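For readers without awk, the same filter can be approximated in Python. This sketch assumes tab-delimited columns, as in the GFF3 format, and keeps lines that are not comments, whose third column is "gene", and whose ninth column mentions GeneID; the function names are hypothetical.

```python
import gzip


def keep_gene_line(line: str) -> bool:
    """Mirror the filter: not a comment, feature type "gene", GeneID in attributes."""
    if line.startswith("#"):
        return False
    fields = line.rstrip("\n").split("\t")
    return len(fields) >= 9 and fields[2] == "gene" and "GeneID" in fields[8]


def reduce_gff(src: str, dst: str) -> int:
    """Copy matching lines from a gzipped GFF to a new gzipped file; return count kept."""
    kept = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if keep_gene_line(line):
                fout.write(line)
                kept += 1
    return kept
```

Calling `reduce_gff("GCF_000001405.40_GRCh38.p14_genomic.gff.gz", "gff_reduced.gff.gz")` would be the counterpart of the shell pipeline, though the compressed bytes may differ from those produced by gzip on the command line.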
"NeSTv0" is a precursor of the interaction map found in
Zheng, Kelly, et al., 2021, prior to filtering for mutation-enriched systems.
It is distributed here as nest.pickle with permission from the authors, and is
subject to the license governing this repository. The file contains a dict object
mapping each system to a set of member gene Entrez IDs. Because systems in this
file are named Clusterx-y, an additional file, NeST_map_1.5_default_node_Nov20.csv,
is incorporated to map these to their NEST IDs as published.
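The structure described above can be explored with a few lines of Python. The sketch below uses a made-up in-memory stand-in for nest.pickle and a made-up name mapping; the helper names, the placeholder gene IDs, and the target label are all hypothetical, and the column layout of NeST_map_1.5_default_node_Nov20.csv is not assumed here.

```python
import io
import pickle


def load_systems(handle) -> dict:
    """Load a pickled {system name: set of Entrez gene IDs} mapping."""
    systems = pickle.load(handle)
    if not isinstance(systems, dict):
        raise TypeError("expected a dict mapping systems to gene sets")
    return systems


def rename_systems(systems: dict, name_map: dict) -> dict:
    """Relabel systems via name_map, keeping the original name when unmapped."""
    return {name_map.get(name, name): genes for name, genes in systems.items()}


# Made-up stand-in for nest.pickle and for the cluster-to-published-ID mapping.
fake_pickle = io.BytesIO(
    pickle.dumps({"Cluster1-1": {"101", "102"}, "Cluster2-3": {"102"}})
)
systems = load_systems(fake_pickle)
renamed = rename_systems(systems, {"Cluster1-1": "NEST:1"})
print(renamed)
```

With the real file, pass a handle from `open("nest.pickle", "rb")`; as with any pickle, load it only from a source you trust.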
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.