cansrmapp

CanSRMaPP is a modeling tool for identifying a minimal feature set describing the metagenome of a cancer cohort.

Free software: BSD license
Source code: https://github.com/idekerlab/cansrmapp

Dependencies

PyTorch 2.5+ with torchaudio and torchvision (tested on 2.5.0)
tables
matplotlib
numpy
pandas
scikit-learn
scikit-image
scipy

Compatibility

Python 3.11+
CUDA 12.1 if using a GPU. Download the appropriate CUDA toolkit for your system from <https://developer.nvidia.com/cuda-12-1-1-download-archive>.

Note

CUDA is only required for implementations using NVIDIA GPUs; feel free to ignore otherwise.

The root CanSRMaPP module automatically detects whether CUDA is set up; cmbuilder and in particular cmsolver will configure themselves to use the GPU if available.
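The device check described above can be sketched with PyTorch's public API. This is an illustration only, not CanSRMaPP's actual detection code: pick_device is a hypothetical helper, and falling back to the CPU when PyTorch is not importable is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"  # assumption: no PyTorch means CPU-only
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Tensors would be placed on: {device}")
```

Running this inside the cansrmapp conda environment is a quick way to confirm whether the GPU is visible before starting a long solve.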

Installation

Anaconda environment

conda create -n cansrmapp python=3.11 -y
conda activate cansrmapp

Building and installing cansrmapp package

git clone https://github.com/idekerlab/cansrmapp
cd cansrmapp
pip install -r requirements_dev.txt
make dist
pip install dist/cansrmapp*whl

Usage

Basic usage / code test

Scripts for fitting CanSRMaPP models are provided in demo/. A simple test invocation (<5 minutes) is:

cd demo
./build.sh
./test-solve.sh
./polish.sh

build.sh

Creates the CanSRMaPP input matrices in demo/nest (where nest is the model name).

test-solve.sh

Finds the maximum-posterior solution for the input matrices. To keep runtime low and ease debugging, some parameters in test-solve.sh are set such that the solver may not converge on an optimal solution; those in full-solve.sh are set to produce an optimal solution.

polish.sh

Puts the results in a more interpretable format; work will continue on improving presentation. The key files are stored in demo/summary:

feature_summary.csv: contains the Maximum a Posteriori (MAP) estimate of each input feature along with that feature's type (gene, signature, or genomic background), and its name.
selected_events_boolean.csv: contains true/false values for a simple selection test on each alteration type (column) and each gene (row).
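As a rough sketch of how the boolean selection table might be consumed, the snippet below lists, for each gene, the alteration types flagged as selected. The column names, the "True"/"False" spelling, and the toy rows are all assumptions for illustration; check the header of the real selected_events_boolean.csv before adapting it. selected_genes is a hypothetical helper.

```python
import csv
import io

# Made-up stand-in for demo/summary/selected_events_boolean.csv.
# Header names and gene labels are illustrative, not taken from the real file.
toy_csv = """gene,mutation,amplification,deletion
GENE_A,True,False,False
GENE_B,False,False,False
GENE_C,False,True,True
"""


def selected_genes(handle):
    """Map each gene to the alteration types whose flag is true."""
    reader = csv.reader(handle)
    header = next(reader)
    alteration_types = header[1:]  # assumes the first column holds the gene
    result = {}
    for row in reader:
        gene, flags = row[0], row[1:]
        hits = [
            alteration
            for alteration, flag in zip(alteration_types, flags)
            if flag.strip().lower() == "true"
        ]
        if hits:  # genes with no selected alteration are skipped
            result[gene] = hits
    return result


selected = selected_genes(io.StringIO(toy_csv))
for gene, hits in selected.items():
    print(gene, "->", ", ".join(hits))
```

To read the real file, pass an open file handle in place of the io.StringIO object.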

To reproduce the core CanSRMaPP workflow (~30 minutes):

cd demo
./build.sh
./full-solve.sh
./polish.sh
./validate.sh

Output from the final command should resemble:

Feature weight agreement with publication (pearson)
PearsonRResult(statistic=0.9999972289807557, pvalue=0.0)
Feature identification agreement with publication (jaccard,differences)
          Local run           |         Publication
-------------------------------------------------------------
       only        |       common       |       only
         0         |         90         |         0

============================================================
Detected GPU.
TCGA-LUAD [training] frequency agreement (pearson) :
PearsonRResult(statistic=0.9750266, pvalue=0.0)
TCGA-CPTAC [evaluation] frequency agreement (pearson) :
PearsonRResult(statistic=0.89953285, pvalue=0.0)

This indicates that the 90 CanSRMaPP features identified locally are those recovered by the authors, and that their deviation from the authors' values is less than one part in 10⁵.
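The two kinds of agreement reported in that output, a Pearson correlation between weight vectors and a set comparison of selected features, can be sketched in plain Python. This is not the code in validate.sh; pearson and compare_feature_sets are hypothetical helpers, and the numbers and feature names are toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def compare_feature_sets(local, published):
    """Count features unique to each run, shared features, and the Jaccard index."""
    common = local & published
    union = local | published
    return {
        "local_only": len(local - published),
        "common": len(common),
        "publication_only": len(published - local),
        "jaccard": len(common) / len(union) if union else 1.0,
    }


# Toy values only: nearly identical weights and identical feature sets.
local_weights = [0.5, 1.0, 1.5, 2.0]
published_weights = [0.5, 1.0, 1.5, 2.0001]
r = pearson(local_weights, published_weights)
agreement = compare_feature_sets({"f1", "f2", "f3"}, {"f1", "f2", "f3"})
print(f"pearson r = {r:.6f}", agreement)
```

A Jaccard index of 1.0 with zero features in either "only" column corresponds to the 0 | 90 | 0 row in the table above.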

Note

Anecdotally, you can expect a single cycle of cmsolver to take about 70 seconds on a GPU and up to 20 minutes when parallelized over multiple CPUs; GPU runtime may be slower on WSL. test-solve.sh runs for one cycle, while full-solve.sh runs for twenty.

Parallelization largely takes place in backends handled by numpy, scipy, and pytorch, so to limit parallelization, follow the procedures those modules document for setting environment variables.
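One common way to cap thread usage is sketched below. It assumes the numerical backends are built against OpenMP, MKL, or OpenBLAS, which depends on how numpy, scipy, and PyTorch were installed; the value 4 is arbitrary, and nothing here is specific to CanSRMaPP.

```python
import os

# These variables are read when the BLAS/OpenMP runtimes are first loaded,
# so set them before importing numpy, scipy, or torch.
MAX_THREADS = "4"  # illustrative value; choose what suits your machine
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = MAX_THREADS

try:
    import torch

    # Caps the threads PyTorch uses for intra-op CPU parallelism.
    torch.set_num_threads(int(MAX_THREADS))
except ImportError:
    pass  # PyTorch is not installed in this environment
```

When running the demo scripts from a shell, exporting the same variables (for example, export OMP_NUM_THREADS=4) before invoking them should have a similar effect.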

Redistributed data sources

CanSRMaPP relies on a number of third-party files for reference and reconciling multiple data sources. This document describes the provenance of all such files, and hosts frozen copies since some may be updated in-place by the maintainers.

Cancer Genomic Data

Cancer genomic data was downloaded from the Genomic Data Commons on February 2, 2024. Because this data is subject to varying degrees of controlled access, it cannot be redistributed here in its original form. Binarized alteration states and signature activities, which constitute a de-identified data derivative under the NIH universal Data Use Certification, are hosted here and on Zenodo. Gene-level alteration states for the TCGA LUAD cohort are located in data/omics_tcga_luad.csv.gz; those for the CPTAC LUAD cohort are in data/omics_cptac_luad.csv.gz. Signature activities for the TCGA LUAD cohort are in data/signatures_tcga_luad.csv.gz.

Gene Info

Homo_sapiens.gene_info was downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz on November 3, 2024. This file is unrestricted as described here.

GRCh38 genomic annotation

GCF_000001405.40_GRCh38.p14_genomic.gff.gz was downloaded from this FTP directory on November 12, 2024. This file is unrestricted according to these terms. The reduced file gff_reduced.gff.gz, derived from it, is the result of running the command

gunzip -c GCF_000001405.40_GRCh38.p14_genomic.gff.gz | awk -F'\t' '$0 !~ /^#/ && $3 == "gene" && $9 ~ /GeneID/' | gzip -c > gff_reduced.gff.gz
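For readers who prefer Python to awk, the same filter can be sketched as follows, assuming the field separator is a tab as in the GFF3 format. keep_gene_line and reduce_gff are hypothetical helpers, not part of cansrmapp, and the sample lines are made up.

```python
import gzip


def keep_gene_line(line: str) -> bool:
    """Mirror the awk test: not a comment, type is "gene", attributes mention GeneID."""
    if line.startswith("#"):
        return False
    fields = line.rstrip("\n").split("\t")
    return len(fields) >= 9 and fields[2] == "gene" and "GeneID" in fields[8]


def reduce_gff(src_path: str, dst_path: str) -> int:
    """Copy matching lines from one gzipped GFF to another; return the count kept."""
    kept = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            if keep_gene_line(line):
                dst.write(line)
                kept += 1
    return kept


# Made-up lines showing which records survive the filter.
sample = [
    "##gff-version 3\n",
    "chrToy\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene-TOY1;Dbxref=GeneID:1\n",
    "chrToy\tsrc\texon\t100\t200\t.\t+\t.\tParent=rna-TOY1;Dbxref=GeneID:1\n",
    "chrToy\tsrc\tgene\t950\t990\t.\t-\t.\tID=gene-TOY2\n",
]
kept_lines = [line for line in sample if keep_gene_line(line)]
print(len(kept_lines), "of", len(sample), "lines kept")
```

Calling reduce_gff("GCF_000001405.40_GRCh38.p14_genomic.gff.gz", "gff_reduced.gff.gz") would then be the counterpart of the shell pipeline.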

NeSTv0

"NeSTv0" is a precursor of the interaction map found in Zheng, Kelly, et al., 2021, prior to filtering for mutation-enriched systems. It is distributed here as nest.pickle with permission from the authors, and is subject to the license governing this repository. The file contains a dict object mapping each system to a set of member gene Entrez IDs. Because systems in this file are named Clusterx-y, an additional file, NeST_map_1.5_default_node_Nov20.csv, is incorporated to map these to their NEST IDs as published.
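A minimal sketch of working with a dict of this shape is shown below. It writes a toy pickle to a temporary directory and reads it back rather than touching nest.pickle; the system names and Entrez IDs are made up, whether the real IDs are stored as integers or strings is an assumption, and load_systems and genes_to_systems are hypothetical helpers.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path

# Toy stand-in with the documented shape: system name -> set of Entrez gene IDs.
toy_systems = {"Cluster1-1": {1, 2, 3}, "Cluster2-4": {2, 5}}


def load_systems(path):
    """Load a pickled dict of systems. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def genes_to_systems(systems):
    """Invert the mapping: gene ID -> set of systems containing that gene."""
    index = defaultdict(set)
    for system, genes in systems.items():
        for gene in genes:
            index[gene].add(system)
    return dict(index)


with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_nest.pickle"
    toy_path.write_bytes(pickle.dumps(toy_systems))
    systems = load_systems(toy_path)

membership = genes_to_systems(systems)
print(sorted(membership[2]))
```

Translating the Clusterx-y names to published NEST IDs would additionally require NeST_map_1.5_default_node_Nov20.csv, whose column layout is not described here.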

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
