
NOT READY FOR PUBLIC RELEASE. Procedure for classification of gene expression data from the TARGET dataset into cancer type (or normal)


andrew-weisman/target_classification


TARGET classification workflow (using the GDC Data Portal)

Set up the directory structure:

project_dir="/data/BIDS-HPC/private/projects/dmi2"
working_dir="/home/weismanal/notebook/2020-06-10/dmi"
mkdir -p "$project_dir" "$working_dir"
cd "$working_dir"
git clone git@github.com:andrew-weisman/target_classification.git "$project_dir/checkout"
mkdir "$project_dir/data"

Note: The alternative approach, which uses data taken directly from the TARGET data website (rather than from the GDC Data Portal), is in the target_data_website branch of this repository.

Download the manifest for all the gene expression quantification files in the TARGET program (click on the blue "Manifest" button):

[Screenshot: all_gene_expression_files_in_target.png]

Save the downloaded manifest file as $project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt.

In addition, click on the blue "Add All Files to Cart" button, go to the cart (top right of page), click on the two blue buttons "Sample Sheet" and "Metadata", and save the resulting two files to $project_dir/data. The two files will be named, e.g., gdc_sample_sheet.2020-07-02.tsv and metadata.cart.2020-07-02.json.

Note that these 5,149 files correspond to 1,192 cases (that is, individual people).
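
The file and case counts can be cross-checked against the downloaded sample sheet. The following is a minimal sketch, assuming the sheet is tab-separated with one row per file and has a header column named "Case ID"; the function name count_unique_in_column is hypothetical and not part of this repository.

```shell
# Print "<data rows> <unique values>" for one named column of a TSV file.
# Usage: count_unique_in_column FILE COLUMN_NAME
# Assumption: the first line of FILE is a header row.
count_unique_in_column() {
    awk -F'\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        !seen[$idx]++ { unique++ }   # first time this value is seen
        { rows++ }                   # every data row counts as one file
        END { if (idx) print rows + 0, unique + 0 }
    ' "$1"
}
```

For example, `count_unique_in_column "$project_dir/data/gdc_sample_sheet.2020-07-02.tsv" "Case ID"` would be expected to print `5149 1192` if the sheet matches the figures quoted above.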

On Helix, download the expression files listed in the manifest:

module load gdc-client
mkdir "$project_dir/data/all_gene_expression_files_in_target"
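cd "$project_dir/data/all_gene_expression_files_in_target"  # explicit path; "!!:1" history expansion only works interactively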
gdc-client download -m "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt"
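
Before extracting anything, the download can be checked independently against the manifest. This is a sketch only, assuming the manifest is tab-separated with a header row and the columns id, filename and md5 in that order, and that each file was saved as <id>/<filename> under the download directory; verify_manifest is a hypothetical helper name.

```shell
# List manifest entries whose file is missing or whose MD5 does not match.
# Usage: verify_manifest MANIFEST DOWNLOAD_DIR
# Prints nothing when every file is present and matches.
verify_manifest() {
    # Skip the header line, then read the first three tab-separated fields.
    tail -n +2 "$1" | while IFS="$(printf '\t')" read -r id filename md5 rest; do
        path="$2/$id/$filename"
        if [ ! -f "$path" ]; then
            echo "MISSING $path"
        elif [ "$(md5sum < "$path" | awk '{print $1}')" != "$md5" ]; then
            echo "BAD_MD5 $path"
        fi
    done
}
```

It would be run from the download directory as `verify_manifest "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" .` and must be run before the gunzip step, since extraction changes the file names and contents.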

Extract the resulting compressed files and link to them from a single folder $project_dir/data/all_gene_expression_files_in_target/links:

mkdir links
cd links  # explicit path; "!!:1" history expansion only works interactively
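# Decompress every downloaded .gz file in place
find ../ -iname "*.gz" -exec gunzip {} +
# Link every regular file except gdc-client logs and annotations into the current directory
find ../ -type f | grep -v "/logs/\|/annotations\.txt" | while IFS= read -r file; do ln -s "$file" .; done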
ln -s "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" MANIFEST.txt
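
Because the links are relative, it is worth confirming that none of them is dangling. A minimal sketch, assuming a find that supports -maxdepth (GNU and BSD find do); broken_links is a hypothetical helper name.

```shell
# Print every symbolic link directly inside DIR whose target does not exist.
# Usage: broken_links DIR
broken_links() {
    # "test -e" follows the link, so it fails exactly for dangling links.
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running `broken_links .` inside the links folder should print nothing if every link resolves.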

Note that

ls | grep -v MANIFEST.txt | awk -F. '{print $1}' | sort -u | wc -l

suggests that there are 2,481 unique expression files (independent of normalization). This count is based only on the part of each filename before the first dot, however, and is not actually correct.
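
To see how the linked files break down by the rest of their names, the suffixes can be tallied in the same way. This is an illustrative sketch rather than part of the workflow; count_by_suffix is a hypothetical helper, and no particular set of suffixes is assumed.

```shell
# For each filename suffix (everything after the first "."), print the suffix
# and the number of files in DIR that carry it, tab-separated and sorted.
# Usage: count_by_suffix DIR
count_by_suffix() {
    ls "$1" | grep -v '^MANIFEST\.txt$' |
        awk '{
                 i = index($0, ".")                       # position of first dot
                 s = i ? substr($0, i + 1) : "(none)"     # suffix after that dot
                 n[s]++
             }
             END { for (s in n) print s "\t" n[s] }' | sort
}
```

Running `count_by_suffix .` inside the links folder gives one line per suffix, which makes it easy to compare the number of files per suffix with the prefix count above.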

Start an interactive allocation, using, e.g.,

sinteractive --mem=40g # --mem=20g may be fine

Work through the Python Jupyter notebook /data/BIDS-HPC/private/projects/dmi2/checkout/main.ipynb, using the conda environment /data/BIDS-HPC/public/software/conda/envs/r_env. (Note that this environment contains pandas version 1.1.0, whereas Biowulf's default python module has pandas version 0.24.2, which is insufficient.) See here for more notes on the environment.
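
A quick guard against running the notebook with the wrong Python can be sketched as follows. It assumes GNU sort (for the -V version ordering) and treats 1.1.0 as the minimum pandas version, which is an assumption: the text above only says that 1.1.0 works and 0.24.2 does not. version_ge is a hypothetical helper name.

```shell
# Succeed (exit status 0) when dotted version A is greater than or equal to B.
# Usage: version_ge A B
version_ge() {
    # With version sort, the smaller of the two comes first; A >= B exactly
    # when that smaller one is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

With the environment active, a check could look like `version_ge "$(python -c 'import pandas; print(pandas.__version__)')" 1.1.0 || echo "pandas is too old"`.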
