Set up the directory structure:
project_dir="/data/BIDS-HPC/private/projects/dmi2"
working_dir="/home/weismanal/notebook/2020-06-10/dmi"
mkdir -p "$project_dir" "$working_dir"
cd "$working_dir"
git clone git@github.com:andrew-weisman/target_classification.git "$project_dir/checkout"
mkdir "$project_dir/data"
Note: The work that uses data directly from the TARGET data website (as opposed to the GDC Data Portal) is in the target_data_website branch of this repository.
Download the manifest for all the gene expression quantification files in the TARGET program (click on the blue "Manifest" button):
Save the downloaded manifest file as $project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt.
In addition, click the blue "Add All Files to Cart" button, go to the cart (top right of the page), click the two blue buttons "Sample Sheet" and "Metadata", and save the two resulting files to $project_dir/data. The files will be named, e.g., gdc_sample_sheet.2020-07-02.tsv and metadata.cart.2020-07-02.json.
Note that these 5,149 files correspond to 1,192 cases (a case is an individual person).
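The file and case counts above can be cross-checked against the downloaded sample sheet. The following is a minimal sketch, not part of the original workflow: count_unique_in_column is a hypothetical helper, and it assumes the sample sheet is tab-separated with a header row containing a column named "Case ID" and one case per row.

```shell
# Hypothetical helper: count data rows and distinct values of one named
# column in a tab-separated file with a header row.
# Usage: count_unique_in_column <tsv_file> <column_name>
count_unique_in_column() {
    local tsv_file="$1" column_name="$2"
    awk -F'\t' -v name="$column_name" '
        NR == 1 {
            # Locate the requested column by its header text.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next
        }
        { rows++; seen[$col] = 1 }
        END {
            if (col) {
                n = 0
                for (k in seen) n++
                printf "%d rows, %d unique\n", rows, n
            }
        }
    ' "$tsv_file"
}

# Example call (assumed column name; adjust to the actual header):
# count_unique_in_column "$project_dir/data/gdc_sample_sheet.2020-07-02.tsv" "Case ID"
```

If the column name differs in the actual file, the function prints an error and returns a non-zero status rather than reporting a misleading count.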
Download the expression files from the manifest on Helix:
module load gdc-client
mkdir "$project_dir/data/all_gene_expression_files_in_target"
cd "$project_dir/data/all_gene_expression_files_in_target"
gdc-client download -m "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt"
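Large downloads can be interrupted, so it may be worth confirming that every file in the manifest arrived. This sketch assumes that the manifest is tab-separated with a header row and the file id in the first column, and that gdc-client places each file in a subdirectory named after its id. The helper name missing_downloads is hypothetical.

```shell
# Hypothetical helper: print every manifest id that has no matching
# subdirectory under the download directory.
# Usage: missing_downloads <manifest_file> <download_dir>
missing_downloads() {
    local manifest="$1" download_dir="$2" file_id
    # Skip the header row; the first tab-separated field is assumed to be the id.
    tail -n +2 "$manifest" | cut -f1 | while IFS= read -r file_id; do
        [ -d "$download_dir/$file_id" ] || printf '%s\n' "$file_id"
    done
}

# Example call (prints nothing when all downloads are present):
# missing_downloads "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" .
```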
Extract the resulting compressed files and link to them from a single folder, $project_dir/data/all_gene_expression_files_in_target/links:
mkdir links
cd links
find ../ -iname "*.gz" -exec gunzip {} +
find ../ -type f ! -path "*/logs/*" ! -name "annotations.txt" -exec ln -s {} . \;
ln -s "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" MANIFEST.txt
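Because the links use relative targets, they break if the folder is moved or if a target file is renamed. As a quick sanity check, the sketch below lists any symbolic links in a directory whose targets no longer exist. The helper name broken_links is hypothetical and not part of the repository.

```shell
# Hypothetical helper: print each symbolic link in a directory whose
# target does not exist.
# Usage: broken_links <dir>
broken_links() {
    local dir="$1" entry
    for entry in "$dir"/*; do
        # -L is true for a symlink; -e follows the link and fails if the target is missing.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

# Example call from inside the links folder (no output means all links resolve):
# broken_links .
```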
Note that
ls | grep -v MANIFEST.txt | awk -F. '{print $1}' | sort -u | wc -l
suggests that there are 2,481 unique expression files (independent of normalization). This count is based only on the filenames, however, and is not actually correct.
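A complementary view is to tally the links by everything after the first dot in the filename, which shows how many files exist per file type. This is a sketch only: count_by_suffix is a hypothetical helper, and, like the command above, it looks only at filenames, so it shares the same limitation.

```shell
# Hypothetical helper: tally the entries of a directory by the part of the
# filename after the first dot, ignoring MANIFEST.txt.
# Usage: count_by_suffix <dir>
count_by_suffix() {
    local dir="$1"
    ls "$dir" \
        | grep -v '^MANIFEST\.txt$' \
        | sed 's/^[^.]*\.//' \
        | sort \
        | uniq -c \
        | awk '{print $2 ": " $1}'
}

# Example call from inside the links folder:
# count_by_suffix .
```

Each output line has the form "suffix: count", so a file type with noticeably fewer files than the others stands out immediately.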
Start an interactive allocation, using, e.g.,
sinteractive --mem=40g # --mem=20g may be fine
Go through the Python Jupyter notebook /data/BIDS-HPC/private/projects/dmi2/checkout/main.ipynb. Use the conda environment /data/BIDS-HPC/public/software/conda/envs/r_env. (Note that this environment contains pandas version 1.1.0, whereas Biowulf's default python module has pandas version 0.24.2, which is insufficient.) See here for more notes on the environment.
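Before running the notebook, it can help to confirm which pandas version the active environment provides. The sketch below defines a hypothetical version_at_least helper that relies on GNU sort -V. The 1.1.0 threshold in the example is only the version known to work from the note above; the true minimum required version is not stated here.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to
# version $2 under version ordering (requires GNU sort with -V).
# Usage: version_at_least <actual_version> <required_version>
version_at_least() {
    # The smaller of the two versions sorts first; it must be the required one.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (commented out because it needs the conda environment to be active;
# 1.1.0 is an assumed threshold, not a documented minimum):
# pandas_version=$(python -c 'import pandas; print(pandas.__version__)')
# version_at_least "$pandas_version" 1.1.0 || echo "pandas $pandas_version may be too old"
```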