Cohort selection portal: https://www.cancerimagingarchive.net/histopathology-imaging-on-tcia/
- Cancer location: Lung
Note: for some reason, Adenocarcinoma and Squamous Cell Carcinoma do not show up if you set "Cancer Type" to "Lung Cancer".
Data selection portal: https://cancerimagingarchive.net/datascope/cptac/home/
- Topographic_Site: Lung
Lung Cohorts
- LUAD (Lung Adenocarcinoma)
- LSCC (Lung Squamous Cell Carcinoma)
Clicking "download cohort" downloads: ./tcia-luad-lusc-cohort.csv
The download guide below was sent by the TCIA Portal support team: https://github.com/kirbyju/TCIA_Notebooks/blob/main/TCIA_Aspera_CLI_Downloads.ipynb
Getting the data onto a remote server if you cannot download to the server directly
- If you do not have 450 GB free on your computer (enough to download and upload one package at a time), get an external hard drive and make it writable.
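  A quick way to check free space on the drive (the mount point below is a hypothetical example):

  ```bash
  # show free space on the external drive
  df -h /Volumes/ExternalDrive
  ```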
- Download LUAD and LSCC with Aspera Connect onto the external drive (one at a time if needed); the download location can be changed in the Aspera Connect app's settings. Aspera Connect verifies the integrity of the files, so at this point the files should be intact.
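  To confirm the transfer looks complete before hashing, a quick tally (run from the download directory on the external drive):

  ```bash
  # count the downloaded slides and report the total size
  find . -type f -name "*.svs" | wc -l
  du -sh .
  ```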
-
- Compute `md5sum` hashes for the locally downloaded files; we will compare them with the hashes computed after uploading the data to the remote server.
  - Linux with `md5sum`:

    ```bash
    # sorting makes it easier to compare with the remote server version
    find . -type f -name "*.svs" | sort | xargs md5sum > local_hashes.txt
    ```
  - Mac with `md5`:

    ```bash
    # adding -r produces Linux-like output format
    # sorting makes it easier to compare with the remote server version
    find . -type f -name "*.svs" | sort | xargs md5 -r > local_hashes.txt
    ```
  - Mac with `md5` and `parallel` to make it faster:

    ```bash
    brew install parallel
    # adding -r produces Linux-like output format
    # sorting makes it easier to compare with the remote server version
    # parallel -k keeps the output in input order; omitting -k might not preserve the order
    find . -type f -name "*.svs" | sort | parallel -k md5 -r > local_hashes.txt
    ```
  - Linux with `md5sum` and `parallel` works analogously: use `md5sum` in place of `md5 -r` in the command above.
- Upload from the external drive onto the cluster using `rsync` with the `-aP` flag. See an rsync tutorial for details.
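  A hedged example (the user, host, and paths below are hypothetical):

  ```bash
  # -a preserves permissions and timestamps; -P shows progress and resumes partial transfers
  rsync -aP /Volumes/ExternalDrive/LUAD/ user@cluster.example.edu:/data/tcia/LUAD/
  ```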
- Compute the remote server hashes:

  ```bash
  # sorting makes it easier to compare with the local version
  find . -type f -name "*.svs" | sort | xargs md5sum > server_hashes.txt
  ```
- Copy `server_hashes.txt` to the local machine, or `local_hashes.txt` to the server, using `scp` or `rsync`.
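  For example (the user, host, and path are hypothetical):

  ```bash
  # pull the server-side hash list down to the local machine
  scp user@cluster.example.edu:/data/tcia/server_hashes.txt .
  ```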
- Check the md5sum hashes (local download vs. server):

  ```bash
  # the format might still differ (2 spaces vs. 1 space between the hash and the file path)
  # use something like this to replace 2 spaces with 1:
  # sed 's/  / /' server_hashes.txt > formatted_server_hashes.txt
  # check the difference; if
  # 1. the upload went without errors,
  # 2. the contents of the files are sorted, and
  # 3. the contents of the files have the same format,
  # then this command should produce empty output and you can be reasonably sure you succeeded
  diff local_hashes.txt formatted_server_hashes.txt
  ```
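  An alternative single-command check, assuming your `local_hashes.txt` was produced by GNU `md5sum` (two-space format) and has been copied to the server:

  ```bash
  # run on the server from the data directory; re-hashes every file and compares it to the local list
  md5sum -c local_hashes.txt
  ```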
Disclaimer: my md5sum hashes are in md5sum_hashes.txt, but they are not the official ones; I was not able to find any md5sum hashes shared by the TCIA team.