This repository summarises the TCGA expression data collated from various cancer types and the steps taken to harmonise them.
Normalised expression data across 33 cancer types, integrated by the TCGA Pan-Cancer Atlas initiative (PanCanAtlas, see the paper The Cancer Genome Atlas Pan-Cancer analysis project), are available on the Genomic Data Commons (GDC) and the accompanying Cell paper webpages.
Data
The RNA batch corrected matrix file is EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv.
Data source
Genomic Data Commons (GDC)
Follow the steps below to download the published TCGA expression files and convert the individual datasets into read count matrices ready for downstream analyses.
Expression data from 33 cancer types (<projects>) are publicly available in TCGA. The individual TCGA projects are summarised in the projects summary table.
Download and prepare the TCGA expression read count data for each of the 33 cancer types (<project>, see TCGA projects summary table) using the TCGAbiolinks_transcriptome_profiling_data.R script described here. The TCGAbiolinks package is used to download the most recent data files by accessing the National Cancer Institute (NCI) GDC through its Application Programming Interface (API).
Rscript TCGAbiolinks_transcriptome_profiling_data.R --out_dir <project> --project_id TCGA-<project> --tissue 1 --workflow Counts
NOTE: for Acute Myeloid Leukaemia (LAML) the --tissue argument was set to 3 (Primary Blood Derived Cancer - Peripheral Blood).
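The per-project invocation, including the LAML exception, can be sketched as a small helper that builds the command line for each project. This is a minimal sketch, not part of the repository: the `PROJECTS` list is a hypothetical three-project subset (the full list of 33 is in the projects summary table), and the helper only prints the commands rather than running them.

```python
import shlex

# Hypothetical subset of TCGA project codes, for illustration only;
# the full list of 33 projects is in the projects summary table.
PROJECTS = ["BRCA", "LAML", "LUAD"]

# LAML uses tissue code 3 (Primary Blood Derived Cancer - Peripheral Blood);
# every other project uses tissue code 1.
TISSUE_OVERRIDES = {"LAML": "3"}


def build_command(project):
    """Return the Rscript invocation for one TCGA project as an argument list."""
    tissue = TISSUE_OVERRIDES.get(project, "1")
    return [
        "Rscript", "TCGAbiolinks_transcriptome_profiling_data.R",
        "--out_dir", project,
        "--project_id", f"TCGA-{project}",
        "--tissue", tissue,
        "--workflow", "Counts",
    ]


# Print one command per project; pass the list to subprocess.run(..., check=True)
# instead if the commands should actually be executed.
for project in PROJECTS:
    print(shlex.join(build_command(project)))
```

Keeping the exception in a lookup table rather than an `if` branch makes it easy to add further project-specific tissue codes should they be needed.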
The output data for each cancer type (<project>) follows the directory structure below:
|
|____<project>
|
|____transcriptome_profiling
|
|____Counts
|____Counts.exp
|____Counts_boxplot.pdf
|____Counts_clinical_info.txt
|____Counts_samples.txt
|____gdc-client
|____gdc-client_v1.1.0_OSX_x64.zip
|____gdc_manifest.txt
|____R_parameters.txt
|____GDCdata
|
|____<project>
|
|____harmonized
|
|____Transcriptome_Profiling
| |
| |____Gene_Expression_Quantification
| |____…
| |____…
|
|____Clinical
|
|____Clinical_Supplement
|____…
|____…
The two key output files used for downstream analyses are:
File | Description |
---|---|
Counts.exp | Read count data matrix |
Counts_samples.txt | Combined samples annotation and associated clinical information |
TCGA barcodes, which encode the metadata of TCGA participants and their samples, are described [here](TCGA barcodes).
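To make the barcode layout concrete, here is a minimal sketch that splits a full aliquot-level barcode into its fields. It assumes the standard seven-field aliquot barcode (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, centre). The names `TcgaBarcode` and `parse_aliquot_barcode` are hypothetical and do not come from the repository scripts, and the lookup table lists only three sample type codes.

```python
from typing import NamedTuple

# Small subset of TCGA sample type codes, for illustration only.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "11": "Solid Tissue Normal",
}


class TcgaBarcode(NamedTuple):
    project: str
    tss: str          # tissue source site
    participant: str
    sample_type: str  # two-digit code, e.g. "01"
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str

    @property
    def patient_barcode(self):
        """First three fields, i.e. the participant-level barcode."""
        return "-".join((self.project, self.tss, self.participant))


def parse_aliquot_barcode(barcode):
    """Split a seven-field aliquot barcode into named components."""
    parts = barcode.split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    project, tss, participant, sample, portion, plate, center = parts
    return TcgaBarcode(project, tss, participant, sample[:2], sample[2:],
                       portion[:2], portion[2:], plate, center)


parsed = parse_aliquot_barcode("TCGA-02-0001-01C-01D-0182-01")
print(parsed.patient_barcode, SAMPLE_TYPES.get(parsed.sample_type, "other"))
```

The sample type codes 01 and 03 correspond to the `--tissue 1` and `--tissue 3` settings used when downloading the data.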
Data source
Genomic Data Commons (GDC)
The expression read count data from each of the 33 cancer types were then cleaned based on the quality metrics provided in the Merged Sample Quality Annotations file merged_sample_quality_annotations.tsv from the TCGA PanCanAtlas initiative webpage.
The exclusion criteria (deliberately strict, to minimise data variation due to unwanted factors) are listed in the table below. A sample was excluded if at least one of the criteria was met.
Column | Inclusion | Exclusion |
---|---|---|
patient_annotation * | No comments | Issue (Observation or Notification) reported |
sample_annotation | No comments | Issue (Notification *) reported |
aliquot_annotation | No comments | Issue (CenterNotification) reported |
AWG_excluded_because_of_pathology * (see also AWG_pathology_exclusion_reason **) | 0 | 1 |
Do_not_use | 0 | 1 |
* Except for LAML project where Notification is due to "Alternate sample pipeline: Biospecimens (...) were processed into analyte outside of the normal TCGA standardized laboratory pipeline"
** AWG - Analysis Working Group
In total, 1001 samples were excluded and 9438 samples remained for downstream analyses (see the TCGA projects summary table for per-project summary).
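The exclusion rule in the table above can be sketched as a single predicate applied to each row of the quality annotations file. This is a minimal sketch under several assumptions, not the code used to produce the counts reported here: an empty annotation field is taken to mean "no comments", the two flag columns are assumed to be coded as 0/1 as shown in the table (the actual file may encode them differently), and the LAML exception is applied to every annotation column. The function name `is_excluded` and the demo rows are hypothetical.

```python
ANNOTATION_COLUMNS = ("patient_annotation", "sample_annotation", "aliquot_annotation")
FLAG_COLUMNS = ("AWG_excluded_because_of_pathology", "Do_not_use")

# Text of the LAML notification that does not trigger exclusion (see footnote *).
LAML_EXEMPT_TEXT = "Alternate sample pipeline"


def is_excluded(row, project):
    """Return True if the row meets at least one exclusion criterion."""
    for column in ANNOTATION_COLUMNS:
        note = (row.get(column) or "").strip()
        if not note:
            continue  # no comments: criterion not met
        if project == "LAML" and LAML_EXEMPT_TEXT in note:
            continue  # tolerated LAML notification
        return True
    # Assumption: flags are stored as the strings "0" and "1".
    return any((row.get(column) or "").strip() == "1" for column in FLAG_COLUMNS)


# Synthetic rows for illustration; real rows could be read with
# csv.DictReader(handle, delimiter="\t") from merged_sample_quality_annotations.tsv.
clean = {"patient_annotation": "", "sample_annotation": "", "aliquot_annotation": "",
         "AWG_excluded_because_of_pathology": "0", "Do_not_use": "0"}
flagged = dict(clean, Do_not_use="1")
laml_note = dict(clean, sample_annotation="Notification: Alternate sample pipeline")

for name, row, project in [("clean", clean, "BRCA"),
                           ("flagged", flagged, "BRCA"),
                           ("laml_note", laml_note, "LAML")]:
    print(name, "excluded" if is_excluded(row, project) else "kept")
```

Because the predicate returns on the first criterion that is met, a row is dropped as soon as any one annotation or flag indicates a problem, which matches the "at least one criterion" rule.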
Data source
Genomic Data Commons (GDC)
The Pan-Cancer clinical data were integrated by the TCGA Pan-Cancer Clinical Data Resource and are described in the paper An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.
The TCGA Pan-Cancer Clinical Data Resource provides clinical annotations for TCGA data and recommendations for the use of clinical endpoints.
Data
It is strongly recommended to use the published TCGA-CDR-SupplementalTableS1.xlsx file for clinical elements and survival outcome data in order to drive high-quality analyses.
Data source
Genomic Data Commons (GDC)