TCGA expression data

This repository summarises the TCGA expression data collated from various cancer types and the steps used to harmonise them.

Pan-Cancer normalised data

Normalised expression data across 33 cancer types, integrated by the TCGA Pan-Cancer Atlas (PanCanAtlas) initiative (see the paper The Cancer Genome Atlas Pan-Cancer analysis project), are available on the Genomic Data Commons (GDC) and the accompanying Cell paper webpages.

Data

The batch-corrected RNA expression matrix file is EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv.

Data source

Genomic Data Commons (GDC)

GDC counts data

Follow the steps below to download the published TCGA expression files and convert the individual datasets into read count matrices ready for downstream analyses.

TCGA projects

Expression data from 33 cancer types (<projects>) are publicly available in TCGA. The individual TCGA projects are summarised in the projects summary table.

Data download

Download and prepare the TCGA expression read count data for each of the 33 cancer types (<project>, see the TCGA projects summary table) using the TCGAbiolinks_transcriptome_profiling_data.R script described here. The TCGAbiolinks package downloads the most recent data files from the National Cancer Institute (NCI) GDC through its Application Programming Interface (API).

Rscript TCGAbiolinks_transcriptome_profiling_data.R --out_dir <project> --project_id TCGA-<project> --tissue 1 --workflow Counts

NOTE: for the Acute Myeloid Leukaemia (LAML) project, the --tissue argument was set to 3 (Primary Blood Derived Cancer - Peripheral Blood).
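Since the same command is repeated for every project, it can help to generate the command lines programmatically. The following is a minimal sketch, assuming the script sits in the working directory and that only the flags shown above are needed; the helper name `build_download_command` is hypothetical and not part of this repository.

```python
import shlex

# Script name taken from the command shown above.
SCRIPT = "TCGAbiolinks_transcriptome_profiling_data.R"


def build_download_command(project, workflow="Counts"):
    """Return the Rscript argument list for one TCGA project code, e.g. 'BRCA'."""
    # LAML uses --tissue 3 (Primary Blood Derived Cancer - Peripheral Blood);
    # every other project uses --tissue 1, as in the command above.
    tissue = 3 if project == "LAML" else 1
    return [
        "Rscript", SCRIPT,
        "--out_dir", project,
        "--project_id", f"TCGA-{project}",
        "--tissue", str(tissue),
        "--workflow", workflow,
    ]


if __name__ == "__main__":
    # Illustrative subset; replace with the project codes from the summary table.
    for project in ["BRCA", "LAML"]:
        print(shlex.join(build_download_command(project)))
```

The printed lines can be pasted into a shell, or each argument list can be passed to `subprocess.run` to launch the downloads directly.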

Output data directory structure

The output data for each cancer type (<project>) follows the directory structure below:

|
|____<project>
  |
  |____transcriptome_profiling
    |
    |____Counts
      |____Counts.exp
      |____Counts_boxplot.pdf
      |____Counts_clinical_info.txt
      |____Counts_samples.txt
      |____gdc-client
      |____gdc-client_v1.1.0_OSX_x64.zip
      |____gdc_manifest.txt
      |____R_parameters.txt
      |____GDCdata
        |
        |____<project>
          |
          |____harmonized
            |
            |____Transcriptome_Profiling
            | |
            | |____Gene_Expression_Quantification
            |   |____…
            |   |____…
            |
            |____Clinical
              |
              |____Clinical_Supplement
                |____…
                |____…

The two key output files used for downstream analyses are:

| File | Description |
|---|---|
| Counts.exp | Read count data matrix |
| Counts_samples.txt | Combined samples annotation and associated clinical information |
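To check that a download finished, the two key files can be located from the directory layout above. This is a small sketch under the assumption that each `<project>` directory sits directly under a common root; the function names are hypothetical.

```python
from pathlib import Path

# The two key output files named in the table above.
KEY_FILES = ("Counts.exp", "Counts_samples.txt")


def key_output_files(project, root="."):
    """Map each key file name to its expected path for one project."""
    base = Path(root) / project / "transcriptome_profiling" / "Counts"
    return {name: base / name for name in KEY_FILES}


def missing_key_files(project, root="."):
    """List the key files that are not present on disk for one project."""
    return [
        name
        for name, path in key_output_files(project, root).items()
        if not path.is_file()
    ]


if __name__ == "__main__":
    for name, path in key_output_files("BRCA").items():
        print(name, "->", path)
    print("missing:", missing_key_files("BRCA"))
```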

The TCGA barcodes, which represent the metadata of the TCGA participants and their samples, are described in the TCGA barcodes documentation.
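As an illustration of how a barcode encodes sample metadata, the sketch below splits a full aliquot barcode into its fields. It handles only the complete seven-part form, and the label table covers just three sample type codes. Treating the `--tissue` values 1 and 3 as the sample type codes 01 and 03 is an assumption based on the LAML note above, not something this repository states.

```python
import re

# Full aliquot barcode:
# TCGA-<TSS>-<participant>-<sample type><vial>-<portion><analyte>-<plate>-<center>
BARCODE_PATTERN = re.compile(
    r"^(?P<project>TCGA)-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample_type>\d{2})(?P<vial>[A-Z])"
    r"-(?P<portion>\d{2})(?P<analyte>[A-Z])"
    r"-(?P<plate>[A-Z0-9]{4})-(?P<center>\d{2})$"
)

# Partial lookup of sample type codes; extend as needed.
SAMPLE_TYPE_LABELS = {
    "01": "Primary Solid Tumor",
    "03": "Primary Blood Derived Cancer - Peripheral Blood",
    "11": "Solid Tissue Normal",
}


def parse_barcode(barcode):
    """Split a full TCGA aliquot barcode into a dict of its fields."""
    match = BARCODE_PATTERN.match(barcode)
    if match is None:
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    fields = match.groupdict()
    fields["sample_type_label"] = SAMPLE_TYPE_LABELS.get(fields["sample_type"], "other")
    return fields


if __name__ == "__main__":
    print(parse_barcode("TCGA-02-0001-01C-01D-0182-01"))
```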

Data source

Genomic Data Commons (GDC)

Data clean-up

The expression read count data from each of the 33 cancer types were then cleaned based on the quality metrics provided in the Merged Sample Quality Annotations file (merged_sample_quality_annotations.tsv) from the TCGA PanCanAtlas initiative webpage.

The exclusion criteria, deliberately stringent to minimise data variation due to unwanted factors, are listed in the table below. A sample was excluded if it met at least one of the criteria.

| Column | Inclusion | Exclusion |
|---|---|---|
| `patient_annotation`* | No comments | Issue (Observation or Notification) reported |
| `sample_annotation` | No comments | Issue (Notification*) reported |
| `aliquot_annotation` | No comments | Issue (CenterNotification) reported |
| `AWG_excluded_because_of_pathology`* (see also `AWG_pathology_exclusion_reason`**) | 0 | 1 |
| `Do_not_use` | 0 | 1 |

* Except for the LAML project, where the Notification is due to "Alternate sample pipeline: Biospecimens (...) were processed into analyte outside of the normal TCGA standardized laboratory pipeline"
** AWG - Analysis Working Group

In total, 1001 samples were excluded and 9438 samples remained for downstream analyses (see the TCGA projects summary table for per-project summary).
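The exclusion rule can be expressed as a small filter. The sketch below is one reading of the table, not the code used to produce the counts above. It assumes that an empty cell or `NA` means "No comments", that the two flag columns hold the literal strings `0` and `1`, and that the LAML exception applies to any annotation mentioning "Alternate sample pipeline". The `sample_id` column in the demo is hypothetical.

```python
import csv
import io

ANNOTATION_COLUMNS = ("patient_annotation", "sample_annotation", "aliquot_annotation")
FLAG_COLUMNS = ("AWG_excluded_because_of_pathology", "Do_not_use")
NO_COMMENT_VALUES = {"", "NA"}  # assumption: how "No comments" appears in the file
LAML_EXEMPT_TEXT = "Alternate sample pipeline"  # wording from the LAML footnote


def is_excluded(row, project):
    """Return True if the row meets at least one exclusion criterion."""
    for column in ANNOTATION_COLUMNS:
        note = (row.get(column) or "").strip()
        if note in NO_COMMENT_VALUES:
            continue
        # Assumed reading of the LAML exception: this notification is tolerated.
        if project == "LAML" and LAML_EXEMPT_TEXT in note:
            continue
        return True
    return any((row.get(column) or "").strip() == "1" for column in FLAG_COLUMNS)


def split_samples(rows, project):
    """Partition rows into (kept, excluded) lists."""
    kept, excluded = [], []
    for row in rows:
        (excluded if is_excluded(row, project) else kept).append(row)
    return kept, excluded


if __name__ == "__main__":
    # Tiny made-up table standing in for merged_sample_quality_annotations.tsv.
    header = "\t".join(("sample_id",) + ANNOTATION_COLUMNS + FLAG_COLUMNS)
    body = [
        "\t".join(("S1", "", "", "", "0", "0")),
        "\t".join(("S2", "Notification: example issue", "", "", "0", "0")),
        "\t".join(("S3", "", "", "", "0", "1")),
    ]
    reader = csv.DictReader(io.StringIO("\n".join([header] + body)), delimiter="\t")
    kept, excluded = split_samples(list(reader), project="BRCA")
    print("kept:", [r["sample_id"] for r in kept])
    print("excluded:", [r["sample_id"] for r in excluded])
```

Keeping the criteria in two tuples makes it easy to compare the filter against the table when the annotation file changes.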

Data source

Genomic Data Commons (GDC)

Pan-Cancer clinical data

The Pan-Cancer clinical data were integrated by the TCGA Pan-Cancer Clinical Data Resource and are described in the paper An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.

The TCGA Pan-Cancer Clinical Data Resource provides clinical annotations for TCGA data, together with recommendations for the use of clinical endpoints.

Data

It is strongly recommended that the published TCGA-CDR-SupplementalTableS1.xlsx file is used for clinical elements and survival outcome data to drive high quality analyses.

Data source

Genomic Data Commons (GDC)
