TCGA expression data

This repository summarises the TCGA expression data collated from various cancer types and the steps used to harmonise them.


Pan-Cancer normalised data

Normalised expression data across 33 cancer types, integrated by the TCGA Pan-Cancer Atlas (PanCanAtlas) initiative (see the paper The Cancer Genome Atlas Pan-Cancer analysis project), are available on the Genomic Data Commons (GDC) and the accompanying Cell paper webpages.

Data

The RNA batch corrected matrix file is EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv.

Data source

Genomic Data Commons (GDC)

GDC counts data

Follow the steps below to download the published TCGA expression files and convert the individual datasets into read count matrices ready for downstream analyses.

TCGA projects

Expression data from 33 cancer types (<projects>) are publicly available in TCGA. The individual TCGA projects are summarised in the projects summary table.

Data download

Download and prepare the TCGA expression read count data for each of the 33 cancer types (<project>, see TCGA projects summary table) using the TCGAbiolinks_transcriptome_profiling_data.R script described here. The TCGAbiolinks package is used to download the most recent data files by accessing the National Cancer Institute (NCI) GDC through its Application Programming Interface (API).

Rscript TCGAbiolinks_transcriptome_profiling_data.R --out_dir <project> --project_id TCGA-<project> --tissue 1 --workflow Counts

NOTE: for Acute Myeloid Leukaemia (LAML), the --tissue argument was set to 3 (Primary Blood Derived Cancer - Peripheral Blood).
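The command above has to be repeated for every project, with the LAML exception applied. The following is a minimal sketch of how the per-project commands could be assembled. It only builds and prints the argument lists and does not run them. The `PROJECTS` list is an illustrative subset, and the names `TISSUE_OVERRIDES` and `build_command` are hypothetical helpers, not part of this repository.

```python
# Sketch: build one download command per TCGA project (nothing is executed here).
# PROJECTS is an illustrative subset; the full list is in the projects summary table.
PROJECTS = ["BRCA", "LUAD", "LAML"]

# Value passed to --tissue: 1 by default, 3 for LAML (see the NOTE above).
TISSUE_OVERRIDES = {"LAML": 3}


def build_command(project):
    """Return the Rscript argument list for a single project."""
    tissue = TISSUE_OVERRIDES.get(project, 1)
    return [
        "Rscript", "TCGAbiolinks_transcriptome_profiling_data.R",
        "--out_dir", project,
        "--project_id", "TCGA-" + project,
        "--tissue", str(tissue),
        "--workflow", "Counts",
    ]


commands = {project: build_command(project) for project in PROJECTS}
for project, command in commands.items():
    print(" ".join(command))
```

To actually run the downloads, each argument list could be passed to `subprocess.run(command, check=True)`, assuming Rscript and the script are available on the path.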

Output data directory structure

The output data for each cancer type (<project>) follows the directory structure below:

|
|____<project>
  |
  |____transcriptome_profiling
    |
    |____Counts
      |____Counts.exp
      |____Counts_boxplot.pdf
      |____Counts_clinical_info.txt
      |____Counts_samples.txt
      |____gdc-client
      |____gdc-client_v1.1.0_OSX_x64.zip
      |____gdc_manifest.txt
      |____R_parameters.txt
      |____GDCdata
        |
        |____<project>
          |
          |____harmonized
            |
            |____Transcriptome_Profiling
            | |
            | |____Gene_Expression_Quantification
            |   |____…
            |   |____…
            |
            |____Clinical
              |
              |____Clinical_Supplement
                |____…
                |____…

The two key output files used for downstream analyses are:

| File | Description |
|------|-------------|
| Counts.exp | Read count data matrix |
| Counts_samples.txt | Combined samples annotation and associated clinical information |

TCGA barcodes, which encode the metadata of the TCGA participants and their samples, are described [here](TCGA barcodes).
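As an illustration of how the barcode metadata can be unpacked, the sketch below splits an aliquot-level TCGA barcode into its fields. It assumes the standard layout (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, centre). The regular expression and the `parse_barcode` helper are hypothetical and are not taken from the repository scripts.

```python
import re

# Aliquot-level layout: TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center
BARCODE_RE = re.compile(
    r"^(?P<project>TCGA)-(?P<tss>\w{2})-(?P<participant>\w{4})"
    r"-(?P<sample_type>\d{2})(?P<vial>[A-Z])"
    r"-(?P<portion>\d{2})(?P<analyte>[A-Z])"
    r"-(?P<plate>\w{4})-(?P<center>\d{2})$"
)


def parse_barcode(barcode):
    """Split a full aliquot barcode into a dict of its named fields."""
    match = BARCODE_RE.match(barcode)
    if match is None:
        raise ValueError("Not an aliquot-level TCGA barcode: " + barcode)
    return match.groupdict()


fields = parse_barcode("TCGA-02-0001-01C-01D-0182-01")
print(fields["participant"], fields["sample_type"])
```

The `sample_type` field is the two-digit sample type code, for example 01 for Primary Solid Tumor and 03 for Primary Blood Derived Cancer - Peripheral Blood, which appears to be what the --tissue argument selects.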

Data source

Genomic Data Commons (GDC)

Data clean-up

The expression read count data from each of the 33 cancer types were then cleaned based on the quality metrics provided in the Merged Sample Quality Annotations file merged_sample_quality_annotations.tsv from the TCGA PanCanAtlas initiative webpage.

The exclusion criteria, deliberately rigorous to minimise data variation due to unwanted factors, are listed in the table below. A sample was excluded if it met at least one of the criteria.

| Column | Inclusion | Exclusion |
|--------|-----------|-----------|
| patient_annotation\* | No comments | Issue (Observation or Notification) reported |
| sample_annotation | No comments | Issue (Notification\*) reported |
| aliquot_annotation | No comments | Issue (CenterNotification) reported |
| AWG_excluded_because_of_pathology\* (see also AWG_pathology_exclusion_reason\*\*) | 0 | 1 |
| Do_not_use | 0 | 1 |

\* Except for the LAML project, where the Notification is due to "Alternate sample pipeline: Biospecimens (...) were processed into analyte outside of the normal TCGA standardized laboratory pipeline"

\*\* AWG - Analysis Working Group


In total, 1001 samples were excluded and 9438 samples remained for downstream analyses (see the TCGA projects summary table for per-project summary).
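The exclusion rule ("drop a sample if any criterion is met") can be sketched as follows. This is an illustration under stated assumptions, not the script used to produce the numbers above: the rows are invented, the `aliquot_barcode` and `cancer_type` column names are assumptions, empty strings stand for "No comments", flags are taken to be 0/1 as in the table, and the LAML exception is applied to every annotation column for simplicity.

```python
import csv
import io

# Invented stand-in for merged_sample_quality_annotations.tsv (tab-separated).
# "aliquot_barcode" and "cancer_type" are assumed column names for illustration.
HEADER = ["aliquot_barcode", "cancer_type", "patient_annotation", "sample_annotation",
          "aliquot_annotation", "AWG_excluded_because_of_pathology", "Do_not_use"]
ROWS = [
    ["sample-A", "BRCA", "", "", "", "0", "0"],
    ["sample-B", "BRCA", "", "Notification: item flagged", "", "0", "0"],
    ["sample-C", "LUAD", "", "", "", "1", "0"],
    ["sample-D", "LAML", "Notification: Alternate sample pipeline", "", "", "0", "0"],
    ["sample-E", "LUAD", "", "", "", "0", "1"],
]
EXAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS)

ANNOTATION_COLUMNS = ("patient_annotation", "sample_annotation", "aliquot_annotation")
FLAG_COLUMNS = ("AWG_excluded_because_of_pathology", "Do_not_use")
LAML_ALLOWED = "Alternate sample pipeline"  # notification tolerated for LAML


def is_excluded(row):
    """Return True if the sample meets at least one exclusion criterion."""
    for column in ANNOTATION_COLUMNS:
        comment = row[column].strip()
        if not comment:
            continue  # "No comments" satisfies the inclusion condition
        if row["cancer_type"] == "LAML" and LAML_ALLOWED in comment:
            continue  # LAML exception from the table footnote
        return True
    return any(row[column].strip() == "1" for column in FLAG_COLUMNS)


reader = csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t")
kept = [row["aliquot_barcode"] for row in reader if not is_excluded(row)]
print(kept)
```

For the real file, the same `is_excluded` check could be applied to rows read with `csv.DictReader(handle, delimiter="\t")`, after confirming the actual column names and how empty annotations and flags are encoded.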

Data source

Genomic Data Commons (GDC)

Pan-Cancer clinical data

The Pan-Cancer clinical data were integrated by the TCGA Pan-Cancer Clinical Data Resource and are described in the paper An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics.

The TCGA Pan-Cancer Clinical Data Resource provides clinical annotations for TCGA data and recommendations on the use of clinical endpoints.

Data

It is strongly recommended that the published TCGA-CDR-SupplementalTableS1.xlsx file is used for clinical elements and survival outcome data to drive high quality analyses.

Data source

Genomic Data Commons (GDC)