This repository contains all the scripts to reproduce the results published in the article entitled "Detection of genes with a differential expression dispersion unravels the role of autophagy in cancer progression".
All the R scripts are stored in the src/ directory.
There are 3 types of scripts:
- function files
- analysis scripts
- figure scripts
The function files only contain the definitions of functions called by the analysis and figure scripts. Their names end with '-functions.R'.
The DD_analysis-functions.R file contains generic functions for the identification of differentially dispersed (DD) genes in RNA-seq datasets. The analysis of simulated and TCGA datasets requires this file. The simulations-functions.R file is dedicated to the analysis of simulated datasets and the TCGA-functions.R file is dedicated to the analysis of TCGA datasets.
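As a minimal sketch of how an analysis might load the function files it depends on, the helper below sources a list of files from src/ and stops early if one is missing. The helper name source_functions() and its src_dir argument are hypothetical and are not part of the repository; only the file names come from the description above.

```r
# Source the function files an analysis needs, in the order given.
# source_functions() is a hypothetical helper, not part of the repository.
source_functions <- function(files, src_dir = "src") {
  paths <- file.path(src_dir, files)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Function file(s) not found: ", paste(missing, collapse = ", "))
  }
  # source() evaluates each file in the global environment by default
  for (path in paths) source(path)
  invisible(paths)
}

# Simulated datasets: generic DD functions plus the simulation functions
# source_functions(c("DD_analysis-functions.R", "simulations-functions.R"))
# TCGA datasets: generic DD functions plus the TCGA functions
# source_functions(c("DD_analysis-functions.R", "TCGA-functions.R"))
```

The two calls are left commented out so that the block can be pasted anywhere; they are meant to be run from the repository root.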
The analysis and figure R scripts are divided into sections. The 'Parameters' section defines the parameters used for the analysis. The 'Analysis' section contains the commands that run the analysis or generate the figures published in the article and should not be modified when reproducing them.
The names of analysis scripts end with '-analysis.R' and the names of figure scripts end with '-figures.R'.
By default, all the files generated by the simulation analysis and figure scripts are stored in the directory './output/simulations/'.
The simulations-generateDatasets.R script generates simulated RNA-seq datasets based on the parameters provided in its 'Parameters' section. The outputs are stored in the '00-Data/' subdirectory.
The simulations-DD_analysis.R script identifies DD genes in simulated RNA-seq datasets using Levene's test, MDSeq, DiPhiSeq, GAMLSS and DiffDist, and evaluates the performance of these methods. It requires the functions defined in the DD_analysis-functions.R and simulations-functions.R files and the parameters provided in its 'Parameters' section. The outputs of the different methods are stored in the subdirectories '10-Levene/', '20-MDSeq/', '30-DiPhiSeq/', '40-GAMLSS/' and '50-DiffDist/', respectively, with one subdirectory per dataset. The subdirectory structure depends on the type of dataset (containing highly differentially expressed (DE) genes, or only lowly DE genes) and on the parameters used.
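To illustrate the idea behind the first of these methods, here is a minimal base R sketch of a Levene-type test for a single gene: a one-way ANOVA on the absolute deviations of each sample from its group centre. It is not the implementation in DD_analysis-functions.R. The helper name levene_p(), the use of the median as the centre (the Brown-Forsythe variant; pass center = mean for the original Levene statistic) and the toy expression values are all assumptions made for this example.

```r
# Levene-type test for one gene: one-way ANOVA on absolute deviations
# from the group centre. levene_p() is a hypothetical helper.
levene_p <- function(expr, group, center = median) {
  group <- factor(group)
  # Absolute deviation of each sample from the centre of its own group
  dev <- abs(expr - ave(expr, group, FUN = center))
  # p-value of the group effect on the deviations
  anova(lm(dev ~ group))[["Pr(>F)"]][1]
}

set.seed(1)
group <- rep(c("control", "case"), each = 30)
# Toy log-expression values: same mean, larger spread in the second group
expr <- c(rnorm(30, mean = 5, sd = 0.5), rnorm(30, mean = 5, sd = 2))
p_value <- levene_p(expr, group)
```

A small p_value suggests that the dispersion differs between the two groups even though the mean expression does not.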
The simulations-figures.R script generates Figures 1, 2 and 3 and supporting file S1 from the Levene's test, MDSeq, DiPhiSeq, GAMLSS and DiffDist output files generated by the simulations-DD_analysis.R script. These input files are read from subdirectories of the directory whose path is defined in the variable 'output_dir', and the figure files are written to that same directory.
By default, all the files generated by the TCGA analysis and figure scripts are stored in the directory './output/TCGA/'.
The TCGA-downloadDatasets.R script downloads TCGA RNA-seq datasets based on the parameters provided in its 'Parameters' section. The downloaded files are stored in the '00-Data/' subdirectory and are also available in this GitHub repository.
The TCGA-DD_analysis.R script identifies DD genes in a TCGA RNA-seq dataset using Levene's test, MDSeq, DiPhiSeq, GAMLSS and DiffDist. The name of the TCGA dataset is defined by the 'dataset' variable, and the input files are the output files of the TCGA-downloadDatasets.R script, stored in the directory whose path is defined in the variable 'dataset_input_dir'. The outputs of the different methods are stored in the subdirectories '10-Levene/', '20-MDSeq/', '30-DiPhiSeq/', '40-GAMLSS/' and '50-DiffDist/', respectively, with one subdirectory per dataset.
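The output layout described above can be sketched as follows. This is only an illustration: the helper make_output_dirs() is hypothetical, the placeholder dataset name "TCGA-XXXX" is not a real dataset, and nesting the dataset directory inside each method directory is an assumption about how "one subdirectory per dataset" is organised. The demonstration writes to a temporary directory rather than to './output/TCGA/'.

```r
# Method subdirectories named in the description above
method_dirs <- c(Levene = "10-Levene", MDSeq = "20-MDSeq", DiPhiSeq = "30-DiPhiSeq",
                 GAMLSS = "40-GAMLSS", DiffDist = "50-DiffDist")

# Create <output_dir>/<method>/<dataset>/ for every method (assumed layout).
# make_output_dirs() is a hypothetical helper, not part of the repository.
make_output_dirs <- function(output_dir, dataset) {
  paths <- file.path(output_dir, method_dirs, dataset)
  for (path in paths) dir.create(path, recursive = TRUE, showWarnings = FALSE)
  setNames(paths, names(method_dirs))
}

# Demonstration in a temporary directory; "TCGA-XXXX" is a placeholder name
output_dir <- file.path(tempdir(), "output", "TCGA")
dataset <- "TCGA-XXXX"
demo_paths <- make_output_dirs(output_dir, dataset)
```

Keeping the method names as the names of demo_paths makes it easy to look up where a given method writes, for example demo_paths[["MDSeq"]].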
The TCGA-GO_cluster.R script performs Gene Ontology (GO) term enrichment analysis for overdispersed genes in tumors (DD+) among lowly differentially expressed (DE) genes, highly upregulated genes in tumors (DE+) and highly downregulated genes in tumors (DE-), respectively, for all TCGA datasets defined in the 'datasets' variable. To ease comparison, representative terms of closely related GO terms are identified using semantic similarity and hierarchical clustering. This script requires the parameters provided in its 'Parameters' section and the functions defined in the TCGA-functions.R file. The input files are the Levene's test, MDSeq, DiPhiSeq, GAMLSS and DiffDist output files generated by the TCGA-DD_analysis.R script and stored in subdirectories of the directory whose path is defined in the variable 'output_dir'. The outputs are stored in the subdirectory '60-GO_cluster/'. The subdirectory '00-Gene-lists/' contains the identifiers of the genes identified by each of the different methods, grouped into categories according to differential expression or dispersion. The subdirectory '10-GO_analysis/' contains the outputs of the GO term cluster analysis for DD+, DE+ and DE- genes. The first page of the file '60-GO_cluster/10-GO_analysis/non-DE_DD+/10-GO_cluster/TCGA_TP_vs_NT_TMM_enrichGO_simplified_Rel_0_8_cluster_generic_terms_customplot.pdf' is Figure 6 of the article.
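The grouping of closely related GO terms can be pictured with a small base R sketch: a similarity matrix is converted to distances, clustered hierarchically, and one representative is picked per cluster. Everything specific here is an assumption for illustration: the five placeholder terms, the invented similarity values, average linkage, the 0.8 similarity cutoff and the rule for choosing a representative. The script itself may proceed differently.

```r
# Toy semantic similarity matrix between five placeholder GO terms.
# Values are invented for illustration; real values would come from a
# semantic similarity measure computed on the GO graph.
terms <- paste0("term_", LETTERS[1:5])
sim <- matrix(c(
  1.00, 0.90, 0.85, 0.20, 0.10,
  0.90, 1.00, 0.90, 0.30, 0.20,
  0.85, 0.90, 1.00, 0.20, 0.10,
  0.20, 0.30, 0.20, 1.00, 0.95,
  0.10, 0.20, 0.10, 0.95, 1.00
), nrow = 5, byrow = TRUE, dimnames = list(terms, terms))

# Turn similarity into distance and cluster hierarchically
tree <- hclust(as.dist(1 - sim), method = "average")

# Cut the tree at distance 0.2, i.e. keep merges with average similarity >= 0.8
clusters <- cutree(tree, h = 1 - 0.8)

# Representative of each cluster: the member with the highest mean
# similarity to the members of its cluster (ties go to the first term)
representatives <- sapply(split(names(clusters), clusters), function(members) {
  members[which.max(rowMeans(sim[members, members, drop = FALSE]))]
})
```

With this toy matrix the first three terms fall into one cluster and the last two into another, so two representative terms summarise five enriched terms.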
The TCGA-figures.R script generates Figures 4 and 5 and supporting file S2 from the Levene's test, MDSeq, DiPhiSeq, GAMLSS and DiffDist output files generated by the TCGA-DD_analysis.R script. These input files are read from subdirectories of the directory whose path is defined in the variable 'output_dir', and the figure files are written to that same directory.
The examples/ directory contains the parameter sections of scripts for the analysis of a simulated dataset and a TCGA dataset.
For more details about the analysis, please refer to the article entitled "Detection of genes with a differential expression dispersion unravels the role of autophagy in cancer progression".
If you have any questions, please contact chris.lepriol@gmail.com.