Project Overview: Statistical Analysis on Biochemical Recurrence Prostate Cancer Among Patients in 3 Locations
- Computed Patient Baseline Characteristics table and Kaplan-Meier Curve graphs regarding 3 locations for prostate cancer patients
- Proved probabilities of freedom from BCR among three locations were significantly different after stratification by conducting Stratified Logrank Test
- built a AIC Stepwise Cox’s proportional hazards regression model with imputation and interaction terms, obtaining which demographic factors and genes have effects on survival
- Found which factors and genes have greater effects on survival using Random Survival Forest method
- Data Source: 3 datasets are from built-in libraries: library(prostateCancerCamcap), library(prostateCancerTaylor), library(prostateCancerStockholm)
- Language Used: R
- Libraries Used: ggplot2, tidyr, devtools, dplyr, party, survival, tcltk, tcltk2, forcats, compareGroups, rms, npsurv, MASS, survminer, randomForestSRC
- To investigate whether the target genes as well as demographic factors have effects on biochemical recurrence(BCR) free time among prostate cancer patients in three public datasets, Taylor, Cambridge and Stockholm (see Section 4)
- To test if BCR free times are significantly different among three datasets (see Section 2 and Section 3)
- To investigate which factors and genes have greater effects on survival using Random Survival Forests (Section 7)
Biochemical recurrence(BCR) continues to occur in a large proportion of prostate cancer patients. Yet, established clinical predictors(PSA, Gleason score) often provide poor prognosis. It’s noticeable that more research is now focusing on the effect of genes such as integrated genomics and gene signatures on prostate cancer BCR.
Some main cleaning I did as follows:
- Cleaned 3 datasets (3 locations) respectively to prepare them in the same format for horizontal join such as
- extracting Gene expressions, survival and demographic variables
- making variables the same names with the same data types for the 3 datasets
- dropping the rows having NAs in Time and Event indicator
- Selected variables in the same order for all datasets and fully joined them horizontally
- Created new variables such as 'dataset' with 3 levels (Cam, Taylor and stoch) to distinguish data from 3 different locations in the fully joined dataset
- Cleaned categorical variables with multiple levels such as combining values with the same meaning to one level
After cleaning, we learned the following from the final joined dataset:
-
Columns included:
- Age: patient’s age
- Time: patient’s biochemical recurrence(BRC) free time
- PSA: patient’s PSA(prostate-specific antigen) level
- Event: two levels, 0 (right censoring), 1 (relapse)
- Stage: it measures how far the cancer has spread. Nine levels
- iCluster: seven levels
- Gleason: Gleason Score. Six levels
- ECE: extra-capsular extension. Two levels, Yes and No
- 23 columns for 23 target genes: SLC2A1-14, HK1-3, GCK, OAX1, ENO2, NCAM1, SYP, CHGA, FOLH1, KLK3
-
Total number of patients (rows): 343 patients. 111 from Cam dataset, 92 from Stoch dataset and 140 from Taylor dataset.
-
Factors that were missing or contained missing values, which were imputed later when fitting some Cox's proportional hazards regression models in Section 4:
- Factor—“Age” was unavailable in both Stoch and Taylor datasets
- Factors—“PSA” and “ECE” were unavailable in Taylor dataset.
- 2 NAs in “Stage”, 14 NAs in “iCluster” and 4 NAs in “Gleason”. Also, we assumed the deleted missing values in “Time” and “Event” were random distributed.
After cleaning and joining the three datasets(Taylor, Cambridge and Stockholm), I conducted univariate analysis among three datasets as a table (Table 1) using compareGroups library. Below is part of the table.
- KM curves among three datasets (Graph 1)
It’s clear from Graph 1 that patients in dataset—Stoch have the highest survival rate before BCR free time turns 90 months. The patients in dataset—Cam have the second highest survival rate until BCR free time turns 60 months, after which the survival rate plumps to 0 when the BCR free time is about 67 months. The patients in dataset—Taylor have the lowest survival time at the beginning and becomes the second in the end. Even though the curves have 95% CIs, it’s still hard to tell if they are significantly different among three datasets.
- KM curves among different Gleason Scores (Graph 2)
From Graph 2, it’s uneasy to tell if the survival rates of patients having different Gleason Scores are significantly different. Also, not enough observations for Gleason Scores are 5 and 10.
It’s uneasy to see whether the statistically difference exists among three datasets from Graph 1 since the KM curves cross each other and so on, so conducting a logrank test is a good way to investigate if the probabilities of freedom from BCR among three datasets are significantly different.
Also, a cofounder might exist. After investigation, I found “iCluster” was a cofounder. Firstly, “iCluster” was noticed to be related to the survival under a logrank test. Secondly, it was associated with “dataset” under the Pearson's Chi-squared test.
Therefore, I conducted a logrank test stratified by "iCluster" among the three datasets. From the R output below, we can see the survival rates among these three datasets are significantly different after being stratified by “iCluster”.
survdiff(formula = Surv(Time, Event) ~ dataset + strata(iCluster),
data = datacamtaylorstock, rho = 0)
n=329, 14 observations deleted due to missingness.
N Observed Expected (O-E)^2/E (O-E)^2/V
dataset=Cam 104 18 13.1 1.840 2.83
dataset=stoch 85 41 51.1 2.010 7.73
dataset=taylor 140 36 30.8 0.889 2.21
Chisq= 8 on 2 degrees of freedom, p= 0.02
Imputation of missing values:
Since “Age”, “PSA” and “ECE” were unavailable in all three datasets, I deleted them. Besides, there were still 2 missing values in “Stage”, 14 Missing values in “iCluster” and 4 missing values in “Gleason”. I imputed all the NA’s in “iCluster” and “Gleason” using Random Forest algorithm and delete the 2 rows with NA’s in “Stage”.
In order to know which demographic factors and genes have influence on BCR free time, I fit several Cox’s proportional hazards regression models using AIC Stepwise method such as the 2 models below:
- Model 1: AIC Stepwise Cox’s model using the pooled dataset with imputation: Likelihood ratio test=114.8 on 20 df, p=3e-15
Surv(Time, Event, type = "right") ~ iCluster + Gleason + ENO2 + SYP + CHGA + SLC2A1 + SLC2A6 + SLC2A12 + SLC2A13 + SLC2A14 + HK1 # the model after AIC stepwise Likelihood ratio test=114.8 on 20 df, p=3e-15 n= 341, number of events= 98
- Model 2: adding interaction terms between “dataset” and each gene: Likelihood ratio test=182.5 on 55 df, p=1e-15
Surv(Time, Event, type = "right") ~ iCluster + Gleason + dataset + ENO2 + NCAM1 + SYP + CHGA + FOLH1 + SLC2A1 + SLC2A3 + SLC2A4 + SLC2A5 + SLC2A6 + SLC2A7 +SLC2A8 + SLC2A10 + SLC2A12 + GCK + HK1 + dataset:NCAM1 + dataset:SYP + dataset:CHGA + dataset:FOLH1 + dataset:SLC2A1 + dataset:SLC2A3 + dataset:SLC2A5 +dataset:SLC2A6 + dataset:SLC2A7 + dataset:SLC2A8 + dataset:SLC2A12 + dataset:GCK + dataset:HK1 #26(13*2) interaction terms stay in the model
Likelihood ratio test=182.5 on 55 df, p=1e-15
n= 341, number of events= 98
I used Random Survival Forest technique to find which factors and genes have greater effects on survival since it applies to right-censored data. I grew 400 bootstrapped trees and implemented log-rank splitting.
- Graph below shows the Leave-one-out cross validation error drops fast and then stables at around 29% as the number of trees increases.
- We can see the importance of each variable with ranking below. “Gleason” has the greatest effect on survival followed by genes—FOLH1, SLC2A1, SYP, SLC2A8 and so forth.