Built a shiny app that use the logistic model below and pateint details provided by the user to predict the factors that will influence the patient's survival.
The main objective of this study is to determine the factors that are critical for survival among Cardiovascular Heart Disease (CHD) Patients.
- I created a Logistic Regression model in R using RStudio to predict the survival of CHD patients.
- Extracted the data from the PHIS database using SQL; built tableau dashboards.
The model predicted the survival of CHD with an AUC of over .90 and indicated that a higher EF % is a major contributing factor.
- The study is based on 299 patients of Cardiovascular Heart Disease (CHD) comprising of 105 women and 194 men.
- All the patients were more than 40 years old, having left ventricular systolic dysfunction and falling in NYHA class III and IV.
- Cardiovascular Heart Disease (CHD) was diagnosed by cardiac echo report or notes written by physician.
- The information related to risk factors were taken from blood reports while smoking status and blood pressure were taken from physician's notes.
with the readr library
library(readr)
survival <- read_csv("S1Data_fixed.csv")
attach (survival)
head(survival)
index | TIME | Survival | Gender | Smoking | Diabetes | BP | Anaemia | Age | EF | Sodium | Serum Creatinine | platelets | CPK |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 97 | 1 | 0 | 0 | 0 | 0 | 1 | 43.0 | 2 | 1 | 1 | 1 | 2 |
1 | 180 | 1 | 1 | 1 | 1 | 0 | 1 | 73.0 | 0 | 3 | 1 | 0 | 1 |
Response variable:
Survival - Cardiovascular Heart Disease (CHD) Pateints that were still alive after the follow up period by the physicians. If the patient was alive by the follow up date: 1 and if not: 0.
Predictor variables:
-
Age, serum sodium, serum creatinine, gender, smoking, Blood Pressure (BP), Ejection Fraction (EF), anemia, platelets, Creatinine Phosphokinase (CPK) and diabetes.
Continuos predictors
-
Age & Time
Categorical preditcors
-
EF, serum sodium, platelets, CPK and Serum Creatinine
Predictor Descriptions
-
Time is follow up time was 4 to 285 days with an average of 130 days.
-
Gender was divided into two levels ( Male coded as 1, Female coded as 0)
-
EF was divided into three levels (i.e. EF<30 (coded as 0), 30<EF<45 (coded as 1) and EF>45 (coded as 2)) and platelets was also divided into three level on the basis of quartiles.
-
Sodium was also coded into three levels based on the quartiles
-
Serum Creatinine greater than its normal level (1.5) is an indicator of renal dysfunction. Its effect on mortality was studied as creatinine >1.5 (coded as 0) vs <1.5 (coded as 1).
-
Anemia in patients was assessed by their haematocrit level. The patients with haematocrit less than 36 (minimum normal level of haematocrit) were taken as anemic.
Removing Null Values
na.omit(survival)
Skewness and Correlation Analysis:
- Histograms (ggplot & tidyr library)
library (ggplot2)
library (tidyr)
ggplot(gather(survival), aes(value)) +
geom_histogram(bins = 8) +
facet_wrap(~key, scales = 'free_x')
There is some skewness in the data, however, the plot shows that all the predictor variables are not exteremely right or left skewed. It is also noticeable from the plot that all the varaibles are catagorical variales expect for Age and Time.
- Scatterplot Matrix
pairs (survival[,0:7], col= ifelse (Survival==1, "green", "red"))
pairs (survival[,8:13], col= ifelse (Survival==1, "green", "red"))
These plots further emphasize the minimum skewness in the data and the categorical sense of the majority of the variables.
- Correlation Matrix (using corplot library)
transcorr = cor(survival [, c(1:10, 10, 12:13)], use= "complete.obs")
library("corrplot")
corrplot (transcorr, method = "number", number.cex=0.6)
The correlation matrix shows that the varaible Time and surviaval are higly correlated (0.53). Gender and Smoking have a high correlation as well(0.45). Survaial and Serum Creatinine appears to have a moderate correlation(0.37).
- Odd Tables (using xtabs in the MASS library)
- Survival vs EF
library(MASS)
table1= xtabs(~Survival+EF)
table1
EF 0 | EF 1 | EF 2 | |
---|---|---|---|
Survival 0 | 51 | 31 | 14 |
Survival 1 | 42 | 115 | 46 |
93 patients had an Ejection Fraction of less than 30 %, while 146 have an Ejection Fraction of 30 to 45 % and only 60 have an Ejection Fraction higher than 45% . Of all the patients that had an ejection fraction less than 30 %, 55 % died of cardio-vascular heart disease. That is 32% more than the patients who died that had an ejection fraction higher than 45%. Patients with an ejection fraction between 30 and 45 % had the highest survival rate at 79 %. You have a higher chance of dying if your ejection fraction is less than 30 %.
- Survival vs Gender
library(MASS)
table2= xtabs(~Survival+Gender)
table2
Female | Male | |
---|---|---|
Survival 0 | 34 | 62 |
Survival 1 | 71 | 132 |
Odds of surviving:
- Male 132/194 = 0.680
- Female 71/105 = 0.676
- Odds Ratio for Male VS Female: 0.680/0.676 = 1.0059
Female patients are 1.0059 times more likely to survive cardio-vascular heart diseases than male patients. This essentially means the chances of survival for male and female patients is approximately 1 to 1.
- Survival vs Diabetes
library(MASS)
table2= xtabs(~Survival+Diabetes)
table2
Diabetes 0 | Diabetes 1 | |
---|---|---|
Survival 0 | 56 | 40 |
Survival 1 | 118 | 85 |
Odds of not surviving:
- Diabetic 40/125 = 0.32
- Non-diabetic 56/174 = 0.32
- Odds Ratio for Diabetic VS Non-Diabetic: 0.32/0.32 = 1
This essentially means the chances of not surviving for Diabetic and non-Diabetic patients is 1 to 1.
- Survival vs Sodium
library(MASS)
table2= xtabs(~Survival+Sodium)
table2
Sodium 0 | Sodium 1 | Sodium 2 | Sodium 3 | |
---|---|---|---|---|
Survival 0 | 42 | 24 | 19 | 11 |
Survival 1 | 41 | 70 | 61 | 31 |
Urinary Sodium to creatine ratio or Sodium levels 0 to 3 represent the quartiles of the collected Sodium ratios, 0 being less 25th quartile and 3 being greater than 75th quartile. We can see that when the patient's Sodium to Creatine ratio is less than the 25th quartile they have less than 50% chance of surving cardio-vascular heart disaese. While pateints with Sodium to Creatine ratio greater than the 25th quartile all have higher than 70% chance of surviving cardio-vascular heart disease.
Fitting
order1_fit1 <- glm (Survival ~ as.factor(EF) + `Serum Creatinine` + Diabetes + TIME + Gender + Smoking + BP + Anaemia + Age + Sodium + platelets + CPK, family = binomial)
summary(order1_fit1)
Deviance Residuals:
Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
-2.4184 | -0.4308 | 0.1899 | 0.5562 | 2.7121 |
Coefficients:
Estimate | Std. Error | z value | Pr(>abs(z)) | Significance | |
---|---|---|---|---|---|
(Intercept) | -1.358485 | 1.308658 | -1.038 | 0.299235 | |
as.factor(EF)1 | 1.836955 | 0.422611 | 4.347 | 1.38e-05 | *** |
as.factor(EF)2 | 1.857695 | 0.518107 | 3.586 | 0.000336 | *** |
Serum Creatinine | 1.678121 | 0.429314 | 3.909 | 9.27e-05 | *** |
Diabetes | -0.192139 | 0.359440 | -0.535 | 0.592961 | |
TIME | 0.022433 | 0.003194 | 7.024 | 2.15e-12 | *** |
Gender | 0.553463 | 0.436964 | 1.267 | 0.205295 | |
Smoking | -0.135759 | 0.430672 | -0.315 | 0.752590 | |
BP | -0.009470 | 0.368440 | -0.026 | 0.979495 | |
Anaemia | -0.049583 | 0.379851 | -0.131 | 0.896146 | |
Age | -0.046199 | 0.016617 | -2.780 | 0.005432 | ** |
Sodium | 0.292238 | 0.184413 | 1.585 | 0.113036 | |
platelets | 0.114345 | 0.165775 | 0.690 | 0.490344 | |
CPK | -0.397067 | 0.161646 | -2.456 | 0.014034 | * |
Signifance codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 375.35 on 298 degrees of freedom Residual deviance: 209.93 on 285 degrees of freedom
Getting Coeffiecents for the fit
exp (order1_fit1$coefficients)
(Intercept) | as.factor(EF)1 | as.factor(EF)2 | Serum Creatinine | Diabetes | TIME | Gender | Smoking | BP | Anaemia | Age | Sodium | platelets | CPK | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Coefficients: | 0.2570498 | 6.2773944 | 6.4089462 | 5.3554847 | 0.8251924 | 1.0226867 | 1.7392656 | 0.8730532 | 0.9905751 | 0.9516266 | 0.9548522 | 1.3394224 | 1.1211391 | 0.6722891 |
Getting Confidence Intervals
exp (confint (order1_fit1))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 0.01872896 | 3.2432977 |
as.factor(EF)1 | 2.80878250 | 14.8482312 |
as.factor(EF)2 | 2.40283407 | 18.5011624 |
Serum Creatinine | 2.35640903 | 12.7988731 |
Diabetes | 0.40569020 | 1.6706351 |
TIME | 1.01669786 | 1.0295687 |
Gender | 0.74725273 | 4.1808057 |
Smoking | 0.37153732 | 2.0267428 |
BP | 0.48198078 | 2.0566115 |
Anaemia | 0.45160198 | 2.0164303 |
Age | 0.92319656 | 0.9856701 |
Sodium | 0.93674051 | 1.9377001 |
platelets | 0.81029581 | 1.5569596 |
CPK | 0.48416180 | 0.9155519 |
The model shows that pateints having an ejection fraction between 30 to 45 % and higher 45 % are significantly different from patients who have an ejection fraction of less than 30 %. The model also shows that Serum Creatinine, Time, Age and CPK are significant predictors for predicting the survival of cardio-vascular heart disease. For every unit increase in Serum Creatinine the odds of surviving cardio-vascular disease go up by 5.3554847-folds. The odds of surviving cardio-vascular heart disease for male pateints vs female pateints is not only 1 to 1 as we predicted earlier but essentially ranges from 0.74725273 to 4.1808057 -folds.
Jittered Actuals vs Predicted logits
order1_fit1.logit = predict (order1_fit1)
plot (jitter (Survival, 0.2) ~ order1_fit1.logit, data=survival)
pihat.order1_fit1 = predict (order1_fit1, type='response')
pihat.ord = order (pihat.order1_fit1)
lines (order1_fit1.logit [pihat.ord], pihat.order1_fit1 [pihat.ord])
lines (lowess (order1_fit1.logit [pihat.ord], pihat.order1_fit1 [pihat.ord]), col='red', lty=2)
The jittered response vs. predicted values with the fitted logistic curve and a lowess fit shows that the model fits the data aaproximately well since the solid and the jittered line are approximately similar in shape.
Residuals Plots
plot(predict(order1_fit1),residuals(order1_fit1))
abline(h=0,lty=1,col="brown")
par (mfrow=c(1,2))
plot (order1_fit1, which=c(5))
The linaerity condition is statisfied because the plot is flat and right on zero. There are no points with high leverage or high Cook's Distance therefore those influence measures do not appear on the plot. The point index 240 has a standerized residual around 6. We will see whether it is an influential point.
Pull out the cases with high leverage and large residuals.
survival$Residual = round (order1_fit1$residuals, 4)
survival$leverage = round (hatvalues(order1_fit1), 4)
survival$rstudent = round (rstudent(order1_fit1), 4)
high.levg.resd = survival [survival$leverage > 3*(order1_fit1$rank) / (order1_fit1$rank + order1_fit1$df.residual) |abs (order1_fit1$residuals) > 30 , ]
high.levg.resd[order(-high.levg.resd$rstudent),][c(1,9:17)]
TIME | EF | Sodium | Serum Creatinine | platelets | CPK | index | Residual | leverage | rstudent | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 30 | 0 | 1 | 0 | 3 | 3 | 240 | 39.6 | 0.0131 | 2.81 |
Index 240 has a large residual value, however, its 0.0131 leverage value is not bigger than the Leverage Cutoff value at 0.13043478 (3(k+1)/n). Therefore, this point is an influential point. We will see whether there are any other influential points that do not have high residuals as index 240.
high.Leverage = survival [survival$leverage > 0.13043478 & is.na(survival$leverage) == FALSE, ]
high.Leverage[order(-high.Leverage$leverage),][c(1,9:17)]
TIME | EF | Sodium | Serum Creatinine | platelets | CPK | index | Residual | leverage | rstudent | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 94 | 2 | 2 | 0 | 3 | 0 | 242 | 2.55 | 0.134 | 1.45 |
2 | 172 | 0 | 3 | 0 | 3 | 3 | 222 | -1.65 | 0.132 | -1.05 |
There are only 2 points with leverage values higher than the Leverage Cutoff value at 0.13043478 (3(k+1)/n). The point index 242 has the highest leverage value of 0.134.
car::vif(order1_fit1)
GVIF | Df | GVIF^(1/(2*Df)) | |
---|---|---|---|
as.factor(EF) | 1.328792 | 2 | 1.073654 |
Serum Creatinine | 1.166354 | 1 | 1.079979 |
Diabetes | 1.044061 | 1 | 1.021793 |
TIME | 1.280348 | 1 | 1.131525 |
Gender | 1.464522 | 1 | 1.210174 |
Smoking | 1.343264 | 1 | 1.158993 |
BP | 1.063805 | 1 | 1.031409 |
Anaemia | 1.172183 | 1 | 1.082674 |
Age | 1.208652 | 1 | 1.099387 |
Sodium | 1.122267 | 1 | 1.059371 |
platelets | 1.088017 | 1 | 1.043081 |
CPK | 1.254166 | 1 | 1.119896 |
All of the VIF are less than 5. Hence, multi-colinearity is low in our model. We will be doing stepwise regresion to find the best model.
step.AIC = step (order1_fit1, direction = "both")
summary(step.AIC)
beta2 = coefficients(step.AIC)
exp (beta2)
exp (confint (step.AIC))
step.BIC = step(order1_fit1, direction = "both", k=log (order1_fit1$rank + order1_fit1$df.residual))
summary(step.BIC)
anova ( step.BIC, step.AIC )
pchisq(2.0858, 1,lower.tail = FALSE)
Analysis of Deviance Table
Model | Resid. Df | Resid. Dev | Df | Deviance |
---|---|---|---|---|
AIC | 292 | 214.42 | ||
BIC | 291 | 212.34 | 1 | 2.0858 |
ANOVA model pvalue: 0.1486743 |
We choose the AIC model to do the interaction effects on becuase it has significant predictors and has a lower residual deviance of 214.42 Residual Deviance on 292 degrees of freedom. The ANOVA pvalue (0.1486743) which is less than 0.05 (the typical cutoff).
# Center quantitative predictors and add interaction effects
my.center = function (y) y - mean (y)
survival$TIME.c = my.center (survival$TIME)
survival$Age.c = my.center (survival$Age)
fit1_InteractionEffects = glm (Survival ~ (as.factor(EF) + TIME.c + CPK + `Serum Creatinine` + Age.c)^2,
data=survival,
family = binomial)
summary (fit1_InteractionEffects)
Step-wise with AIC
step.AIC_final = step (fit1_InteractionEffects, direction = "both")
summary(step.AIC_final)
Step-wise with BIC
step.BIC_final = step (fit1_InteractionEffects, direction = "both",k=log (fit1_InteractionEffects$rank + fit1_InteractionEffects$df.residual))
summary(step.BIC_final)
anova (step.AIC_final, step.BIC_final)
pchisq(11.949, 3, lower.tail = FALSE)
Analysis of Deviance Table
Model | Resid. Df | Resid. Dev | Df | Deviance |
---|---|---|---|---|
Final AIC | 289 | 202.47 | ||
Final BIC | 292 | 214.42 | -3 | -11.949 |
ANOVA model pvalue: 0.007559923 |
We choose AIC model as final becuase it's Residual Deviance is 202.47 which less than the Residual Deviance 214.42 of the BIC model. The pvalue (0.0075599) is less than 0.05 which means that the two models are statistically differnt from each other.
final_model_1 = step.AIC_final
summary(final_model_1)
Deviance Residuals:
Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
-2.71103 | -0.49728 | 0.08546 | 0.50304 | 2.33555 |
Coefficients:
Estimate | Std. Error | z value | Pr(>abs(z)) | Significance | |
---|---|---|---|---|---|
(Intercept) | -0.683582 | 0.423220 | -1.615 | 0.10627 | |
as.factor(EF)1 | 2.555116 | 0.564167 | 4.529 | 5.93e-06 | *** |
as.factor(EF)2 | 1.955026 | 0.694497 | 2.815 | 0.00488 | ** |
TIME.c | 0.008115 | 0.005224 | 1.553 | 0.12032 | |
CPK | -0.401538 | 0.156240 | -2.570 | 0.01017 | * |
Serum Creatinine | 1.924515 | 0.443882 | 4.336 | 1.45e-05 | *** |
Age.c | -0.050382 | 0.016646 | -3.027 | 0.00247 | ** |
as.factor(EF)1:TIME.c | 0.022274 | 0.008176 | 2.724 | 0.00644 | ** |
as.factor(EF)2:TIME.c | 0.008078 | 0.009461 | 0.854 | 0.39321 | |
TIME.c:Serum Creatinine | 0.010449 | 0.006271 | 1.666 | 0.09565 | . |
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
categorize = function (x, quantiles=(1:3)/4) {
cutoffs = quantile (x, quantiles)
n.cutoffs = length (cutoffs)
result = rep ("C1", length (x))
for (j in 1:n.cutoffs) {
result [x > cutoffs [j]] = paste ("C", j+1, sep="")
}
return (result)
}
library (ggplot2)
qplot (TIME, predict (final_model_1), data=survival, color=categorize (EF)) + geom_smooth (method="lm")
qplot (`Serum Creatinine`, predict (final_model_1), data=survival, color=categorize (TIME)) + geom_smooth (method="lm")
The interaction plot between time and Ejection Fracton shows that the predicted logits of suriving cardio-vascular disease increases over time for all ejection fraction (EF) categories(EF<30 (coded as 0), 30<EF<45 (coded as 1)). Patients with 30<EF<45 have the highest rise in the predicted logits of suriving cardio-vascular disease over time. We had mentioned earlier with our frequency table that patients with an ejection fraction (EF) between 30 and 45 % had the highest survival rate at 79 %. However, with this interaction plot with time we see that statement holds true only for patients whose follow up time was longer than 100 days. We see that for patients with a follow up time of less than 25 days, they have lowest predicted logits of suriving cardio-vascular disease.
The interaction plot between Serum Creatinine and time shows the predicted logits of surviving cardio-vascular disease increases over time because C1 (25th quartile of follow time) is the lowest and C4 (75th quartile of follow time) is the highest in the graph. This effect was already accounted for in the previuos interaction plot. Serum Creatinine was coded as follows, >1.5(indicator of renal dysfunction coded as 0) and < 1.5(normal level coded as 1). We can see from the positive slope of all four of the TIME graphs that all patients with renal dysfunction (0) have lower predicted logits of surviving than those who do not (1). All patients with follow up time in C4 (75 quartile),regardless of whether they have renal dysfuntion, have higher predicted logits of suriving cardio-vascular disease than all other patients in C1,2 & 3, regardless of whether those patients do not have renal dysfuntion. Patients with follow up time in C4 (75 quartile), also have the most noticeable change in the predicted logits for surviving cardio-vascular disease between patients with renal dysfunction(0) and patients without (0).
Get Confidence Intervals
exp (confint (final_model_1))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | 0.2156857 | 1.1469405 |
as.factor(EF)1 | 4.6119614 | 43.0804413 |
as.factor(EF)2 | 2.0696061 | 33.5811069 |
TIME.c | 0.9980162 | 1.0189091 |
CPK | 0.4877742 | 0.9025785 |
Serum Creatinine | 2.9247387 | 16.8167300 |
Age.c | 0.9191504 | 0.9814366 |
as.factor(EF)1:TIME.c | 1.0075532 | 1.0408464 |
as.factor(EF)2:TIME.c | 0.9911966 | 1.0294466 |
TIME.c:Serum Creatinine | 0.9981325 | 1.0231982 |
An increase in a unit of Creatinine Phosphokinase (CPK) decreases a patients of odds of surviving by 0.331-folds. A unit increase in Age decreases a patients odds of surviving by a range of 0.0808 to 0.01856-folds.
preds = predict (final_model_1, se.fit = T)
pred.df = cbind.data.frame (survival, as.data.frame (preds))
pred.df$lwr = pred.df$fit - 1.96 * pred.df$se.fit
pred.df$upr = pred.df$fit + 1.96 * pred.df$se.fit
pred.df$fit.pr = round (exp (pred.df$fit) / (1 + exp (pred.df$fit)), 3)
pred.df$lwr.pr = round (exp (pred.df$lwr) / (1 + exp (pred.df$lwr)), 3)
pred.df$upr.pr = round (exp (pred.df$upr) / (1 + exp (pred.df$upr)), 3)
head(pred.df)
Prediction for Patients over 70
# Predictions for patients in their 70s and older.
pred.df[c(2, 17, 88, 48, 40),c(1,2,3,8,17:20)]
TIME | Survival | Gender | Age | rstudent | TIME.c | Age.c | fit | |
---|---|---|---|---|---|---|---|---|
2 | 180 | 1 | 1 | 73 | 0.6080 | 49.7391304 | 12.16611 | 1.1497924 |
17 | 28 | 0 | 1 | 85 | -0.5314 | -102.2608696 | 24.16611 | -2.4535981 |
88 | 130 | 0 | 0 | 80 | -1.3707 | -0.2608696 | 19.16611 | 1.6151530 |
48 | 109 | 0 | 1 | 80 | -1.4417 | -21.2608696 | 19.16611 | 0.7575546 |
40 | 24 | 0 | 0 | 95 | -0.7606 | -106.2608696 | 34.16611 | -2.2647920 |
Predictions for patients in their 40s and 50s
pred.df[c(103, 66, 83, 21, 11),c(1,2,3,8,17:20)]
TIME | Survival | Gender | Age rstudent | TIME.c | Age.c | fit | |
---|---|---|---|---|---|---|---|
103 | 61 | 0 | 1 | 45 | -2.1708 | -69.26087 | -15.833893 |
66 | 11 | 0 | 1 | 45 | -0.5346 | -119.26087 | -15.833893 |
83 | 196 | 0 | 0 | 54 | -2.1218 | 65.73913 | -6.833893 |
21 | 192 | 1 | 1 | 50 | 0.1934 | 61.73913 | -10.833893 |
11 | 119 | 1 | 1 | 50 | 0.4907 | -11.26087 | -10.833893 |
par (mfrow=c(1,2))
plot (jitter (Survival, 0.2) ~ predict (final_model_1), data=survival)
logit.9 = predict (final_model_1)
ord9 = order (logit.9)
pihat.9 = exp (logit.9) / (1 + exp (logit.9))
lines (logit.9 [ord9], pihat.9 [ord9])
# Lowess fit
lines (lowess (logit.9, survival$Survival), col='red', lty=2)
plot (final_model_1, which=1)
pchisq (final_model_1$null.deviance - final_model_1$deviance,
final_model_1$df.null - final_model_1$df.residual, lower.tail = F)
pvalue = 1.553978e-32
The pvalue (1.553978e-32) is less than 0.05, therfore we reject the null hypothesis and conclude that at least one of our parameters is different from zero.
# Goodness of fit test
pchisq (final_model_1$deviance, final_model_1$df.residual, lower.tail = F)
pvalue = 0.9999696
The p_value(0.9999696) is bigger than (0.05), suggesting that there is no significant lack of fit in the model.
plot(final_model_1,which = 5)
# ROC Curve for model final_model_1
# ROC curve - install package ROCR
par (mfrow=c(1,1))
library(ROCR)
pred1 <- prediction(final_model_1$fitted.values, final_model_1$y)
perf1 <- performance(pred1,"tpr","fpr")
auc1 <- performance(pred1,"auc")@y.values[[1]]
auc1
plot(perf1, lwd=2, col=2)
abline(0,1)
legend(0.6, 0.3, c(paste ("AUC=", round (auc1, 4), sep="")), lwd=2, col=2)
# Extract the X and Y values from the ROC plot, as well as the probability cutoffs
roc.x = slot (perf1, "x.values") [[1]]
roc.y = slot (perf1, "y.values") [[1]]
cutoffs = slot (perf1, "alpha.values") [[1]]
auc.table = cbind.data.frame(cutoff=pred1@cutoffs,
tp=pred1@tp, fp=pred1@fp, tn=pred1@tn, fn=pred1@fn)
names (auc.table) = c("Cutoff", "TP", "FP", "TN", "FN")
auc.table$sensitivity = auc.table$TP / (auc.table$TP + auc.table$FN)
auc.table$specificity = auc.table$TN / (auc.table$TN + auc.table$FP)
auc.table$FalsePosRate = 1 - auc.table$specificity
auc.table$sens_spec = auc.table$sensitivity + auc.table$specificity
# Find the row(s) in the AUC table where sensitivity + specificity is maximized
auc.best = auc.table [auc.table$sens_spec == max (auc.table$sens_spec),]
auc.best
# Plot the maximum point(s) on the ROC plot
points (auc.best$FalsePosRate, auc.best$sensitivity, cex=1.3)
The AUC is 0.916, meaning that the model correctly predicted for 91.6% of the cases in our dataset. This suggests that our final model predicts surviving cardio-vascular disease better than simply guessing.
Actual Positive | Actual Negative | |
---|---|---|
Predicted Positive | TP 171 | FP 15 |
Predicted Negative | FN 32 | TN 81 |
Cutoff | sensitivity | specificity | FalsePosRate | sens_spec |
---|---|---|---|---|
TP/(TP+FN) | TN/(TN+FP) | 1-specificity | sensitivity+specificity | |
0.6412728 | 0.8423645 | 0.84375 | 0.15625 | 1.686115 |
Our cutoff for determining predictions of failure or no failure was 0.6412. This has a false positive rate of 0.156 and a sensitivity (probability of a true positive prediction) of 0.84.
We can conclude that, among the predictor variables from our dataset, the presence of Ejection Fraction, Time, Creatinine Phosphokinase (CPK), Serum Creatinine and Age were helpful in explaining patients' survival Cardiovascular Heart Disease (CHD). We also found a significant interaction effect between Time and Ejection Fraction. The AUC of this model is 0.916, showing that these five variables and the four interaction plots help predict patients' survival Cardiovascular Heart Disease (CHD) better than simply guessing, and the low p-value of 1.553978e-32 from the Likelihood Ratio test confirmed the significance of the model. The model’s sensitivity was 0.84 and specificity was 0.84375, suggesting that the model does a much better job predicting non-failures than failures. This analysis begs the question of what other variables can explain dam failure, as the variables from this set could not fully account for dam failure.
Some follow-up questions to this analysis could include:
- Our analysis shows that an increase in age decreases a patient's chance of survival; it also indicates that if a patient is older than 75 years, with EF of 1 or 2, no renal dysfunction and has a follow-up time after 100 days(not sick). Those patients will have a better survival rate than those who have the same age but have renal dysfunction and EF of 0. Will creating an educational program that prevents renal dysfunction and maintains an EF level of 1 or 2 to improve the survival rate for patients older than 75 years and who have a follow-up time after 100 days.
- Are there other factors that could be measured, such as obeisty, cholestrol, family history and physical inactivity that could improve the predictability of the model?