
For some reason my model keeps turning out to be a poor model when I check its accuracy with a confusion matrix and AUC-ROC. This is the model I stuck with after doing backward elimination; this is the logistic regression output:

```
Call:
glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_sodium + 
    time, family = binomial(link = "logit"), data = train, control = list(trace = TRUE))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1760  -0.6161  -0.2273   0.4941   2.6827  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        15.741338   7.534348   2.089  0.03668 *  
age                 0.063767   0.018533   3.441  0.00058 ***
ejection_fraction  -0.080520   0.019690  -4.089 4.33e-05 ***
serum_sodium       -0.111499   0.053639  -2.079  0.03765 *  
time               -0.020543   0.003331  -6.167 6.95e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

This is the confusion matrix output:

```
glm.pred Survived Dead
       0       46   10
       1        5   14
```

The AUC is showing up as 0.178. This is the code I used:

```r
library(pROC)
library(ROCR)  # prediction() and performance() below come from ROCR, not pROC

# Calculate predicted probabilities for the test set
glm.probs <- predict(glm9, newdata = test, type = "response")

# Create a prediction object for the test set
pred <- prediction(glm.probs, test$DEATH_EVENT)

# Create the ROC curve for the test set
roc.perf <- performance(pred, measure = "tpr", x.measure = "fpr")

# Plot the ROC curve for the test set
plot(roc.perf, xlab = "False Positive Percentage", ylab = "True Positive Percentage",
     col = "#3182bd", lwd = 4)

# Add the AUC to the ROC curve
auc <- as.numeric(performance(pred, measure = "auc")@y.values)
text(x = 0.5, y = 0.3, labels = paste0("AUC = ", round(auc, 3)), col = "black", cex = 1.5)
abline(a = 0, b = 1)
```
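
(Editorial aside: since pROC is already loaded, the AUC can also be cross-checked directly from the predicted probabilities. A minimal sketch, assuming the same `glm.probs` and `test` objects as above:)

```r
# Cross-check the AUC with pROC (assumes glm.probs and test from above)
roc_obj <- roc(test$DEATH_EVENT, glm.probs)
auc(roc_obj)
```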

Can someone help please? The project is due very soon and I can't get past this problem.

– rua
  • Significant p-values are no guarantee for high AUC or a good ROC curve. Neither is there a guarantee that the data allow for anything better. Do you have many observations? Maybe you get a better prediction quality without eliminating variables (which is often not good). – Christian Hennig Apr 01 '23 at 23:21
  • @ChristianHennig I have 299 observations and 13 variables – rua Apr 01 '23 at 23:23
  • The easiest step might be to relax the linearity assumption, for at least some of the predictors which all seem to be continuous. And the easiest way to allow for flexible, non-linear relationships is to use splines. – dipetkov Apr 02 '23 at 11:35
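
(To make the splines suggestion concrete, here is a minimal sketch, not from the original post, of relaxing linearity with natural cubic splines via `splines::ns()`. The variable names follow the question; the 3 degrees of freedom per predictor are an arbitrary illustration.)

```r
library(splines)

# Same outcome and predictors as in the question, but each continuous
# predictor enters through a natural cubic spline with 3 degrees of freedom
glm_spline <- glm(
  DEATH_EVENT ~ ns(age, df = 3) + ns(ejection_fraction, df = 3) +
    ns(serum_sodium, df = 3) + ns(time, df = 3),
  family = binomial(link = "logit"),
  data = train
)
summary(glm_spline)
```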

1 Answer


The p-values are calculated as if you did not do the backward elimination for feature selection. However, you did do feature selection, so the p-values are not valid for your model. This is related to issues $2$, $3$, $4$, and $7$ posted here (which are based on statistical theory and do not rely on any particular software, despite the source being a Stata website).

It seems that you overfit the feature selection to your training data, and you picked features that are solid predictors in the training data but turn out not to be in the test data.
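
One way to see this overfitting directly is to compare the in-sample and out-of-sample AUC. A minimal sketch, assuming the question's `glm9`, `train`, and `test` objects and the ROCR package:

```r
library(ROCR)

# AUC helper: predicted probabilities -> ROCR AUC
auc_of <- function(fit, data) {
  p <- predict(fit, newdata = data, type = "response")
  as.numeric(performance(prediction(p, data$DEATH_EVENT), "auc")@y.values)
}

auc_of(glm9, train)  # in-sample AUC: typically optimistic after selection
auc_of(glm9, test)   # out-of-sample AUC: the honest estimate
```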

Note that stepwise feature selection can be competitive when it comes to pure prediction problems, but the usual p-values and confidence intervals printed by software functions do not account for the feature selection and, thus, are too optimistic in favor of nonzero effects (rejection of null hypotheses that the parameters are zero).
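
To illustrate why the printed p-values are too optimistic, here is a small simulation (my own sketch, not part of the original answer): backward elimination is run on pure noise, yet the variables that survive tend to show small p-values.

```r
set.seed(1)
n <- 299; p <- 13  # mimic the dimensions mentioned in the comments
X <- as.data.frame(matrix(rnorm(n * p), n, p))
X$y <- rbinom(n, 1, 0.3)  # outcome generated independently of every predictor

full <- glm(y ~ ., family = binomial, data = X)
reduced <- step(full, direction = "backward", trace = 0)

# The retained "predictors" tend to have small p-values even though
# every true coefficient is zero
summary(reduced)$coefficients
```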

– Dave