I have a dataset with about 20 variables that I could use to diagnose an illness. In my dataset, I know whether each individual is sick or not.
I intend to create a model using classification trees. Before that, I want to select the best variables for this diagnosis.
I have tried two methods:
- Plotting the ROC curve and calculating the AUC for each variable
- Running an LDA using all my standardised variables
I then ordered my variables by AUC and by LDA coefficient (I have only one axis). The order is different depending on the method.
Here are the results (dput output, in R):
structure(list(AUC = c(1, 0.846, 0.805, 0.798, 0.767, 0.767,
0.757, 0.757, 0.683, 0.676, 0.673, 0.67, 0.665, 0.665, 0.664,
0.65, 0.639, 0.634, 0.624, 0.614, 0.568, 0.566, 0.563, 0.562,
0.559, 0.545, 0.523, 0.49, 0.479, 0.443), coeffs.LDA = c(NA,
0.91, 0.53, 0.22, 0.73, -0.04, -0.12, 0.27, 0.66, -0.35, -0.07,
-0.08, 0.45, -0.78, -0.24, 0.05, 0.09, 0.22, -0.08, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -30L), class = c("data.table",
"data.frame"))
# The NA LDA coefficients correspond to variables I eliminated before running the LDA.
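To make clear how the two columns above are obtained, here is a minimal sketch on simulated data (not my real data). It assumes the AUC is computed for each variable on its own, while the LDA is fitted once on all standardised variables together. The variable names (sick, x1, x2, x3) and the helper auc() are hypothetical and only serve the illustration; auc() uses the rank-sum formula rather than a dedicated ROC package.

```r
library(MASS)  # provides lda(); MASS ships with standard R installations

set.seed(1)
n <- 200
sick <- rep(c(0, 1), each = n / 2)  # hypothetical disease status (0 = healthy, 1 = sick)

# Simulated predictors (hypothetical): x1 and x2 are correlated by construction,
# x3 is independent of the disease status.
x1 <- rnorm(n, mean = sick)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X <- scale(cbind(x1, x2, x3))  # standardise, as done before the LDA

# AUC of a single variable, computed from the rank-sum (Mann-Whitney) statistic
auc <- function(x, y) {
  r <- rank(x)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Method 1: one AUC per variable, each variable taken alone
aucs <- apply(X, 2, auc, y = sick)

# Method 2: a single LDA on all variables; with two classes there is one axis
fit <- lda(X, grouping = factor(sick))
coefs <- fit$scaling[, 1]

# Same layout as the table above
data.frame(AUC = round(aucs, 3), coeffs.LDA = round(coefs, 2))
```

The resulting data frame has one row per variable, which is how I built the two rankings I am comparing.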
Aren't these two methods supposed to measure the same thing? Why is the order so different? Which indicator should I use?
P.S.: I have already seen this, but it did not answer my question, because the difference there came from an error in the AUC calculation.