
I have a dataset with multiple variables (~20) that I could use to diagnose an illness. In my dataset, I know whether each individual is sick or not.

I intend to create a model using classification trees. Before that, I want to select the best variables for this diagnosis.

I have tried two methods:

  1. Plotting the ROC curve and calculating the AUC for each variable
  2. Running an LDA using all my standardised variables

I then ordered my variables by AUC and by LDA coefficient (I have only one discriminant axis). The order differs depending on the method.

(in R):

# dput() output with the non-reproducible data.table pointer removed,
# so that it can be pasted as a plain data.frame
structure(list(AUC = c(1, 0.846, 0.805, 0.798, 0.767, 0.767, 
0.757, 0.757, 0.683, 0.676, 0.673, 0.67, 0.665, 0.665, 0.664, 
0.65, 0.639, 0.634, 0.624, 0.614, 0.568, 0.566, 0.563, 0.562, 
0.559, 0.545, 0.523, 0.49, 0.479, 0.443), coeffs.LDA = c(NA, 
0.91, 0.53, 0.22, 0.73, -0.04, -0.12, 0.27, 0.66, -0.35, -0.07, 
-0.08, 0.45, -0.78, -0.24, 0.05, 0.09, 0.22, -0.08, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -30L), class = "data.frame")

# NA LDA coefficients correspond to variables I eliminated before running the LDA.
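
To put a number on the disagreement, here is a minimal sketch that compares the two orderings for the variables that went into the LDA. The object names `res` and `cmp` are only illustrative labels, and ranking the LDA coefficients by absolute value is an assumption about how they should be ordered, not something I am sure is correct.

```r
# Same table as above, rebuilt as a plain data.frame (the name `res` is illustrative).
res <- data.frame(
  AUC = c(1, 0.846, 0.805, 0.798, 0.767, 0.767,
          0.757, 0.757, 0.683, 0.676, 0.673, 0.67, 0.665, 0.665, 0.664,
          0.65, 0.639, 0.634, 0.624, 0.614, 0.568, 0.566, 0.563, 0.562,
          0.559, 0.545, 0.523, 0.49, 0.479, 0.443),
  coeffs.LDA = c(NA,
          0.91, 0.53, 0.22, 0.73, -0.04, -0.12, 0.27, 0.66, -0.35, -0.07,
          -0.08, 0.45, -0.78, -0.24, 0.05, 0.09, 0.22, -0.08,
          rep(NA, 11))
)

# Keep only the variables that were actually included in the LDA.
cmp <- res[!is.na(res$coeffs.LDA), ]

# Rank 1 = "best" under each criterion.
# Assumption: importance in the LDA is taken as the absolute coefficient.
cmp$rank.AUC <- rank(-cmp$AUC, ties.method = "average")
cmp$rank.LDA <- rank(-abs(cmp$coeffs.LDA), ties.method = "average")

# Spearman correlation between the two orderings
# (values near 1 would mean the two methods agree on the order).
rho <- cor(cmp$rank.AUC, cmp$rank.LDA, method = "spearman")

print(cmp[order(cmp$rank.AUC), ])
print(rho)
```

The AUC is computed one variable at a time, whereas each LDA coefficient is estimated jointly with all the other variables, which may be part of why the ranks do not line up.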

Aren't these two methods supposed to measure the same thing? Why is the order so different? Which indicator should I use?

P.S.: I have already seen this, but it did not answer my question, because the difference there came from an error in the AUC calculation.

Dimitri