
In univariate feature ranking for classification, it is common to use the χ² test (implemented in both MATLAB and scikit-learn) to calculate importance scores as the negative log of the associated p-value:

$Importance = -\ln(p)$

Is it possible to define an importance threshold corresponding to p = 0.05, in order to exclude variables whose p-value is above that limit? In this case, $-\ln(0.05) \approx 3$.
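Since the negative log is monotone decreasing, thresholding the importance at $-\ln(0.05)$ is exactly the same as thresholding the p-value at 0.05. A minimal pure-Python sketch (the p-values and predictor names are made up for illustration):

```python
import math

# Importance as defined above: -ln(p)
def importance(p):
    return -math.log(p)

# Threshold on the importance scale corresponding to p = 0.05
threshold = -math.log(0.05)  # about 2.996

# Made-up p-values for four hypothetical candidate predictors
p_values = {"x1": 0.001, "x2": 0.04, "x3": 0.05, "x4": 0.3}

# Keeping importance >= threshold is exactly keeping p <= 0.05
selected = [name for name, p in p_values.items()
            if importance(p) >= threshold]
print(selected)  # -> ['x1', 'x2', 'x3']
```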

So far I have been doing the selection by means of noise injection; however, I noticed that the selection tends to change from run to run. Would it be possible to take the mean importance score of many random variables and define the threshold there?

Thanks for any hint!

Tino D
    it is common to use the χ² ... citation needed, this procedure looks as arbitrary as the rest of them. – user2974951 Nov 08 '23 at 09:43
  • @user2974951 citation added to a well documented function in MATLAB, thanks for the hint – Tino D Nov 08 '23 at 09:46
    Note that just because something is implemented in (even well known) software does not mean it makes sense. This holds especially for statistical functionality. Just as statisticians are not software engineers, software engineers or mathematicians are not statisticians. There is a lot of "Wikipedia-grade" statistics implemented out there, and we regularly get questions where the best answer is "you should not be doing that in the first place". ... – Stephan Kolassa Nov 08 '23 at 09:54
    ... In the present situation, what p-value is used? Why should we care about it? The relationship between a predictor and the outcome may only happen in an interaction, which we may or may not be modeling. A p-value from a $\chi^2$ test does not account for such interactions, but if our classifier is (pretty much) any ML method, and we have enough data, interactions will be modeled by the classifier. This thread is very much recommended. – Stephan Kolassa Nov 08 '23 at 09:56
  • That said, for your first question, you could simply use a threshold on either the importance or the p value. Since the log is monotone, this amounts to precisely the same thing. For your second question, "noise" is random, so of course there will be variations between runs. Yes, you could in principle take averages over multiple runs. (But this actually tells you that the signal is not very strong in your data, so "hard" feature selection may not be the best approach. Have you considered some way of regularization?) – Stephan Kolassa Nov 08 '23 at 10:00
  • @StephanKolassa Thanks for the answer, at the end I guess I will have to implement them and check for differences. – Tino D Nov 08 '23 at 10:11
  • @user2974951 Thanks for the extra sources! – Tino D Nov 08 '23 at 10:11
    Variable selection is generally a bad idea, but if you are going to do it, do it in the context of what is really going on — a ranking and selection problem. And use measures that expose how truly difficult this task is. See this. – Frank Harrell Nov 08 '23 at 12:41

1 Answer


I ended up generating many random predictors and looking at their importance scores. This gave me a range of importance levels stemming from purely random variables, which I interpreted as follows:

  • conservative selection of variables: take the highest importance level achieved by any random variable and set it as the threshold for selection
  • non-conservative selection of variables: take the lowest importance level achieved by a random variable
  • average: take the mean importance of the random variables

I ended up taking the most conservative approach, which reduced my dataset from 2000 predictors to about 200. The result did not differ that much from the non-conservative approach, which would have led to about 300 predictors.
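The procedure above can be sketched in pure Python. Here `score` is a hypothetical stand-in for the χ² importance (just the absolute Pearson correlation between a predictor and a binary label), and the data are simulated: one informative predictor plus many pure-noise predictors, whose scores define the three candidate thresholds.

```python
import random
import math

random.seed(0)
n = 500

# Binary outcome, one genuinely informative predictor,
# and many pure-noise predictors (the "noise injection")
y = [random.randint(0, 1) for _ in range(n)]
signal = [yi + random.gauss(0, 0.5) for yi in y]
noise = {f"rand{j}": [random.gauss(0, 1) for _ in range(n)]
         for j in range(50)}

def score(x, y):
    """Stand-in importance: |Pearson correlation| with the label.
    (The question uses -ln(p) from a chi-squared test instead.)"""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (vx * vy))

noise_scores = [score(x, y) for x in noise.values()]

# The three thresholds described above
conservative = max(noise_scores)                    # strictest: fewest survivors
lenient = min(noise_scores)                         # loosest: most survivors
average = sum(noise_scores) / len(noise_scores)     # in between

print(conservative, average, lenient)
print(score(signal, y))  # the real signal should clear even the strict bar
```

Averaging the noise scores over many random predictors (or many runs) stabilizes the threshold, but as the comments note, it also reveals how close the signal is to the noise floor.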

Tino D