
I have a classification problem. The actual outcomes are binary (0 or 1), but I want to predict probabilities, rather than predicting simply 0 or 1. I also want something with feature selection, since there are a lot of predictors. One approach that I want to try is L1-regularized logistic regression (specifically this implementation in statsmodels). One has to find a value for $\alpha$, the weight of the L1-regularization. I plan to do this in the following way:

  1. Select some potential values of $\alpha$, say 0.001, 0.01, 0.1, 1, 10 and 100.

  2. Employ 5-fold cross-validation: Fit the model on the union of the four training folds (using the aforementioned method) and then calculate the mean absolute error (MAE) on the test fold. A toy example: If the actual outcomes in the test fold are [1, 0, 1, 0] and the predicted probabilities are [0.9, 0.2, 0.8, 0.7], then the MAE is 0.3 (= (0.1 + 0.2 + 0.2 + 0.7) / 4).

  3. Repeat step 2 for each of the five cross-validation runs and then calculate the mean MAE. A toy example: If the MAEs of the cross-validation runs are 0.2, 0.1, 0.3, 0.3 and 0.1, then the mean MAE is 0.2 (= (0.2 + 0.1 + 0.3 + 0.3 + 0.1) / 5).

  4. Repeat steps 2 and 3 for each value of $\alpha$ given in step 1.

  5. Choose the value of $\alpha$ with the lowest mean MAE. (A code sketch of this procedure follows the list.)
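
A minimal sketch of steps 1–5 in code, assuming the (already scaled) features sit in a NumPy array `X` and the binary outcomes in an array `y` (both placeholder names), using statsmodels' `Logit.fit_regularized(method='l1')` for the penalized fit and sklearn's `KFold` purely for the splitting:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

alphas = [0.001, 0.01, 0.1, 1, 10, 100]  # step 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

mean_maes = {}
for alpha in alphas:  # step 4: loop over candidate penalty weights
    fold_maes = []
    for train_idx, test_idx in kf.split(X):  # steps 2 and 3
        X_train = sm.add_constant(X[train_idx])
        X_test = sm.add_constant(X[test_idx])
        fit = sm.Logit(y[train_idx], X_train).fit_regularized(
            method='l1', alpha=alpha, disp=False)
        p_hat = fit.predict(X_test)  # predicted probabilities
        fold_maes.append(np.mean(np.abs(y[test_idx] - p_hat)))
    mean_maes[alpha] = np.mean(fold_maes)

best_alpha = min(mean_maes, key=mean_maes.get)  # step 5
```

Note that with a scalar `alpha` statsmodels penalizes the intercept as well; `alpha` can also be passed as an array of per-coefficient weights, so a 0 in the intercept's position leaves it unpenalized.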

Is this a sensible approach? Is it theoretically sound or would an information criterion such as the AIC be better? There is this nice guide from sklearn, but it is for linear regression, rather than logistic regression; in any case, they use the mean squared error. The AIC takes the number of parameters into account (the fewer the better), but the cross-validation approach does not. Since I want feature selection, I would be willing to sacrifice some predictive accuracy for the sake of having fewer features in the model.

To give a rough picture: The data contains approximately 120 features and 10000 rows. I have scaled the data. And to avoid any confusion: The approach uses the MAE only for hyperparameter tuning, not for the model fitting itself.

EDIT: Another potential approach would be to calculate the likelihood of the test-fold predictions:

$$ \prod_{\text{outcome is 1}}\text{predicted probability} \;\times \prod_{\text{outcome is 0}}\left(1 - \text{predicted probability}\right) $$

Would this be a better scoring method than the MAE?
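
For reference, a minimal sketch of this score in log form (summing log-probabilities avoids the numerical underflow that multiplying many small probabilities would cause); `y_test` and `p_hat` are placeholder names for the test-fold outcomes and predicted probabilities:

```python
import numpy as np

def test_fold_log_likelihood(y_test, p_hat, eps=1e-15):
    p = np.clip(p_hat, eps, 1 - eps)  # guard against log(0)
    return np.sum(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
```

Higher is better here; negating and averaging gives the log loss (cross-entropy) mentioned in the comments, which sklearn exposes as `sklearn.metrics.log_loss`.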

– dwolfeu
  • Why use MAE instead of the cross-entropy loss (log loss) or even square loss? Absolute loss is problematic for probability predictions. // Why do you want to use LASSO to zero-out features? Except in a narrow sense, LASSO doesn't really do feature selection. You still consider many features but just estimate their effects to be zero. – Dave Mar 05 '23 at 17:06
  • By "feature selection" here I just mean having some coefficients set to zero. (I have already removed a number of columns prior to any model fitting.) The idea to use the MAE came from adapting the approach in Introduction to Statistical Learning, specifically equation (5.4) where the indicator function is used. I had completely overlooked the cross-entropy – thank you! – dwolfeu Mar 05 '23 at 17:18
  • But why do you want those coefficients set to zero? – Dave Mar 05 '23 at 17:20
  • I want to interpret the model, so whittling down the features to those that are relevant would be helpful. Is feature selection not one of the advantages of L1-regularization or have I misunderstood? – dwolfeu Mar 05 '23 at 17:25
  • LASSO definitely does tend to zero-out many coefficients. I dispute that this is feature selection (at least in a strong sense), since you still consider all of the features in your model training. – Dave Mar 05 '23 at 17:28

1 Answer


Per the thread Dave links to, minimizing the MAE will incentivize you towards biased "hard classifications": if 60% of samples with a given predictor configuration are of class A, then the MAE-optimal choice is to predict a 100% (not 60%) probability that they are of class A.
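
A quick numeric illustration of this claim, as a minimal sketch (the 60/40 mix is the hypothetical example above):

```python
import numpy as np

# Expected MAE of a constant probability prediction p when 60% of
# outcomes are 1: 0.6 * (1 - p) + 0.4 * p = 0.6 - 0.2 * p,
# which decreases in p, so it is minimized at p = 1, not p = 0.6.
p = np.linspace(0, 1, 101)
expected_mae = 0.6 * (1 - p) + 0.4 * p
print(p[np.argmin(expected_mae)])  # -> 1.0
```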

Minimizing the MAE is thus equivalent to maximizing accuracy, which has major problems.

I sympathize with your goal of sparse probabilistic classification, but minimizing the MAE is not the way to go about it.

– Stephan Kolassa