
I have a classification problem. The actual outcomes are binary (0 or 1), but I want to predict probabilities, rather than predicting simply 0 or 1. I also want something with feature selection, since there are a lot of predictors. One approach that I want to try is L1-regularized logistic regression (specifically this implementation in statsmodels). One has to find a value for $\alpha$, the weight of the L1-regularization. I plan to do this in the following way:

  1. Select some potential values of $\alpha$, say 0.001, 0.01, 0.1, 1, 10 and 100.

  2. Employ 5-fold cross-validation: Fit the model on the union of the four training folds (using the aforementioned method) and then calculate the mean absolute error (MAE) on the test fold. A toy example: If the actual outcomes in the test fold are [1, 0, 1, 0] and the predicted probabilities are [0.9, 0.2, 0.8, 0.7], then the MAE is 0.3 (= (0.1 + 0.2 + 0.2 + 0.7) / 4).

  3. Repeat step 2 for each of the five cross-validation runs and then calculate the mean MAE. A toy example: If the MAEs of the cross-validation runs are 0.2, 0.1, 0.3, 0.3 and 0.1, then the mean MAE is 0.2 (= (0.2 + 0.1 + 0.3 + 0.3 + 0.1) / 5).

  4. Repeat steps 2 and 3 for each value of $\alpha$ given in step 1.

  5. Choose the value of $\alpha$ with the lowest mean MAE. (A code sketch of this procedure follows the list.)
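
A minimal sketch of steps 1–5 in code, assuming the (already scaled) features sit in a NumPy array `X` and the binary outcomes in an array `y` (both placeholder names), using statsmodels' `Logit.fit_regularized(method='l1')` for the penalized fit and sklearn's `KFold` purely for the splitting:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

alphas = [0.001, 0.01, 0.1, 1, 10, 100]  # step 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

mean_maes = {}
for alpha in alphas:  # step 4: loop over candidate penalty weights
    fold_maes = []
    for train_idx, test_idx in kf.split(X):  # steps 2 and 3
        X_train = sm.add_constant(X[train_idx])
        X_test = sm.add_constant(X[test_idx])
        fit = sm.Logit(y[train_idx], X_train).fit_regularized(
            method='l1', alpha=alpha, disp=False)
        p_hat = fit.predict(X_test)  # predicted probabilities
        fold_maes.append(np.mean(np.abs(y[test_idx] - p_hat)))
    mean_maes[alpha] = np.mean(fold_maes)

best_alpha = min(mean_maes, key=mean_maes.get)  # step 5
```

Note that with a scalar `alpha` statsmodels penalizes the intercept as well; `alpha` can also be passed as an array of per-coefficient weights, so a 0 in the intercept's position leaves it unpenalized.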

Is this a sensible approach? Is it theoretically sound or would an information criterion such as the AIC be better? There is this nice guide from sklearn, but it is for linear regression, rather than logistic regression; in any case, they use the mean squared error. The AIC takes the number of parameters into account (the fewer the better), but the cross-validation approach does not. Since I want feature selection, I would be willing to sacrifice some predictive accuracy for the sake of having fewer features in the model.

To give a rough picture: The data contains approximately 120 features and 10000 rows. I have scaled the data. And to avoid any confusion: The approach uses the MAE only for hyperparameter tuning, not for the model fitting itself.

EDIT: Another potential approach would be to calculate the likelihood of the test-fold predictions:

$$ \prod_{\text{outcome is 1}}\text{predicted probability} \;\times \prod_{\text{outcome is 0}}\left(1 - \text{predicted probability}\right) $$

Would this be a better scoring method than the MAE?
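
For reference, a minimal sketch of this score in log form (summing log-probabilities avoids the numerical underflow that multiplying many small probabilities would cause); `y_test` and `p_hat` are placeholder names for the test-fold outcomes and predicted probabilities:

```python
import numpy as np

def test_fold_log_likelihood(y_test, p_hat, eps=1e-15):
    p = np.clip(p_hat, eps, 1 - eps)  # guard against log(0)
    return np.sum(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))
```

Higher is better here; negating and averaging gives the log loss (cross-entropy) mentioned in the comments, which sklearn exposes as `sklearn.metrics.log_loss`.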

– dwolfeu
  • Why use MAE instead of the cross-entropy loss (log loss) or even square loss? Absolute loss is problematic for probability predictions. // Why do you want to use LASSO to zero-out features? Except in a narrow sense, LASSO doesn't really do feature selection. You still consider many features but just estimate their effects to be zero. – Dave Mar 05 '23 at 17:06
  • By "feature selection" here I just mean having some coefficients set to zero. (I have already removed a number of columns prior to any model fitting.) The idea to use the MAE came from adapting the approach in Introduction to Statistical Learning, specifically equation (5.4) where the indicator function is used. I had completely overlooked the cross-entropy – thank you! – dwolfeu Mar 05 '23 at 17:18
  • But why do you want those coefficients set to zero? – Dave Mar 05 '23 at 17:20
  • I want to interpret the model, so whittling down the features to those that are relevant would be helpful. Is feature selection not one of the advantages of L1-regularization or have I misunderstood? – dwolfeu Mar 05 '23 at 17:25
  • LASSO definitely does tend to zero-out many coefficients. I dispute that this is feature selection (at least in a strong sense), since you still consider all of the features in your model training. – Dave Mar 05 '23 at 17:28

1 Answer


Per the thread Dave links to, minimizing the MAE will incentivize you towards biased "hard classifications": if 60% of samples with a given predictor configuration are of class A, then the MAE-optimal choice is to predict a 100% (not 60%) probability that they are of class A.
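
A quick numeric illustration of this claim, as a minimal sketch (the 60/40 mix is the hypothetical example above):

```python
import numpy as np

# Expected MAE of a constant probability prediction p when 60% of
# outcomes are 1: 0.6 * (1 - p) + 0.4 * p = 0.6 - 0.2 * p,
# which decreases in p, so it is minimized at p = 1, not p = 0.6.
p = np.linspace(0, 1, 101)
expected_mae = 0.6 * (1 - p) + 0.4 * p
print(p[np.argmin(expected_mae)])  # -> 1.0
```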

Minimizing the MAE is thus equivalent to maximizing accuracy, which has major problems.

I sympathize with your goal of sparse probabilistic classification, but minimizing the MAE is not the way to go about it.

– Stephan Kolassa