
I have a classification problem with the following example independent features:

recommendations  comment_count  comment
0.663            0.382          'yes', 'trump'

The dependent variable is whether the comment is likely to receive a reply or not:

get_reply
0

I want to apply regularisation to the logistic regression model, but I can't decide between L1 and L2.

I want to do this for three different datasets of online comments: one for sports articles, one for magazine articles, and one for (national) politics.

I then want to interpret, for example, the 10 largest coefficients from each of these models. The following diagrams show this.

The first diagram is with the L1 penalty (test F1-score of 0.85); the second diagram is with the L2 penalty (test F1-score of 0.60). [coefficient plots not shown]

I am struggling to decide between the two models, and which would create a more interesting discussion. I find the L2 diagram easier to interpret: for example, a magazine comment with a high number of recommendations is likely to receive a reply. So I'm favoring L2, but the L1 diagram offers more interesting text words that appeared in the comments.

I aim to identify features that vary across the different news groupings (sports, politics, and magazine), and to point out similarities or differences that could be of importance.
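For reference, a minimal sketch of how the two penalties could be fitted and the largest coefficients extracted with scikit-learn. The data here is synthetic and the feature names, regularisation strength `C`, and top-k of 10 are placeholders, not the actual datasets or tuned values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: in the real problem the columns would be
# recommendations, comment_count, and bag-of-words features from the comment text.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2, size=500) > 0).astype(int)
feature_names = [f"feature_{i}" for i in range(20)]

for penalty in ("l1", "l2"):
    # liblinear supports both L1 and L2 penalties for binary logistic regression
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0)
    model.fit(X, y)
    coefs = model.coef_.ravel()
    # Rank features by absolute coefficient value and keep the top 10
    top = np.argsort(np.abs(coefs))[::-1][:10]
    print(penalty, [(feature_names[i], round(coefs[i], 3)) for i in top])
```

With the L1 penalty some of these coefficients are typically shrunk exactly to zero (so engineered features can drop out entirely), whereas L2 only shrinks them towards zero, which matches the difference visible between the two diagrams.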

Holly
    Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Mar 29 '22 at 11:58
  • What is your purpose? If you only want to classify how does test set performance compare? Why interpretability is important? – Tim Mar 29 '22 at 12:30
  • @Tim interpretability is important cause I want to point how the coefficients range across the news topics – Holly Mar 29 '22 at 12:40
  • @Tim the f1-score I added was for the test data. I updated the question. As L1 handles outliers, I think it removes important features, like the number of recommendations, so I am thinking L2 is better – Holly Mar 29 '22 at 12:41
  • How do you know that this feature is truly important? If you care about interpretability, why use regularization at all? – Tim Mar 29 '22 at 12:52
  • @Tim should I not try to use regularization to reduce overfitting/underfitting and so on? I was considering the features with the highest coefficient value, negative or positive, to be important, as they have more influence in predicting the dependent variable – Holly Mar 29 '22 at 12:56
  • Do you have a reason to believe that the model without regularization overfitted? With regularization, you get biased estimates for the parameters so it makes interpreting them harder. How much data do you have? – Tim Mar 29 '22 at 13:01
  • I was finding it hard to tell if it was overfitting, so I just assumed based on the amount of data. For one news grouping, e.g. magazine, there are 21,175 entries. So 21,175 comments, but a comment could have around 10 to 50 words, if not more. Would I be better off disregarding regularization? – Holly Mar 29 '22 at 13:05
  • I believe L1 is useful when you are interested in the most important features, which is true for my case. But I also believe it removes outliers, which can be seen in the plot, as it removes some of the features I engineered, e.g. comment count and recommendations. I would be tempted to go with L1 so it removes all the less important features, but I don't want it to disregard the features I engineered, which are influential based on the L2 plot – Holly Mar 29 '22 at 13:14
  • Many similar Qs here, for instance https://stats.stackexchange.com/questions/184019/when-will-l1-regularization-work-better-than-l2-and-vice-versa/184023#184023 – kjetil b halvorsen Mar 30 '22 at 01:55

0 Answers