Is my R logistic regression model ok? (NOVICE)

Question

I'm currently training a model for spam detection on a training set of 4757 observations and 3000 variables, each counting the frequency of one word. It is taking forever, and I have a deadline coming up, so I wanted to run the code by you to make sure I'm not waiting for nothing.

R code for logistic regression model

Any advice would be much appreciated.

Do you really need all 3000 words as variables? You can't trim them down? — Allan Cameron, May 23 '22 at 14:19
I figured I would get a better model if I kept all of the variables. Maybe I could remove the words which occur in both spam and non-spam, leaving only words which occur in one or the other. Does that sound reasonable? I think that would leave about 500 words, as there is a lot of overlap. — , May 23 '22 at 14:24
If you remove the words that occur in both classes, you will be left with words that appear only in spam or only in non-spam, so there is no point in doing a logistic regression: each remaining word predicts its class perfectly, and you will get a singular fit with fitted probabilities of 0 or 1. If anything, you could do a logistic regression on the overlapping words. With this type of set-up, I might be more inclined to just do a simple table of proportions or odds: if you see a particular word in an e-mail, what are the odds that the e-mail is spam? This takes much less computation and gives an easily interpretable result. — Allan Cameron, May 23 '22 at 14:34
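
The table-of-proportions idea in the comment above could be sketched roughly as follows. This is a minimal illustration in base R, not the commenter's code: the six toy e-mails, the three word columns, and the helper name `word_summary` are all invented for the example.

```r
# Toy data (invented): 6 e-mails, 3 word-count columns.
is_spam <- c(1, 1, 1, 0, 0, 0)
counts <- cbind(
  free    = c(2, 1, 0, 0, 1, 0),
  meeting = c(0, 0, 1, 2, 1, 1),
  the     = c(3, 1, 2, 4, 2, 1)
)

# Hypothetical helper: for each word, look only at e-mails that contain it
# and report the share of those e-mails that are spam, plus the odds.
word_summary <- function(counts, is_spam) {
  present   <- counts > 0                           # word appears at least once
  n_with    <- colSums(present)                     # e-mails containing the word
  spam_with <- colSums(present & (is_spam == 1))    # ...that are also spam
  data.frame(
    word      = colnames(counts),
    n_emails  = unname(n_with),
    p_spam    = unname(spam_with / n_with),
    odds_spam = unname(spam_with / (n_with - spam_with)),
    row.names = NULL
  )
}

res <- word_summary(counts, is_spam)
print(res)
```

A word that never appears gives `NaN`, and a word seen only in spam gives infinite odds, which is the same separation issue discussed in the later comments showing up in tabular form.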
Can you edit the question to be specific about what in particular you want to know? — Sycorax, May 23 '22 at 15:47
For a logistic regression with n=4757, p=3000 I would suggest that you would be better off using a penalized regression via glmnet - there may be some more startup cost in figuring out how to use the machinery, but it will be better (and probably faster) in the long run. — Ben Bolker, May 23 '22 at 17:05
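
A penalized fit along the lines suggested above might look like the sketch below. It assumes the `glmnet` package is installed, and it uses simulated word counts (300 rows, 50 columns, three of which carry signal) rather than the asker's data, which is not shown in the question. The lasso penalty (`alpha = 1`) and five folds are illustrative choices, not recommendations from the thread.

```r
library(glmnet)  # assumed to be installed; not part of base R

# Simulated stand-in for a word-count matrix (all values invented).
set.seed(42)
n <- 300
p <- 50
x <- matrix(rpois(n * p, lambda = 1), nrow = n, ncol = p)
colnames(x) <- paste0("word", seq_len(p))

# Only the first three words influence the outcome in this simulation.
eta <- -1 + 1.2 * x[, 1] - 1.0 * x[, 2] + 0.8 * x[, 3]
y <- rbinom(n, 1, plogis(eta))

# Lasso-penalized logistic regression; lambda is chosen by cross-validation.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Coefficients at the more conservative lambda; the first row is the intercept.
coefs <- as.numeric(as.matrix(coef(cv_fit, s = "lambda.1se")))
n_nonzero <- sum(coefs[-1] != 0)

# Predicted spam probabilities for the first five rows.
p_hat <- predict(cv_fit, newx = x[1:5, ], s = "lambda.1se", type = "response")
```

With real word counts, most entries are zero, so passing a sparse matrix from the `Matrix` package as `x` may reduce memory use and fitting time.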
As you have several hundred words restricted to one or the other class, your model presumably suffers from perfect separation and its associated numerical problems. The suggestion from @BenBolker would overcome that; glmnet is evidently supported by caret. There's also a Firth penalization of logistic regression models implemented in the R logistf package. I don't know if that's directly supported by caret but I understand that you can write your own methods to call that. — EdM, May 23 '22 at 19:02
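
The separation problem described above can be reproduced on a tiny scale in base R. The eight-row data set below is invented: the word occurs only in spam, so it predicts the class perfectly. The sketch shows only the symptom in an unpenalized `glm`; it does not use `logistf` or `caret`.

```r
# Invented data: the word appears in every spam e-mail and in no other.
d <- data.frame(
  y              = c(0, 0, 0, 0, 1, 1, 1, 1),
  only_spam_word = c(0, 0, 0, 0, 1, 2, 1, 3)
)

# Fit an ordinary logistic regression and collect any warnings it raises.
warnings_seen <- character(0)
fit <- withCallingHandlers(
  glm(y ~ only_spam_word, data = d, family = binomial),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

slope <- unname(coef(fit)["only_spam_word"])
print(warnings_seen)   # glm.fit warnings triggered by the separated data
print(summary(fit)$coefficients)
```

The slope estimate runs off toward infinity and its standard error becomes enormous, which is why a penalty (ridge, lasso, or Firth) is the usual remedy.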
