I'm currently training a logistic regression model for spam detection on a training set of 4757 observations and 3000 variables, each counting the frequency of one word. Training is taking a very long time and I have a deadline coming up, so I wanted to run the code by you to make sure I'm not waiting for nothing.

R code for logistic regression model

Any advice would be much appreciated.

  • Do you really need all 3000 words as variables? You can't trim them down? – Allan Cameron May 23 '22 at 14:19
  • I figured I would get a better model if I kept all of the variables. Maybe I could remove the words which occur in both spam and non-spam, leaving only words which occur in one or the other. Does that sound reasonable? I think that would leave about 500 remaining as there is a lot of overlap –  May 23 '22 at 14:24
  • If you remove the words that occur in both, you will be left with only spam-only and non-spam-only words, so there is no point in doing a logistic regression: you will get a singular fit, with 100% probabilities. If anything, you could do a logistic regression on the overlapping words. With this type of set-up, I might be more inclined to just do a simple table of proportions or odds. If you see a particular word in an email, what are the odds that the email is spam? This takes much less computation and gives an easily interpretable result. – Allan Cameron May 23 '22 at 14:34
  • Can you edit the question to be specific about what in particular you want to know? – Sycorax May 23 '22 at 15:47
  • For a logistic regression with n=4757, p=3000 I would suggest that you would be better off using a penalized regression via glmnet - there may be some more startup cost in figuring out how to use the machinery, but it will be better (and probably faster) in the long run. – Ben Bolker May 23 '22 at 17:05
  • As you have several hundred words restricted to one or the other class, your model presumably suffers from perfect separation and its associated numerical problems. The suggestion from @BenBolker would overcome that; glmnet is evidently supported by caret. There's also a Firth penalization of logistic regression models implemented in the R logistf package. I don't know if that's directly supported by caret, but I understand that you can write your own methods to call it. – EdM May 23 '22 at 19:02
  • The name of the famous software is R not r! Edited ... – kjetil b halvorsen May 23 '22 at 19:23
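
The table of proportions or odds suggested in the comments could look roughly like the sketch below. It uses base R only and simulated data, since the original code and data are not shown: the word names, the matrix `x`, the label vector `is_spam`, and the 0.5 smoothing constant are all assumptions made for illustration.

```r
# Toy data (hypothetical): rows are emails, columns are word counts.
set.seed(1)
x <- matrix(rpois(200 * 5, lambda = 0.5), nrow = 200,
            dimnames = list(NULL, c("free", "meeting", "offer", "report", "win")))
is_spam <- rbinom(200, 1, 0.4)  # 1 = spam, 0 = not spam

# Does each email contain each word at least once?
present <- x > 0

# Emails containing each word, overall and among spam only.
n_with_word      <- colSums(present)
n_spam_with_word <- colSums(present[is_spam == 1, , drop = FALSE])

# Share of emails containing the word that are spam. Adding 0.5 to the spam
# and non-spam counts (an assumed smoothing choice) keeps the odds finite
# for words that appear in only one class.
p_spam    <- (n_spam_with_word + 0.5) / (n_with_word + 1)
odds_spam <- p_spam / (1 - p_spam)

word_table <- data.frame(word = colnames(x),
                         n_emails = n_with_word,
                         p_spam = p_spam,
                         odds_spam = odds_spam,
                         row.names = NULL)

# Words most associated with spam first.
word_table[order(-word_table$odds_spam), ]
```

This involves only column sums, so it should stay fast even with 3000 words, although it looks at each word in isolation rather than jointly.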

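
The penalized regression via glmnet suggested in the comments might be set up along the lines of this minimal sketch. It assumes the glmnet package is installed and uses simulated data in place of the real word counts; the dimensions, the variable names, the lasso penalty (`alpha = 1`), and 5-fold cross-validation are illustrative choices, not the asker's actual setup.

```r
library(glmnet)  # assumes the glmnet package is installed

# Simulated stand-in for the word-count matrix (much smaller than 4757 x 3000).
set.seed(42)
n <- 300
p <- 50
x <- matrix(rpois(n * p, lambda = 0.3), nrow = n,
            dimnames = list(NULL, paste0("word", seq_len(p))))

# Simulated labels: only the first three words carry signal (toy assumption).
eta <- -1 + 1.5 * x[, 1] - 1.5 * x[, 2] + 1.0 * x[, 3]
y <- rbinom(n, 1, plogis(eta))

# Lasso-penalized logistic regression, with lambda chosen by 5-fold CV.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Predicted spam probabilities at the cross-validated lambda.
p_hat <- predict(cv_fit, newx = x, s = "lambda.min", type = "response")

# Words whose coefficients are non-zero at that lambda.
beta <- coef(cv_fit, s = "lambda.min")
selected <- rownames(beta)[beta[, 1] != 0]
selected
```

The penalty keeps coefficient estimates finite when some words appear in only one class, which is the perfect-separation problem raised in the comments. glmnet also accepts sparse matrices from the Matrix package, which may help with mostly-zero word counts.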