I'm currently in the midst of running several logistic regression models to test for effect modification (i.e., testing interaction terms) between two categorical variables (sex and age as a categorical variable).

I realized that I'm not quite sure whether I should convert all categorical variables to factors. It seems reasonable that a categorical variable should be made into a factor rather than left as an integer, but I don't fully understand the implications of factoring vs. not factoring. I assume that "factoring" is a common term across languages, but I'm referring to R specifically.

If anyone could add some mathematical clarity it would be greatly appreciated.

Notably, I referenced logistic regression but I assume the implications would be similar across other distributions/links. Also, I played around with the model before posting and it didn't make much of a difference (save for interpretation if I left age category numeric) but I'm sure this is not always the case.

Brennan Beal
  • How many levels do your categorical variables have? When it is only two levels then the difference between categorical/scalar does not matter. – Sextus Empiricus Jun 02 '20 at 16:47

2 Answers

Assuming your categorical features are stored as numbers, R will treat the values as interval data, which means that 3>2>1 and 1+2=3. If 1 represents "male", 2 represents "female", and 3 represents "not specified", then you can see that thinking of the variable as numeric makes no sense. R would estimate a single coefficient for the effect of gender, so the difference in effect between "not specified" and "male" would be forced to be exactly twice the difference between "female" and "male". That is not what you want in that case. When you make gender a factor, R instead creates dummy (indicator) variables. With the default treatment contrasts, one level (say "male") becomes the reference and is absorbed into the intercept, and a separate coefficient is estimated for each of the other levels ("female" and "not specified"). This is what you want.
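To make the difference concrete, here is a minimal sketch of the design matrices R builds in each case. The data and the name `gender_code` are made up for illustration; the coding (1 = male, 2 = female, 3 = not specified) follows the example above.

```r
# Hypothetical toy data: 1 = male, 2 = female, 3 = not specified
gender_code <- c(1, 2, 3, 1, 2, 3)

# Numeric coding: a single column, so the model fits one slope
# and forces equal spacing between the three codes
X_numeric <- model.matrix(~ gender_code)

# Factor coding: one indicator column per non-reference level,
# so each level gets its own coefficient relative to "male"
gender <- factor(gender_code, levels = 1:3,
                 labels = c("male", "female", "not specified"))
X_factor <- model.matrix(~ gender)

print(colnames(X_numeric))
print(colnames(X_factor))
```

The numeric version has an intercept plus one column, while the factor version has an intercept plus two indicator columns, which is where the extra flexibility (and the extra parameter) comes from.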

Some notes:

  1. If your variable has only two levels (e.g., only male and female), then turning it into a factor will not make any difference in performance or predictions versus representing it as a number. However, if you aren't using 0 and 1 to represent the two levels, the model coefficients (the intercept in particular) will be harder to interpret. Thank you for the comment below pointing this out.

  2. Making a variable into a factor treats it as a nominal feature, which means the levels are not considered ordered in any way. Age group is ordinal, which means the order matters, but the differences between levels are somewhat arbitrary. For an ordinal variable, it is occasionally better to represent the values as integers that preserve the original order. I imagine there are other ways to deal with ordinal features as well. Converting them to factors may very well be the best option, however, especially if you have a lot of data and not many distinct values for age range.
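As a rough check of note 1, the sketch below fits the same logistic regression twice on simulated data, once with a two-level code left numeric and once with it converted to a factor. Everything here (the sample size, the 1/2 coding, the true coefficients) is an assumption chosen only for illustration.

```r
set.seed(1)  # simulated data, purely for illustration

n <- 200
sex_code <- sample(c(1, 2), n, replace = TRUE)  # hypothetical codes: 1 = male, 2 = female
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * (sex_code == 2)))

fit_num <- glm(y ~ sex_code, family = binomial)          # code used as a number
fit_fac <- glm(y ~ factor(sex_code), family = binomial)  # code used as a factor

# Largest gap between the fitted probabilities of the two models
max_diff <- max(abs(fitted(fit_num) - fitted(fit_fac)))
print(max_diff)

# Same slope, different intercepts: with the numeric 1/2 coding the
# intercept refers to a nonexistent code of 0, not to either group
print(coef(fit_num))
print(coef(fit_fac))
```

The fitted probabilities should agree up to numerical tolerance, and only the intercept changes meaning. With three or more levels this equivalence no longer holds. For the ordinal case in note 2, R also provides `ordered()`, which by default uses polynomial contrasts rather than treatment contrasts.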

Ryan Volpi
  • is not true. It will make a difference if you use integers rather than factors. If the features are categorical, they should be converted to factors no matter how many levels there are. If you leave them as integers, then their numerical values take on meaning which you don't want. – mlofton Jun 02 '20 at 03:22
  • My mistake, @mlofton is right in the sense that it will affect the coefficient estimates, and would affect interpretability. However, I maintain that using other numbers will not change the predictions and therefore will not have a performance impact. – Ryan Volpi Jun 02 '20 at 03:27
  • Hi both, I think that was my concern, @mlofton. If they weren't turned into factors, I was worried that the numeric data would cause problems. So the rule of thumb is to always factor, regardless (and order if necessary)? – Brennan Beal Jun 02 '20 at 15:48
  • @Brennan Beal: Unfortunately, I don't have time to look at the answer below at the moment, so maybe Ryan can help you with that. But, generally speaking, if you want to think of your variables as categories, in the sense that the numbers have no meaning other than to differentiate between the categories, then definitely code them as factors. OTOH, if you think the numbers themselves mean something (for example, temperature or rainfall levels), then don't make them factors. I hope that helps a little. – mlofton Jun 03 '20 at 05:26
  • @Ryan Volpi: Thanks for confirming. Note that I think you are correct that if you just have one variable with two levels and no interactions with other variables, then you can get away with coding the zero and one as numerical. But that confuses the issue IMHO, so it's better, at least for a beginner, to code the 0 and 1 as factors even in that case. Also, I'm not clear on what won't change predictions and won't have a performance impact but, if you don't think it's that important, then don't worry about it. Thanks again. – mlofton Jun 03 '20 at 05:30
  • @Brennan Beal: I was mistaken: what is below is Sextus Empiricus's comment-answer rather than your follow-up question, so hopefully that is satisfactory. It definitely looks nice and thorough at a glance. Sextus Empiricus: Thanks for the nice answer. – mlofton Jun 03 '20 at 05:33
  • @RyanVolpi and all: Thanks for your responses! Sincerely appreciate the time. – Brennan Beal Jun 03 '20 at 16:29