I just realized I have always worked on regression problems where the independent variables were numerical. Can I use linear regression in the case where all independent variables are categorical?

famargar

1 Answer

First, some semantics, just to be clear:

  • dependent variable == outcome == "$y$" in regression formulas such as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$
  • independent variable == predictor == one of the "$x_k$" in the same formulas

So in most situations the type of regression depends on the type of the dependent (outcome, "$y$") variable. For example, linear regression is used when the dependent variable is continuous, logistic regression when it is categorical with 2 categories, and multinomial regression when it is categorical with more than 2 categories. The predictors can be anything (nominal or ordinal categorical, continuous, or a mix).
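A minimal sketch of this mapping in Python with scikit-learn (the data is synthetic and all names are illustrative):

```python
# Sketch (synthetic data): the model family follows the type of the
# dependent variable, not the type of the predictors.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # predictors of any (encoded) type

y_cont = X @ np.array([1.5, -2.0]) + rng.normal(size=100)
linreg = LinearRegression().fit(X, y_cont)          # continuous y -> linear regression

y_bin = (y_cont > 0).astype(int)
logreg = LogisticRegression().fit(X, y_bin)         # 2 categories -> logistic regression

y_multi = np.digitize(y_cont, [-2.0, 2.0])          # bin into 3 ordered categories
mnreg = LogisticRegression().fit(X, y_multi)        # >2 categories -> multinomial logistic
```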

(The remark below might be redundant for you, but I'll add it anyway.)

However, do note that most software requires you to recode categorical predictors into binary numeric variables. This just means coding sex as 0 for females and 1 for males, or vice versa. Categorical variables with more than 2 levels need to be recoded into $L-1$ dummy variables, where $L$ is the number of levels; each dummy contains a 1 when the observation is in the corresponding category and a 0 otherwise. This way each individual (sample) is represented by a 1 for the dummy variable of the category he/she belongs to and a 0 for the others, or a 0 for all dummies when he/she is part of the reference group.
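For instance, this $L-1$ dummy coding can be done with pandas' `get_dummies` (the data frame, column names, and categories here are made up for illustration):

```python
# Sketch (made-up data): dummy-code categorical predictors, then fit a
# linear regression on a continuous outcome.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "sex":    ["female", "male", "female", "male"],
    "colour": ["blue", "red", "yellow", "blue"],   # L = 3 levels
    "price":  [10.0, 12.0, 15.0, 11.0],
})

# drop_first=True keeps L-1 dummies per variable; the dropped (first)
# level becomes the reference group.
X = pd.get_dummies(df[["sex", "colour"]], drop_first=True).astype(float)
model = LinearRegression().fit(X, df["price"])

print(list(X.columns))  # ['sex_male', 'colour_red', 'colour_yellow']
```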

IWS
  • Thanks. As I write in the question title, the dependent variable is continuous. So I take your answer as "you can use linear regression, provided you do dummy encoding". Please correct me if I am wrong. – famargar Mar 13 '17 at 09:29
  • Yes, that is what I was saying. – IWS Mar 13 '17 at 09:37
  • Thanks! This is brilliant. However, I am still puzzled that in this case the values I would predict are discrete, while my dependent variable is actually continuous. Is there any way to "smooth" my predictions? – famargar Mar 13 '17 at 09:39
  • I see you have edited the question to add a second question, and posted a similar question here: http://stats.stackexchange.com/questions/267137/achieve-continuous-predictions-in-linear-regression-with-all-categorical-indepen . Additionally, I'd ask you what you mean by smoothing your predictions, or by predicting discrete values. AFAIK a linear regression will give you the mean value of the continuous dependent based on your predictor variables (through the regression formula). Please elaborate. – IWS Mar 13 '17 at 10:08
  • I deleted the second question as you fully answered the original one. To answer your question: if I feed $n$ new "events" ($x_i$) to the model, I would get $n$ different $y$ values that would all take one of four regressed values. I guess I am saying that if the categorical variables were actually ordinal, I would like to introduce some (logit?) smoothing between values. – famargar Mar 13 '17 at 12:44
  • What is the 'smoothing' for? Just for visually interpreting the output as a (continuous) curve? If there are $K$ binary categorical inputs, there will only be $2^K$ discrete output points. Likewise, if they were ordinals, you would multiply together the numbers of levels. – smci Mar 13 '17 at 12:47
  • In the case of an ordinal variable one can always choose to assume it is "continuous enough" to use it as if it were a continuous predictor (by entering the variable as a numerical version rather than as dummies). However, if you do this and you have only a few levels, you are fitting a straight line (thus assuming linearity) through only a few points, so the number of levels matters here. A Likert scale is a good example of a variable used this way, which regrettably creates problems on various occasions. – IWS Mar 13 '17 at 12:50
  • @IWS What happens if I have a categorical independent variable with which the dependent variable does not vary monotonically? For example, to predict the price of a toy, one of the features is colour (which can take the values blue, red, or yellow). If blue = 0, red = 1, and yellow = 2, but the price of a blue toy is higher than that of an equivalent red toy, and the price of a yellow toy is higher than that of an equivalent blue (and red) toy. Can I still use colour as a feature? – rahs Oct 27 '18 at 23:40
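The discreteness discussed in the comments above can be checked directly: with $K$ binary dummies, a linear model can emit at most $2^K$ distinct fitted values. A minimal sketch with synthetic data (all names and coefficients are illustrative):

```python
# Sketch (synthetic data): enumerate every combination of K binary dummy
# predictors and count the distinct predictions a fitted linear model gives.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
K = 3
X = rng.integers(0, 2, size=(200, K)).astype(float)   # K binary (dummy-coded) predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# All possible predictor combinations: 2**K rows.
grid = np.array(list(itertools.product([0.0, 1.0], repeat=K)))
n_distinct = len(np.unique(np.round(model.predict(grid), 6)))
print(n_distinct)  # at most 2**K = 8 distinct fitted values
```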