
I am trying to run a simple regression using a categorical predictor for the first time. I have the following model: lm(income ~ gender, data = df)

I have two questions:

  1. What are the assumptions that need to be met? I know when the predictor is continuous there are assumptions of linearity, normality of residuals, homoscedasticity and independence of residual error terms. Some of these don't apply to categorical data, so what test should I be running? (I am using R)

  2. What is an alternative in the event that these assumptions don't hold?

Rnovice
  • If gender has two levels, this model is equivalent to a pooled-variance (Student's) t-test, not Welch's test. If the categorical variable has more than two levels, it is equivalent to a one-way ANOVA. The assumptions you listed all apply (except linearity). There are multiple alternatives; one example would be a permutation test. – COOLSerdash Aug 25 '22 at 07:39
  • You essentially want to compare two groups. You don't need regression for that. Most likely it's not appropriate for your data -- and neither is a t-test -- because income tends to be highly skewed. The Wilcoxon rank-sum (Mann-Whitney) test is an alternative to a permutation test. – dipetkov Aug 25 '22 at 07:43
  • @dipetkov A common workaround for the skew (typically a tail with a Pareto distribution) is to consider the logarithm of income. In this case, a Welch test should be reasonable. – cdalitz Aug 25 '22 at 07:57
  • The assumptions under which the hypothesis tests are derived don't distinguish the form of the predictors; i.e. they are the same for categorical predictors as for anything else -- but linearity is automatically satisfied for binary indicators. Normality is not usually particularly critical for approximate correctness of significance levels as long as sample sizes are not small and skewness / heavy-tailedness are not very strong. However, I'd support the prior comments in relation to dealing with income. Taking logs or using a model with a log link might work better. – Glen_b Aug 25 '22 at 07:57
  • @Glen_b Having seen data on income, I don't expect it to be under the category "not small and skewness / heavy-tailedness are not very strong". – dipetkov Aug 25 '22 at 07:58
  • Neither do I. . . – Glen_b Aug 25 '22 at 07:58
  • @Glen_b I wrote the comment taking into account the specific context: comparing income for men and women. Of course, more general statements are interesting too. – dipetkov Aug 25 '22 at 08:01
  • @cdalitz Why did you tag me? What's going on with you and Glen_b? I wrote a comment to suggest one option for making the comparison and to point out a common feature of income that makes it trickier to analyze than sticking it into lm(income ~ gender). I never implied or said that there are no other options!! – dipetkov Aug 25 '22 at 08:10
  • @dipetkov Sorry if you misinterpreted my comment as criticism. I merely tagged you because there is no other way to reply to another comment. My comment was simply meant as an addition. Concerning the question about "me and Glen_b": I honestly do not understand your question. – cdalitz Aug 25 '22 at 08:14
  • @cdalitz It's helpful to point the OP in the direction of other ways to analyze skewed data. I'm still not sure it was necessary to tag me; I didn't like being tagged twice in quick succession to read something that I already knew. – dipetkov Aug 25 '22 at 09:12
  • Why do you think those usual assumptions don’t apply for categorical features? – Dave Aug 25 '22 at 10:36
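
The suggestions in the comments above can be tried out with a minimal R sketch. The original df is not shown, so the data here are simulated log-normal incomes; the group labels, sample size, and parameter values are assumptions made only for illustration. The Gamma family in the last step is likewise just one possible choice of "a model with a log link", not something the comments specify.

```r
# Simulated stand-in for the asker's df (all values are illustrative assumptions).
set.seed(1)
n <- 200
df <- data.frame(
  gender = factor(rep(c("F", "M"), each = n)),
  income = c(rlnorm(n, meanlog = 10.0, sdlog = 0.8),
             rlnorm(n, meanlog = 10.2, sdlog = 0.8))
)

# 1. With a two-level factor, lm() reproduces the pooled-variance t-test.
fit  <- lm(income ~ gender, data = df)
tt   <- t.test(income ~ gender, data = df, var.equal = TRUE)
p_lm <- summary(fit)$coefficients["genderM", "Pr(>|t|)"]
same_p <- isTRUE(all.equal(p_lm, tt$p.value))  # TRUE if the two p-values agree

# 2. Checks on the residuals: spread per group and, when run interactively,
#    the residuals-vs-fitted and normal Q-Q plots.
resid_sd <- tapply(residuals(fit), df$gender, sd)
fligner  <- fligner.test(income ~ gender, data = df)  # equal-variance check
if (interactive()) plot(fit, which = 1:2)

# 3a. Welch test on log income (t.test() defaults to var.equal = FALSE).
welch_log <- t.test(log(income) ~ gender, data = df)

# 3b. Wilcoxon rank-sum (Mann-Whitney) test for two independent groups.
ranksum <- wilcox.test(income ~ gender, data = df)

# 3c. Permutation test on the difference in group means:
#     shuffle the labels, recompute the difference, compare with the observed one.
obs  <- diff(tapply(df$income, df$gender, mean))
perm <- replicate(2000, diff(tapply(df$income, sample(df$gender), mean)))
p_perm <- (1 + sum(abs(perm) >= abs(obs))) / (length(perm) + 1)

# 3d. A model with a log link; the Gamma family is an assumption of this sketch.
fit_glm <- glm(income ~ gender, family = Gamma(link = "log"), data = df)
```

Printing `tt`, `welch_log`, `ranksum`, `p_perm`, and `summary(fit_glm)` shows how the approaches compare on the same data. Note that they do not all test the same hypothesis: the t-test and the permutation test above compare means, the log-scale Welch test compares means of log income, and the rank-sum test compares the distributions more generally.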