
Is it possible to do simple linear regression with non-normally distributed data? In my data set ($n=25$) some IVs have a normal distribution and some IVs do not.

I want to know whether I can do simple linear regression, with one X variable and one Y variable, if the data itself is non-linear (both X and Y), and then interpret the final $R^2$ value and p-value.

  • 3
    You express common misconceptions. The distribution of the dependent variable is not important for regression, neither is the distribution of independent variables (except that these could result in influential values). What is important is the distribution of the residuals. That is the distribution assumed to be a normal distribution in deriving statistics for ordinary least squares regression models. – Roland Oct 12 '21 at 07:01
  • Hi Roland, thank you, now I understand. It helps me to do simple linear regression. But if the distribution of my residuals is not normal, is it possible to interpret the p-value and $R^2$? – RoshelPanther Oct 12 '21 at 07:10
  • 1
    You can always interpret the Pearson correlation coefficient. You should not interpret p-values if assumptions are violated. I would suggest to fix the model to a model with different assumptions. That could mean simply transforming the DV, using a generalized linear model, ... – Roland Oct 12 '21 at 07:21
  • Hi Roland, thank you very much for your important explanation. – RoshelPanther Oct 12 '21 at 07:37
  • 2
    Can you clarify what you mean by "the data itself is non-linear (both X and Y)" ? – Glen_b Oct 12 '21 at 13:59
  • You can do all kinds of things. Whether this is any good is a different matter. – Christian Hennig Apr 07 '23 at 10:03
  • A simple counter-example to the myth that predictors (here called IVs) must have normal distributions is the use of binary predictors with values, say, 0 or 1, often called dummy or indicator variables. If normality were needed, using such predictors would be utterly wrong, but every decent regression text or course explains them as a highly valuable device. – Nick Cox Apr 07 '23 at 10:31

1 Answer


$R^2$ does not care about your errors being normal. If you do an OLS regression with an intercept, then $R^2$ has the interpretation of being the proportion of variance explained.
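This identity can be checked directly. A minimal sketch in Python with made-up data, computing the OLS fit by hand and confirming that, with an intercept, $R^2 = 1 - SS_{res}/SS_{tot}$ equals the squared Pearson correlation:

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# OLS slope and intercept for simple linear regression.
slope = sxy / sxx
intercept = my - slope * mx

fitted = [intercept + slope * x for x in xs]
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ss_tot = syy
r2 = 1 - ss_res / ss_tot  # proportion of variance explained

# With an intercept, R^2 equals the squared Pearson correlation.
r = sxy / (sxx * syy) ** 0.5
assert abs(r2 - r * r) < 1e-12
```

Nothing in this computation requires the errors to be normal; $R^2$ is a purely descriptive decomposition of the sample variance.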

The p-values do, however, care about the errors being normal. Typically, such as in the summary of an lm linear model in R, the reported p-values are derived under the assumption that minimizing square loss is equivalent to maximum likelihood estimation of the parameters, which holds when the error terms are normal. One nice feature of the t-tests that generate these p-values is that they are fairly robust to deviations from normality, particularly as the sample size gets large. However, with only $25$ observations, that might not be enough to appeal to this kind of asymptotic argument, particularly if the residuals show a considerable departure from a normal-looking shape.

There are limits to this robustness.
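To make concrete what the p-value depends on: a hedged sketch (made-up data again) of the t statistic that software such as R's `summary(lm(...))` reports for the slope. The p-value compares this statistic to a t distribution with $n-2$ degrees of freedom, a reference distribution derived under normal errors:

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = my - slope * mx
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Residual variance estimate with n - 2 degrees of freedom,
# and the standard error of the slope.
sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)
se_slope = (sigma2_hat / sxx) ** 0.5

# Under normal errors, t_stat follows a t distribution with n - 2 df
# when the true slope is zero; that is where the p-value comes from.
t_stat = slope / se_slope
```

When the residuals are far from normal and $n$ is small, the sampling distribution of `t_stat` need not be close to that reference t distribution, which is exactly why the reported p-value becomes unreliable.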

Dave
  • The normal distributions mentioned in the question concern the independent variables. Indeed, it also mentions a non-linear relationship, indicating that the marginal response has some different distribution and, moreover, giving no information at all about the conditional response (the "errors"). Considering (and dealing with) all those facts would seem preliminary to any consideration of $R^2.$ – whuber Apr 07 '23 at 12:48