
Suppose we take the classical linear regression model:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Over the years, I have heard so many people say that such an interpretation can be drawn from this model:

  • On average, a one unit increase in $x_i$ "causes" a $\beta_1$ unit increase in $y_i$

However, we are also told that this model cannot, by itself, imply this type of "causation". In fact, there is a whole subfield of statistics devoted to causality, called "Causal Inference" (https://en.wikipedia.org/wiki/Causal_inference).

In general, is there any language that can be used to interpret this kind of model without suggesting any misleading claims about causality?
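To make the concern concrete, here is a minimal NumPy sketch (not part of the original thread; the data-generating process is invented for illustration): an unobserved confounder $z$ drives both $x$ and $y$, so regressing $y$ on $x$ produces a large, "significant" slope even though $x$ has no causal effect on $y$ at all.

```python
# Hypothetical example: x has NO causal effect on y, yet OLS finds a slope.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)              # unobserved confounder
x = z + rng.normal(size=n)          # x is driven by z (not by y)
y = 2.0 * z + rng.normal(size=n)    # y is driven by z (not by x)

# OLS slope of y on x: cov(x, y) / var(x)
beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# beta1 is close to 1.0 here, despite zero causal effect of x on y
```

The fitted slope faithfully describes the *association* in the sample; it is the causal reading that fails.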

stats_noob
  • I have rarely heard anyone use the interpretation you gave. In my experience, people are extremely hesitant to use causal language directly, though their words may imply it. – Noah May 09 '23 at 05:05
  • @Stats_noob See here: https://stats.stackexchange.com/questions/493211/under-which-assumptions-a-regression-can-be-interpreted-causally/493905#493905 – markowitz May 09 '23 at 06:47
  • A naive reading of the algebra would say that if $x_i$ increases by one unit, $y_i$ increases by $\beta_1$ plus some term such as $\Delta\epsilon_i$, so the increase would be real. The whole problem is in the terminology, the word "causes": if $x_i$ is a voltage controlled by a knob I can turn and $y_i$ a voltage output, "causes" would seem fair and appropriate, but we would have to know the context of the equation, the relationship between $y_i$ and $x_i$, and whether $\epsilon$ is negligible (I assume $\epsilon$ is small, as in a calculus class). – JosephDoggie May 10 '23 at 15:40
  • Wait, what? In what context can the result from a regression not be interpreted causally? The RHS is non-stochastic and the LHS is stochastic. – user603 May 11 '23 at 10:21
  • @user603 You can regress anything against anything; it doesn't mean that X causes Y. The classic example is ice cream consumption being highly correlated with thefts, but you find spurious correlations everywhere. My favourite is this cabbage consumption vs. COVID mortality plot: https://www.medrxiv.org/content/medrxiv/early/2020/07/17/2020.07.17.20155846/F2.large.jpg?width=800&height=600&carousel=1 – mkt May 12 '23 at 10:29

2 Answers


On average, a one unit increase in $x_i$ ~~causes~~ is associated with an increase in $y_i$ of $\beta_1$ units.

mkt
  • "Association" implies uncertainty, at least to my ear. But you already use "on average" to imply it, so I feel one of these terms is redundant. I'd either simply say "a one unit increase in $x_i$ is associated with an increase of $\beta_1$ units in $y_i$" or, to make the uncertainty more explicit, "on average, a one unit increase in $x_i$ corresponds to an increase of $\beta_1$ units in $y_i$." – Igor F. May 10 '23 at 07:13
  • Since regressing $x_i$ on $y_i$ would usually give a coefficient smaller in magnitude than $1/\beta_1$, you might move the average so "a one unit increase in $x_i$ corresponds to an average increase of $\beta_1$ units in $y_i$." – Henry May 10 '23 at 09:24
  • I like both answers to this question, but this one is better as a heuristic given its parsimonious wording. I do, however, agree that Igor's wording is perhaps better to avoid assumptions about this answer. – Shawn Hemelstrand May 10 '23 at 16:47
  • @IgorF. Would you prefer "correlated" rather than "associated", with the immediate caveat that "correlation is not causation!"? – pjs May 10 '23 at 20:09
  • A unit increase already implies an increase of $1$. Also, I agree with IgorF and Henry that "corresponds" is a better fit. – Frans Rodenburg May 11 '23 at 09:44
  • Regarding the comment by @IgorF.: the "on average" signifies statistical uncertainty, while "associated with" (or "correlated with") signifies causal uncertainty. They are very different and not at all redundant! – Scriddie May 22 '23 at 13:26

There is a very careful formulation in Gelman, Hill, and Vehtari's *Regression and Other Stories*:

From the data alone, a regression only tells us about comparisons between units, not about changes within units. Thus the most careful interpretation of regression coefficients is in terms of comparisons, for example [...] "Comparing two items $i$ and $j$ that differ by an amount $x$ on predictor $k$ but are identical on all other predictors, the predicted difference $y_i - y_j$ is $\beta_k x$, on average."

This is of course a bit of a mouthful.
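The "comparison" reading can be checked numerically. Below is a small NumPy sketch (the data and coefficients are invented for illustration): after fitting by least squares, the predicted difference between two hypothetical units that differ by one unit on the predictor is exactly the fitted coefficient — a statement about fitted comparisons between units, with no appeal to changing anything within a unit.

```python
# Illustrative fit: predicted difference between units equals the coefficient.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)  # hypothetical data-generating process

# Fit y = beta0 + beta1 * x by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Two hypothetical units, identical except for a one-unit gap on x
pred_diff = (beta0 + beta1 * 2.0) - (beta0 + beta1 * 1.0)
# pred_diff equals beta1 exactly: the intercepts cancel
```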

einar
  • I like this formulation better than the one in mkt's answer. This one steers away from the notion of changes in $x_i$, the use of which tempts the reader to think about it causally (as in $P(\cdot|\text{do}(x))$ rather than $P(\cdot|x)$). – Richard Hardy May 09 '23 at 09:02