
What is the difference between linear regression on y with x and x with y?

Effect of switching response and explanatory variable in simple linear regression

Many good discussions explaining the difference between these two regressions have already been posted on this topic. My question is whether this distinction is useful for solving any real-life problems, or whether it's just an exam question...

  • Its importance goes back to the beginning of regression and the use of the word itself. Galton showed that extreme height in parents is usually only partly passed on to their children (so there is regression to the mean), but he also showed that extreme height in children is usually only partly explained by the heights of their parents. – Henry Oct 09 '22 at 00:18
  • I'm not sure anyone has ever suggested you need to do both. This question seems to be based on some misunderstanding. Since the options differ, that means you need to figure out which version is appropriate for your situation before moving ahead. – gung - Reinstate Monica Oct 09 '22 at 02:11

1 Answer


Some Plots and A Sprinkle of Math

Your question seems to be specifically about why this distinction matters in practice. I finish this answer with the most practical answer to your question, but this is also where data and theory tend to merge. Let's say you have a theory that drinking lots of alcohol makes you sleep more on average. Your theory stipulates that alcohol should be the predictor. It should be apparent that sleeping doesn't make you drink more, but a regression may make it seem so if there is a strong enough correlation between the two variables.

If we think of the simplest algebraic reason why, we have to remember that the estimated outcome is a function of the predictor, or $y=f(x)$. As such, there is a relationship we are trying our best to capture in a regression that predicts y, and we can only do that by specifying a relationship that makes sense. Using R, we can simulate this. Let's say we have a variable x that has a negative parabolic relationship with y (y first increases with x, levels off, then decreases). We can simulate this below:

#### Load Libraries ####
library(ggplot2)
library(dplyr)

#### Create Estimated Y Dependent on X ####
y.hat <- function(x){
  y <- -x^2
  return(y)
}

#### Make Random 1000 Values of X ####
x <- rnorm(n=1000)

#### Convert X Values to Their Y Value ####
y <- y.hat(x)

#### Combine Into a Data Frame ####
df <- data.frame(x, y)

#### Plot ####
df %>%
  ggplot(aes(x, y)) +
  geom_point() +
  labs(x = "Age",
       y = "Memory",
       title = "Conditional Relationship of Age and Memory",
       subtitle = "y = -x^2") +
  geom_smooth(method = "loess") +
  theme_bw() +
  theme(axis.text = element_blank(),
        plot.subtitle = element_text(face = "italic"))

Ignoring the x and y axis labels as well as the actual raw data (I'm being lazy here), we know that theoretically this relationship should exist between age and short-term memory...when we are young our memory is poor, it improves and stabilizes as we approach a certain age, and it becomes poor again as we get old. Fitting a loess regression line here emulates this relationship perfectly:

[Plot: scatter of the simulated data with a loess fit tracing the inverted-parabola relationship between age and memory]

Flipping it would make no sense...while memory is likely predictive of age, you can see here that plotting it makes this relationship visually confusing:

[Plot: the same data with the axes flipped (memory on the horizontal axis, age on the vertical), which is visually confusing]
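For completeness, that flipped plot can be produced by simply swapping the aesthetics. This is a minimal sketch, assuming the same df from above:

#### Flipped Plot (Axes Swapped) ####
df %>%
  ggplot(aes(y, x)) +               # memory now on the x-axis, age on the y-axis
  geom_point() +
  geom_smooth(method = "loess") +   # the smooth can no longer trace a sensible function
  labs(x = "Memory", y = "Age") +
  theme_bw()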

This is additionally why fitting this relationship correctly matters. If we fit a vanilla regression line to this, it would be similarly erroneous, because the relationship between the two is parabolic, not linear as a typical regression line assumes:

[Plot: the same data with a straight regression line, which misses the parabolic shape entirely]
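That straight-line fit can be sketched the same way, again assuming the df from above:

#### Vanilla Straight-Line Fit ####
df %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm") +   # forces a straight line through parabolic data
  labs(x = "Age", y = "Memory") +
  theme_bw()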

The Practical Side

Additionally, if we keep the data the same but run a regression where x is modeled as a parabolic function of y, the fit would almost certainly be wrong.

#### Incorrect Fit to Data ####
false.fit <- lm(x ~ poly(-y^2),
   data = df)
summary(false.fit)

#### Correct Fit to Data ####

true.fit <- lm(y ~ poly(-x^2), data = df)
summary(true.fit)

Check out the comparison between the two fits, one regressing x on a squared y and one regressing y on a squared x...the first is almost not predictive at all, as its R² is nearly zero:

Call:
lm(formula = x ~ poly(-y^2), data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-5.1529 -0.6162  0.0059  0.7233  1.8945

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06526    0.03101   2.104 0.035613 *
poly(-y^2)  -3.76537    0.98074  -3.839 0.000131 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9807 on 998 degrees of freedom
Multiple R-squared:  0.01455,   Adjusted R-squared:  0.01357
F-statistic: 14.74 on 1 and 998 DF,  p-value: 0.0001311

Versus the true fit here, where the R² is exactly 1. R actually gives a warning that the fit is too perfect because...it is:

Call:
lm(formula = y ~ poly(-x^2), data = df)

Residuals:
       Min         1Q     Median         3Q        Max
-1.018e-15 -1.190e-16 -6.500e-17 -1.400e-17  3.391e-14

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -9.784e-01  4.400e-17 -2.224e+16   <2e-16 ***
poly(-x^2)   4.214e+01  1.391e-15  3.029e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.391e-15 on 998 degrees of freedom
Multiple R-squared:      1,   Adjusted R-squared:      1
F-statistic: 9.172e+32 on 1 and 998 DF,  p-value: < 2.2e-16

Warning message:
In summary.lm(true.fit) : essentially perfect fit: summary may be unreliable
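To pull that comparison out directly rather than reading the full summaries, one could extract the R² values from the two fitted objects above:

#### Compare R-Squared of Both Fits ####
summary(false.fit)$r.squared  # near zero: x regressed on a function of y
summary(true.fit)$r.squared   # exactly 1: y regressed on the correct function of x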

You may now be asking yourself why any of this matters in a real world scenario. Going back to the alcohol and sleep example...lets say we found a strong correlation between the two and a regression fit is shown to be significant when alcohol is the dependent variable. Therapists around the country may then incorrectly conclude that they should decrease alcohol consumption by reducing people's sleep rather than reduce alcohol consumption in other ways. This would likely have the opposite effect of helping behavior...as people sleep less they would become more stressed and thus drink more. This could potentially lead to even less sleep on average as well because chaotic sleep mixed with more alcohol would probably lead to compound effects on sleep quality. This may have catastrophic outcomes that lead to injury, spousal abuse, and a litany of other alcohol-related social harms. So from a practical perspective, this actually matters a lot.
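To tie this back to the original question in plain numbers: regressing y on x and x on y produce genuinely different lines unless the correlation is perfect, because the first slope is $r \cdot s_y / s_x$ while the second is $r \cdot s_x / s_y$. Here is a minimal sketch with hypothetical alcohol and sleep variables (the variable names and coefficients are invented for illustration):

#### Two Regressions, Two Different Lines ####
set.seed(123)
alcohol <- rnorm(1000)                  # hypothetical drinks per week (standardized)
sleep   <- 0.5 * alcohol + rnorm(1000)  # sleep partly driven by alcohol, plus noise

b.sleep.on.alcohol <- coef(lm(sleep ~ alcohol))[2]
b.alcohol.on.sleep <- coef(lm(alcohol ~ sleep))[2]

b.sleep.on.alcohol      # about 0.5, the slope we built in
1 / b.alcohol.on.sleep  # about 2.5, not 0.5: inverting the reversed regression does not recover it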

Summary

In summary, this relationship matters for three reasons. First, your predictor should be theoretically valid. Second, your regression results are contingent upon the functional relationship between the two variables. Third, there are real-world consequences at stake that make getting this predictive relationship right important.