Is it invalid to adjust my independent variable based on a regression model fit?

Question

I have a dataset and I want to visualize the relationship between Outcome and Variable1, after adjustment for Variable2 and Variable3. So I've created a linear model in R to obtain fitted values based on Variable2 and Variable3:

model   <- lm(Outcome ~ Variable2 + Variable3, data = dat)
fitted  <- model$fitted.values
average <- mean(dat$Outcome)

Here, fitted holds the model's fitted values, and average is the overall mean of Outcome. I can then create adj.Outcome like so:

adj.Outcome <- dat$Outcome * (average / fitted)

Am I doing something invalid here? This makes sense to me but I cannot find a reference anywhere. I realize the adj.Outcome values don't have a real tangible meaning per se, but I think a plot of adj.Outcome vs. Variable1 would show the relationship I'm interested in.
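
For anyone who wants to reproduce the steps above, here is a minimal sketch on simulated data. The data frame, sample size, and coefficients are made up purely for illustration; only the model formula and the adjustment line mirror what I described.

```r
# Hypothetical data: sample size and coefficients are assumptions for illustration only
set.seed(1)
n <- 200
dat <- data.frame(
    Variable1 = runif(n),
    Variable2 = runif(n),
    Variable3 = runif(n)
)
dat$Outcome <- 10 + 2*dat$Variable1 + 3*dat$Variable2 - dat$Variable3 +
    rnorm(n, sd = 0.5)

# Fit on the adjustment variables only, as described above
model   <- lm(Outcome ~ Variable2 + Variable3, data = dat)
fitted  <- model$fitted.values
average <- mean(dat$Outcome)

# Ratio adjustment: rescale each outcome by average / fitted
adj.Outcome <- dat$Outcome * (average / fitted)

# The plot in question
plot(dat$Variable1, adj.Outcome, xlab = "Variable1", ylab = "adj.Outcome")
```

Because the model includes an intercept, the mean of fitted equals average, so the scaling factor average/fitted is centered around 1.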

Dave · Answer 1 · 2023-04-21T11:47:24.673

In a simple linear regression with just one predictor variable, the relationship is easy to visualize with a scatterplot. With two predictors, the graph has to move to three dimensions, which makes visualization harder, though there are good software packages for 3D plots. With many predictor variables, direct visualization runs into serious problems, so some alternative is needed.

While it might be viable to plot your points in three dimensions to look for patterns and gauge the strength of a relationship, even that is harder to read than the 2D case. A plot that might interest you instead is a scatterplot of the true and predicted outcomes. It exists in just two dimensions, yet the predictions contain information about how all of the predictors work together. Although the correlation between true and predicted values has some serious issues, such as those I discuss here and demonstrate visually here, seeing a strong relationship between the true and predicted values is a good sign that you have captured something about the relationship (subject to all the usual caveats in predictive modeling related to overfitting).

For instance, in the graph below, it is easy to tell which model captures a stronger relationship between the predictor variables and the observed outcomes. If you modify my code, you can do this same kind of plot with hundreds (or more) of predictor variables, and you still would get a visualization in two dimensions.

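library(ggplot2)
set.seed(2023)
N <- 500
x1 <- runif(N)
x2 <- runif(N)
# Outcome depends on both predictors and their interaction (no noise term)
y <- x1 + x2 + 7*x1*x2
# L1 omits the interaction; L2 includes it
L1 <- lm(y ~ x1 + x2)
L2 <- lm(y ~ x1 + x2 + x1:x2)
# One data frame per model: predictions paired with the observed outcomes
d1 <- data.frame(
    predictions = predict(L1),
    observations = y,
    Model = "No Interaction Term"
)
d2 <- data.frame(
    predictions = predict(L2),
    observations = y,
    Model = "With Interaction Term"
)
d <- rbind(d1, d2)
# Observed vs. predicted, colored by model
ggplot(d, aes(x = predictions, y = observations, col = Model)) +
    geom_point()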
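
To illustrate the point about many predictors, here is a rough sketch of the same kind of plot with 100 predictors. The number of predictors, the simulated coefficients, and names such as dat_many and L_many are arbitrary choices for this example, not part of the code above.

```r
library(ggplot2)
set.seed(2023)
N <- 500
p <- 100  # number of predictors; an arbitrary choice for illustration

# Hypothetical design matrix and true coefficients
X <- matrix(runif(N * p), nrow = N)
colnames(X) <- paste0("x", seq_len(p))
beta <- rnorm(p)

# Simulated outcome with unit-variance noise
dat_many <- data.frame(y = as.numeric(X %*% beta) + rnorm(N), X)

# Regress y on all 100 predictors at once
L_many <- lm(y ~ ., data = dat_many)

# The plot still needs only two columns, however many predictors there are
d_many <- data.frame(
    predictions = predict(L_many),
    observations = dat_many$y
)
plot_many <- ggplot(d_many, aes(x = predictions, y = observations)) +
    geom_point()
print(plot_many)
```

The usual overfitting caveat applies with extra force here, since these are in-sample predictions from a model with many parameters.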
