A friend recently told me about a technique for removing the effects of an unwanted feature $x$ from a response variable $y$. He mentioned an example from genetics. By regressing $y$ on $x$ (or on a polynomial in $x$ to capture higher-order structure in the data), the residuals represent the variation unexplained by the "nuisance" predictors. As a simple example, say you wanted to remove the effects of tire pressure and car age (and their interaction) on gas mileage. The residuals soak up all of the information that is left after explaining away tire pressure and car age. It's not hard to see how this kind of approach would be useful for reducing large datasets, like the ones you'd see in genetics, to key features.
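
For concreteness, here is a minimal sketch of that two-step idea on simulated data (the column names and numbers are made up for illustration, not from any real dataset):

```python
# Residualization sketch: regress the outcome on nuisance predictors,
# keep the residuals as the "adjusted" outcome.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tire_pressure": rng.normal(32, 2, n),   # psi (hypothetical)
    "car_age": rng.uniform(1, 15, n),        # years (hypothetical)
})
# Simulated mileage that depends on the nuisance variables plus noise.
df["mpg"] = (40 - 0.3 * df["car_age"]
             - 0.2 * (32 - df["tire_pressure"])
             + rng.normal(0, 1, n))

# Step 1: regress the outcome on the nuisance predictors and their interaction.
nuisance_fit = smf.ols("mpg ~ tire_pressure * car_age", data=df).fit()

# Step 2: the residuals carry the variation in mpg left over after
# "explaining away" tire pressure and car age.
mpg_adjusted = nuisance_fit.resid
```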

This approach flips conventional wisdom about regression on its head. I was wondering: is this a common practice? Are there any examples where it has been applied or, better yet, any theory on the subject? From what I can tell, it seems somewhat unique to genetic data (since noisiness in the observations makes fitting a full linear model difficult).

jluchman
  • See https://en.wikipedia.org/wiki/Partial_correlation. – user2974951 Apr 19 '22 at 06:45
  • But to comment, this is exactly what is done when you include multiple variables in a linear model. You do not need to perform a "pre-processing" step with some variables and then use the rest; in fact, this may be wrong, as you may end up with biased results. – user2974951 Apr 19 '22 at 06:51
  • Thanks for your response! One issue with genetic data is the dimensionality. Removing features (age, sex, BMI, whatever) may help isolate signal and mitigate the numerical-instability issues that arise when the design matrix gets too big. Also, we may not care about the regression coefficients, but regressing out certain factors can help us identify the conditional relationships in subsequent analyses (which may not be another regression). – Mete Yuksel Apr 19 '22 at 06:56

1 Answer

As @user2974951 says in a comment, this is just what you do with multiple regression to control for other variables. This two-step approach doesn't really "remove features" in a way that avoids the dangers of overfitting from a "too big" design matrix and its low case/predictor ratio.
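
To make that concrete, here is a small numerical sketch on simulated data (my own construction, not from the question). Residualizing only the outcome generally does not reproduce the multiple-regression coefficient, whereas the Frisch-Waugh-Lovell route of residualizing both the outcome and the predictor of interest does:

```python
# Compare: one-step multiple regression vs. residualizing only the outcome
# vs. residualizing both the outcome and the predictor (Frisch-Waugh-Lovell).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)                 # nuisance variable
x = 0.8 * z + rng.normal(size=n)       # predictor of interest, correlated with z
y = 2.0 * x + 1.0 * z + rng.normal(size=n)

# One-step: multiple regression of y on x and z.
full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print("multiple regression coef for x:", full.params[1])   # ~2.0

# Two-step, outcome only: residualize y on z, then regress the residuals on x.
y_resid = sm.OLS(y, sm.add_constant(z)).fit().resid
two_step = sm.OLS(y_resid, sm.add_constant(x)).fit()
print("residualized-y coef for x:", two_step.params[1])    # attenuated, ~1.2 here

# Residualize *both* y and x on z; the coefficient then matches the
# multiple-regression coefficient exactly.
x_resid = sm.OLS(x, sm.add_constant(z)).fit().resid
fwl = sm.OLS(y_resid, sm.add_constant(x_resid)).fit()
print("residuals-on-residuals coef for x:", fwl.params[1])
```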

Once you start using outcomes in your analysis, as this approach does, you have to take that use of the outcomes into consideration in downstream analysis or you risk overfitting. Frank Harrell's course notes and book go into these issues in some detail, in particular in Chapter 4 and in several detailed examples. You don't win anything with the two-step approach.

In the regression context, penalizing the coefficients of the "nuisance features" while keeping the main factor of interest unpenalized is a principled way to proceed, controlling for those features while minimizing overfitting. This paper describes and illustrates the approach.
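
As a rough illustration of the idea (not the specific method of the cited paper), one can write a generalized ridge fit in closed form, in which the nuisance coefficients carry a positive penalty while the factor of interest is left unpenalized; all names and numbers below are made up:

```python
# Generalized ridge: per-coefficient penalties, zero penalty on the
# intercept and on the factor of interest.
import numpy as np

rng = np.random.default_rng(2)
n, p_nuisance = 100, 20
x_main = rng.normal(size=(n, 1))                  # factor of interest
X_nuis = rng.normal(size=(n, p_nuisance))         # nuisance features
y = (1.5 * x_main[:, 0]
     + X_nuis @ rng.normal(scale=0.2, size=p_nuisance)
     + rng.normal(size=n))

X = np.column_stack([np.ones(n), x_main, X_nuis])  # intercept + main + nuisance

# Penalty of 0 for the intercept and the main factor, lambda > 0 for the rest.
lam = 5.0
penalties = np.concatenate([[0.0, 0.0], np.full(p_nuisance, lam)])
beta = np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)

print("estimated coefficient for the factor of interest:", beta[1])
```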

How to proceed with a non-regression downstream analysis would depend on the nature of that analysis. As @user2974951 also says, any pre-processing via linear regression could lead to problems in downstream analysis (in particular if the linear model isn't properly specified). At the least, to document that you aren't overfitting, you would need to repeat the entire modeling process (including that pre-processing to remove nuisance predictors) on multiple bootstrap samples of your data set.
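
A skeleton of what that bootstrap check might look like, with simulated stand-ins for the real data, is sketched below; the point is only that the nuisance-adjustment step sits inside the resampling loop, so the whole pipeline is repeated on every bootstrap sample:

```python
# Bootstrap the entire pipeline: nuisance residualization + downstream fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300
X_nuisance = rng.normal(size=(n, 3))
x_of_interest = rng.normal(size=n) + X_nuisance[:, 0]
y = 1.0 * x_of_interest + X_nuisance @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

boot_estimates = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)              # resample rows with replacement
    yb, Nb, xb = y[idx], X_nuisance[idx], x_of_interest[idx]

    # Step 1 (inside the loop): residualize the outcome on the nuisance predictors.
    yb_resid = sm.OLS(yb, sm.add_constant(Nb)).fit().resid

    # Step 2 (inside the loop): downstream analysis on the residuals.
    fit = sm.OLS(yb_resid, sm.add_constant(xb)).fit()
    boot_estimates.append(fit.params[1])

print("bootstrap spread of the downstream estimate:", np.std(boot_estimates))
```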

EdM