
Question

Say I have a dataset $D$ with $N$ features used to predict a target $y$. I would like to build a model from $D$, and part of that process is removing correlated columns to reduce redundancy.

If $D$ remains constant, would changing the target $y$ ever change the method I use to check for correlation within the dataset $D$?

Redundancy

For an example of what I mean by redundancy see: https://arxiv.org/abs/1908.05376. I'm not interested in the relevancy part of the paper.

Example

Say I'm using dataset $D$ to train a classification model. As part of preprocessing I check for correlations using method $M$, which could be any type of correlation algorithm, provided $M$ is unsupervised.

I choose one column from each correlated group at random. In other words, I select columns in an unsupervised fashion.
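To make the setup concrete, here is a minimal sketch of this kind of unsupervised redundancy filter in Python. The choice of Pearson correlation as $M$, the 0.9 threshold, and the `drop_redundant_columns` helper are illustrative assumptions, not anything prescribed above; the only point is that $y$ is never consulted.

```python
import numpy as np
import pandas as pd

def drop_redundant_columns(X: pd.DataFrame, threshold: float = 0.9, seed: int = 0) -> pd.DataFrame:
    """Keep one randomly chosen column per group of mutually correlated features.
    The target y is never consulted, so the selection is unsupervised."""
    rng = np.random.default_rng(seed)
    corr = X.corr(method="pearson").abs()  # M: plain Pearson correlation (an assumption)

    # Greedy grouping: start a group at each unassigned column and pull in every
    # remaining column whose absolute correlation with it exceeds the threshold.
    groups, assigned = [], set()
    for col in corr.columns:
        if col in assigned:
            continue
        group = [col] + [c for c in corr.columns
                         if c != col and c not in assigned and corr.loc[col, c] > threshold]
        assigned.update(group)
        groups.append(group)

    # One representative per group, chosen at random.
    keep = [rng.choice(group) for group in groups]
    return X[keep]
```

Swapping the correlation measure (say, Spearman for Pearson) changes $M$, but either way the selection ignores $y$.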

Should I ever change $M$ if I switch from a classification to a regression model, changing $y$ in the process?

Pre-empting XY

This is intended as a general question, which will lead to a specific question. The content of the specific question will depend on the answer to this question. Therefore, I believe it is not XY.

Connor
  • Of course. Consider two circumstances: (1) $y$ is uncorrelated with $D$ and (2) $y$ is a multiple of a single feature. This is discussed at length in threads about principal components (especially those that might reference regression), including https://stats.stackexchange.com/questions/9590, https://stats.stackexchange.com/questions/444545. But your question does sound like an XY problem because you don't explain what your "noise" might be, why it might be a problem, or why you believe the solution is to remove some columns outright. – whuber May 22 '23 at 17:53
  • 1
    @whuber It depends on what they mean by "removing correlated columns", it's unclear. If they want to remove features that are redundant (correlated with each other, but not necessarily correlated with the target), that can be done in total absence of the target and will of course not be affected by the $y$ values. Of course there will be dependence if they are selecting variables with respect to $y$, though. – Nuclear Hoagie May 22 '23 at 18:03
  • @Nuclear The link is profound, as I had hoped was evident by contemplating the two circumstances I mentioned. For a linear model, it comes down to the relationship between $y$ and the space orthogonal to that generated by $D.$ You also implicitly assume the model will be linear in the features -- but many ML models are not. – whuber May 22 '23 at 18:06
  • @whuber I've altered my question to make it clear I'm asking about correlations within $D$, not between $D$ and $y$. Does that change your answer? – Connor May 22 '23 at 18:39
  • @NuclearHoagie Your re-statement of what I meant is accurate. Is my re-wording clearer? – Connor May 22 '23 at 18:41
  • No, it doesn't. The entire point is that for your purposes, the correlations between $D$ and $y$ are what matter; and whether or to what extent correlations among $D$ are relevant depends on the circumstances. – whuber May 22 '23 at 19:30
  • 1
    @whuber I'm confused by your assertion, one can identify and eliminate redundant features even if there is no target variable. Not only do we not care what values $y$ takes, we don't even care if $y$ exists. I agree it's often useful to select features that will be informative for the particular downstream prediction task, but the OP seems to be explicitly selecting features irrespective of how they relate to $y$. – Nuclear Hoagie May 22 '23 at 20:09
  • @Nuclear Is it possible you are interpreting "correlated" in the question as meaning perfectly correlated ($\rho=\pm 1$)? If so, I can understand your remarks; otherwise, not. – whuber May 22 '23 at 21:19
  • @whuber I think you're answering a different question to the one I'm asking! I agree that selection based on the target $y$ would change with $y$. But would the method *used for correlations within the dataset $D$* change? If so, could you provide an example? – Connor May 22 '23 at 21:55
  • 1
    It absolutely would change, because there are many methods and what you would do would depend on what model you are thinking of and what you know about $y.$ Apart from identifying perfect linear relations among $D,$ which is a purely mathematical exercise, the kinds of near-correlations ("multicollinearity") you might look for and address depend on all that context. One difficulty I have is that you seem to be posing a counter-factual situation in which you aim to model $y$ but then tell us to ignore $y.$ At that point it seems pointless to do anything. – whuber May 22 '23 at 22:40
  • @whuber Okay, but how would it change and why? If my dataset $D$ is exactly the same and I want to check how similar two features $f_1$ and $f_2$ are within it, how would the process change from a linear regression to a linear classification task? Why would it have to change? I really don't mind if it does change, but you have said it does without explaining why. – Connor May 23 '23 at 06:24
  • I have explained in many posts here on CV concerning PCA and regression, such as the two I linked in my first comment. – whuber May 23 '23 at 13:49
  • @whuber I tried reading the posts you put in the comments, but they don't seem related to this problem. – Connor May 23 '23 at 19:05

0 Answers