
Let $X_1, \dots, X_n$ be a sequence of i.i.d. random variables. Suppose we are training a machine learning classifier on these variables to predict a target $F$. A typical scenario might be classifying subjects as healthy controls or diagnosed patients based on some clinical data, which we'll assume from here on for simplicity.

A common task for data scientists, I gather, is feature engineering: that is, constructing new features, relevant in some way to the problem at hand, from already existing ones. I have a hypothetical question about this.

Say we hypothesize that there is a difference in a certain trait between subjects, and that $g(X_i)$ is a sensible estimate of that trait. If adding $g(X_i)$ to the model improves classification, are we justified in interpreting this as evidence in favor of our hypothesis?

It certainly seems reasonable to say so. If including the new feature improves classification considerably, there must presumably be some relevant difference across subjects in the trait that the feature represents. If no such difference existed, what could the model learn from the new feature?
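To make the comparison concrete, here is a minimal sketch of the kind of experiment I have in mind, with synthetic data standing in for the clinical measurements. Everything here is made up for illustration: the trait and the engineered feature $g$ are simulated directly (rather than $g$ being computed from $X$) just to guarantee that $g$ really is a noisy estimate of the trait.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Two baseline "clinical" measurements that carry no signal about the label.
X = rng.normal(size=(n, 2))

# A hypothetical trait that genuinely differs between the two groups.
trait = rng.normal(size=n)
y = (trait + 0.3 * rng.normal(size=n) > 0).astype(int)

# g(X_i): an engineered feature that is a noisy estimate of the trait.
g = trait + 0.5 * rng.normal(size=n)

# Compare cross-validated accuracy with and without the engineered feature.
base = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
augmented = cross_val_score(
    LogisticRegression(), np.column_stack([X, g]), y, cv=5
).mean()
print(f"CV accuracy without g: {base:.2f}")
print(f"CV accuracy with g:    {augmented:.2f}")
```

In this simulated setup the improvement is real, because the trait was built into the data by construction; my question is whether the reverse inference, from improvement to trait difference, is warranted.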

I hope this question is not too soft. Thanks in advance.

lafinur
  • Assuming the improved classification is on previously unseen data and the improvement is in some sense "big enough", then it may be reasonable to conclude that the feature improves classification. Inference about relationships is a very different step – Henry Dec 11 '22 at 01:29

1 Answer


Making your question slightly broader, one could ask whether a correlation between some variables means that they are related. There are different answers, depending on what you mean. They are definitely related in the sense of being correlated. That correlation can be spurious, though, and correlation does not imply causation.
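For instance, two variables driven by a common unobserved cause will correlate strongly even though neither influences the other. A toy sketch (all numbers arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=1000)            # unobserved common cause
a = z + 0.5 * rng.normal(size=1000)  # neither a nor b
b = z + 0.5 * rng.normal(size=1000)  # influences the other

r = np.corrcoef(a, b)[0, 1]
print(f"correlation between a and b: {r:.2f}")  # substantial, yet non-causal
```

Either variable would "help predict" the other in a model, yet intervening on one would do nothing to the other.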

Now, correlation measures a fairly simple, linear relationship between two variables. When you enter a variable into a machine learning model, it becomes part of a complicated, non-linear function, so it is hard to interpret exactly what role it plays in the predictions. You could say that the variable somehow helps to predict the other one, but that is a rather vague and circular statement to make.

Usually, it would not be considered evidence, for the reasons above. Another reason is closely related to the last paragraph: the result would depend on your methodology. It can easily be the case that a variable helps to make better predictions when using one algorithm but not another, or when taken together with different variables, or after different feature engineering, or with different hyperparameters, or after a different random initialization, etc. All of these things commonly happen in machine learning.
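A quick sketch of that last point, with made-up data: a feature whose relationship to the label is non-monotonic can sharply improve predictions for a tree while doing essentially nothing for a linear model, so "does this feature help?" has no algorithm-independent answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=(n, 1))
# Label depends on |x|: informative, but invisible to a linear boundary.
y = (np.abs(x[:, 0]) > 1).astype(int)

lin = cross_val_score(LogisticRegression(), x, y, cv=5).mean()
tree = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), x, y, cv=5
).mean()
print(f"logistic regression CV accuracy: {lin:.2f}")  # ~ majority-class rate
print(f"decision tree CV accuracy:       {tree:.2f}")
```

With the linear model alone you might conclude the feature reflects no difference between groups; with the tree you would conclude the opposite, from the very same data.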

Tim
  • Thank you for your answer. Is it fair to conclude, then, that machine learning models have a strictly practical bearing (in the sense that they produce good results), but provide no epistemological or scientific insight into the problems they solve? – lafinur Dec 10 '22 at 16:42
  • Check https://stats.stackexchange.com/q/6/35989 – Tim Dec 10 '22 at 18:53