How to check randomness in a machine learning dataset?

Question

Given a standard machine learning dataset, is it possible to check whether the relationship between the inputs and outputs is random or not? If the relationship between inputs and outputs is random (or nearly random), then we should not expect much performance from any machine learning algorithm. Thanks.

Yes, that question is almost the same as mine. But I'm not sure I understood the answers, so I would like to make my question more specific. Assume two continuous-valued variables x and y. Is it possible to show that the relationship between x and y is random (or almost random)? For example, we can check the correlation between x and y; if there is no correlation, can we say the relationship is random? But is being correlated the opposite of being random? Any ideas?

Possible duplicate of How to know that your machine learning problem is hopeless? — Stephan Kolassa, Feb 05 '19 at 10:42
Unfortunately, unless we have domain knowledge, we can never know for sure. I do not think there is much more to do here than is explained at the proposed duplicate. (Aside: typically people believe there is far less randomness and far more explainability in a dataset than there really is.) — Stephan Kolassa, Feb 05 '19 at 10:43

score 1 · Answer 1 · answered Feb 05 '19 at 11:07

For pairwise linear relationships between a few predictors and a response variable, you can use statistical inference to test whether each predictor is individually related to the response. You can also use model fit measures to quantify how much of the response can be predicted by a model using a set of predictors. The problem is that the more variables you include, the better the fit, but this does not necessarily mean that there is a relationship, because you might be overfitting. In my opinion, such a check must use cross-validation to quantify generalization to unseen data.

It is difficult to distinguish the case where there is no relationship from the case where your model is incorrect (e.g., too simple). The latter would show up in cross-validation as underfitting, i.e., similar (and similarly high) train and test errors. Another useful check is to randomly shuffle the labels and see whether the test error behaves the same as before; if it does, the model has not learned anything beyond chance. But, as I said, this is hard to tell apart from using an incorrect model.
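A minimal sketch of the cross-validation plus label-shuffling check described above, assuming a single predictor and a straight-line model. The helper names (`fit_line`, `cv_mse`, `permutation_p_value`) and the synthetic data are illustrative assumptions, not part of the answer. The idea is to compare the held-out error on the real labels with the held-out errors obtained after shuffling the labels.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k=5):
    """Mean squared error on held-out points, using k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        a, b = fit_line(train_x, train_y)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return total / n

def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Share of label shuffles whose held-out error is as low as the real one."""
    rng = random.Random(seed)
    observed = cv_mse(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cv_mse(xs, shuffled) <= observed:
            hits += 1
    # The +1 terms count the unshuffled data as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data (an assumption for illustration only).
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
y_signal = [2 * x + rng.gauss(0, 0.5) for x in xs]  # y depends on x
y_noise = [rng.gauss(0, 1) for _ in xs]             # y ignores x

mse_signal, p_signal = permutation_p_value(xs, y_signal)
mse_noise, p_noise = permutation_p_value(xs, y_noise)
print("signal:", mse_signal, p_signal)
print("noise: ", mse_noise, p_noise)
```

A small p-value says the model predicts better than it would on shuffled labels. A large one does not prove the relationship is random; it may only mean the straight-line model is the wrong one, which is the caveat the answer raises. If scikit-learn is available, `sklearn.model_selection.permutation_test_score` offers a ready-made version of this kind of test.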

Can you be more specific about what type of statistical inference we can use? Actually, I extended my question above and asked whether we can use correlation. — Sanyo Mn, Feb 06 '19 at 08:31
Sure. In the case of a linear relationship, you want to test whether the coefficient associated with the predictor differs from 0 at a chosen significance level. This can be done with a statistical test called the t-test. More details here. In summary: 1) you compute the coefficient and its standard error, 2) you obtain the t-value, and 3) you compute the associated p-value, i.e. the probability of seeing a t-value at least this extreme if the true coefficient were 0. You can use R as explained here. — gsanroma, Feb 06 '19 at 12:52
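The three steps in the comment above can be sketched in plain Python for a single predictor. This is only an illustration under stated assumptions: the function name `slope_t_test` and the synthetic data are made up here, and the p-value uses a normal approximation to the t-distribution, which is reasonable only for fairly large samples. For an exact p-value, `scipy.stats.linregress` reports one.

```python
import math
import random
from statistics import NormalDist

def slope_t_test(xs, ys):
    """Return (slope, standard error, t-value, approximate two-sided p-value)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

    # Step 1: coefficient and its standard error.
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)

    # Step 2: t-value.
    t_value = slope / std_err

    # Step 3: two-sided p-value under the null hypothesis slope = 0.
    # Assumption: normal approximation instead of a t-distribution with n - 2 df.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_value)))
    return slope, std_err, t_value, p_approx

# Synthetic data (illustrative only): y depends linearly on x plus noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

slope, std_err, t_value, p_approx = slope_t_test(xs, ys)
print(slope, std_err, t_value, p_approx)
```

Note that this only tests for a linear relationship; a non-significant slope does not rule out a nonlinear one.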

el Josso · Answer 2 · 2019-02-06T09:50:56.483

0

The question, as written, is: "Is it possible?" I would say no.

The whole point of the different stages of a machine learning algorithm is to find the relationship between the two sets of data. Once you can explain what kind of relationship there is, you have a model to predict it.

But you can check for some simple correlations.
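As a rough illustration of what such a correlation check can and cannot show, here is a small sketch with made-up data (the function `pearson` and the example variables are assumptions for illustration). It computes the Pearson correlation for a linear relationship and for a quadratic one, where y is completely determined by x.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Evenly spaced x values, symmetric around 0.
xs = [i / 50 for i in range(-50, 51)]

y_linear = [3 * x + 1 for x in xs]   # linear in x
y_quadratic = [x * x for x in xs]    # fully determined by x, but not linearly

r_linear = pearson(xs, y_linear)
r_quadratic = pearson(xs, y_quadratic)
# The same quadratic target correlates perfectly with the feature x**2.
r_squared_feature = pearson([x * x for x in xs], y_quadratic)

print(r_linear, r_quadratic, r_squared_feature)
```

The quadratic case gives a correlation of essentially zero even though there is no randomness at all, so "no correlation" only rules out a linear relationship, not a relationship in general.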

edited Feb 06 '19 at 09:50

answered Feb 06 '19 at 09:42

el Josso
