I’ve worked with businesses where I am provided with a data set that I know carries inadequate signal with respect to the quantity I am trying to predict. No matter how much feature engineering I do to create new variables, or how much I experiment with new models or new model parameters, the performance metrics of my models will always approach a ceiling imposed by the inherent limitations of the variables being used. Hence the question I pose: should one still attempt feature selection, model selection, and hyperparameter optimisation if this constraint exists?
-
https://stats.stackexchange.com/q/222179/121522 – mkt Oct 07 '22 at 07:03
-
The discussion has mostly focussed on a limit to predictive ability. But a model that predicts weakly is not the worst case scenario: bias in data can make model predictions arbitrarily bad. When confronted with bad data, the best decision can sometimes be to avoid fitting a model till better data is made available, and to stick to your priors in the meantime. – mkt Oct 07 '22 at 07:10
-
What is data quality? Does it mean the amount and type of noise? That can be solved by having more data. – Sextus Empiricus Oct 07 '22 at 12:36
-
"whether one should attempt to ... IF there exist this constraint?" What constraint exactly? This is not so clearly described. – Sextus Empiricus Oct 07 '22 at 12:43
3 Answers
It’s a garbage in, garbage out scenario. What machine learning models do is learn to recognize patterns in the data and act on those patterns at prediction time. If you have garbage data, the model will make garbage predictions no matter how sophisticated it is. This is what Andrew Ng means by data-centric AI when he argues that our main concern should be the quality of the data rather than the models. If you know that the quality of the data is low, you should spend most of your time getting better data; working on improving the model is an unlikely cure.
As others have noted in the comments, the above statement may be too strong. Indeed, the usual assumption is that data is noisy, and most models can overcome some degree of noise, mislabeled samples, and so on. We even have specialized models such as the errors-in-variables model. Still, if there are known issues with data quality, the more efficient approach is usually to gather better data (or improve what you have, e.g., by re-labeling it) rather than hoping that the model will overcome the issues by itself.
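As a rough illustration of why tuning cannot substitute for better data, here is a minimal sketch (my own simulation, using scikit-learn; all variable names and numbers are invented): the outcome is driven mostly by a factor the model never sees, so a tuned model plateaus near chance, while adding the missing information helps far more than any tuning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 5000
observed = rng.normal(size=(n, 3))   # the features we are actually given
hidden = rng.normal(size=n)          # the factor that truly drives the outcome
y = (hidden + 0.2 * observed[:, 0] + rng.normal(scale=0.1, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te, h_tr, h_te = train_test_split(
    observed, y, hidden, random_state=0)

# 1) Tune a flexible model on the low-signal features: accuracy plateaus.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [2, 5, None], "n_estimators": [100, 300]},
                      cv=3)
search.fit(X_tr, y_tr)
print("tuned model, weak features:", search.score(X_te, y_te))

# 2) "Get better data": include the hidden driver and refit an untuned model.
better_tr = np.column_stack([X_tr, h_tr])
better_te = np.column_stack([X_te, h_te])
plain = RandomForestClassifier(random_state=0).fit(better_tr, y_tr)
print("default model, better data: ", plain.score(better_te, y_te))
```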
-
I agree. How I explain this to people is that you are training it to find garbage, and thus it will find garbage. – End Anti-Semitic Hate Oct 07 '22 at 12:40
-
What is garbage data? Can you give an example? Is it noisy data? Or data sampled with a bias that doesn't allow you to compute a consistent estimator? In statistics it is common that the knowledge about the population that we can gain from data improves by having more data. The data might be garbage, but as long as there is some connection with the population, then it might be useful. – Sextus Empiricus Oct 07 '22 at 12:41
-
@SextusEmpiricus The term is not precisely defined; usually it's interpreted as data that doesn't actually predict what you're trying to predict, or whose relationship to what is being predicted is not apparent (I usually take it as needing domain knowledge, or the relationship not being strong enough to build a predictor with). – Diego Queiroz Oct 07 '22 at 14:11
-
@DiegoQueiroz given the nature of the question it might be interesting to elaborate a bit more on what it means for data to be low quality data or garbage data. It is obvious that useless data is useless. That is a tautology, saying the same thing in different words. The interesting issues are: When does data become useless or garbage? How much does low quality of data influence the accuracy of the model? – Sextus Empiricus Oct 07 '22 at 14:31
-
3"Inadequate with respect to the signal" does not imply "garbage." A classic example is the use of field screening measurements in geophysical and environmental investigations: a large number of low-precision measurements blanketing a spatial region can deliver superior information about parameters of interest (such as the spatial mean) and often at lower cost. Calling such measurement systems "garbage" just propagates the misconception that measurements need to be highly precise to be useful. – whuber Oct 07 '22 at 15:39
-
@SextusEmpiricus I agree wholeheartedly with you; what I meant is that these questions (what, when, for which values, etc.) don't seem to have a complete or definitive answer. In many problems you'd need a domain specialist to realize that, and in others even a specialist wouldn't be sure. The link that mkt posted under the original question is relevant to what I'm saying: https://stats.stackexchange.com/q/222179/121522 as the same thing said about forecastability can be said about predictability. – Diego Queiroz Oct 07 '22 at 16:57
-
I’ll just add that predictive validity is just one indicator of data “quality”, or more specifically, measurement validity. But the strength/relevance of validity evidence is specific to the field of study, so I doubt we’ll ever get a general consensus on what makes garbage and non-garbage data – Rick Hass Oct 13 '22 at 20:25
-
Also @whuber gives an example of the fact that several low-reliability measures in aggregate can yield high-reliability information, which I always find interesting – Rick Hass Oct 13 '22 at 20:28
-
If you have categorical variables in your dataset which may have incorrect values, you can try to detect these with algorithms that can identify label errors in classification datasets (and then fix those values or omit those examples from your dataset) – Jonas Mueller Jan 10 '23 at 23:54
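A rough sketch of the approach from that last comment, assuming the cleanlab package and its find_label_issues function (the data here is invented; out-of-sample predicted probabilities are obtained via cross-validation):

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic classification data with some deliberately flipped labels.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
y_noisy = y.copy()
flip = np.random.default_rng(0).choice(len(y), size=50, replace=False)
y_noisy[flip] = (y_noisy[flip] + 1) % 3

# Out-of-sample class probabilities for every example.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                               cv=5, method="predict_proba")

# Flag examples whose given label disagrees with the predicted probabilities.
issues = find_label_issues(labels=y_noisy, pred_probs=pred_probs)
print("flagged", issues.sum(), "suspect labels; inspect, relabel, or drop them")
```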
It can be the case that you simply lack the information to make accurate predictions.
An extreme example might be determining the name of a dog’s owner based on a photo of the dog’s tongue. You’re missing critical information from the veterinary records that associate the dog with a human. With such information, you might be able to get the right answer every time.
Consider an outcome that is totally determined by two feature variables (so this outcome is entirely predictable, in some sense) which are independent of each other. If you only have measurements for one of those variables, you’ll never reliably make accurate predictions. Since the two features are independent, you cannot even wrangle information about one out of the other. This would be the low signal-to-noise ratio that you mention.
If your features are related, perhaps you can wrestle with the observed features to glean insight about what the unobserved feature would have been had it been measured, though at the risk of overfitting.
However, if you feed a model garbage data (a tongue picture), you should expect it to output garbage predictions (an inability to predict the owner’s name).
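A small simulation of the two-independent-features case described above (my own sketch, using scikit-learn; all names and numbers are made up): only one of the two drivers of the outcome is observed, so accuracy is capped no matter how much we tune.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # independent of x1, and never observed
y = (x1 + x2 > 0).astype(int)      # outcome is fully determined by (x1, x2)

for name, X in [("x1 only", x1[:, None]), ("x1 and x2", np.column_stack([x1, x2]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy {acc:.2f}")
# With x1 alone, accuracy is capped around 0.75; with both features,
# the same simple model is essentially perfect.
```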
-
An example I loved, in an explanation of overfitting/underfitting: you can train a model to predict a person's credit card number given their phone number... and if it has enough capacity you can train it to 100% accuracy on the training set, and yet its performance on out-of-set data will stubbornly remain ~0. – hobbs Oct 08 '22 at 00:36
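A toy version of that comment’s phone-number example (a hedged sketch with made-up data, using scikit-learn): the only “feature” is a unique ID, so a high-capacity model memorizes the training labels perfectly yet generalizes at chance level.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
ids = rng.permutation(10000).reshape(-1, 1)   # stand-in for phone numbers
labels = rng.integers(0, 10, size=10000)      # stand-in for a credit-card digit

X_tr, X_te, y_tr, y_te = train_test_split(ids, labels, random_state=0)
tree = DecisionTreeClassifier().fit(X_tr, y_tr)   # unlimited depth: can memorize
print("train accuracy:", tree.score(X_tr, y_tr))  # 1.0
print("test accuracy: ", tree.score(X_te, y_te))  # ~0.1, i.e. chance for 10 classes
```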
the performance metrics of my models will always approach a ceiling imposed by the inherent limitations of the variables being used. Hence the question I pose: should one still attempt feature selection, model selection, and hyperparameter optimisation if this constraint exists?
Optimization might still be necessary. Even when the data does not allow any model to predict accurately, it remains the case that some models are better than others.
More interesting is the question of what the nature of the constraint is. Why do you have this constraint? Is it truly a hard limit, or do you simply not have enough data, or are your models not detailed enough?
For example, some variables are just very hard to predict. When I flip a coin or roll a die, I might be able to predict the average outcome, but for a given single instance it is extremely hard to predict the result. This is when nature, whose basic laws are deterministic, appears random to us.
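A hedged sketch of both points (my own simulation, with invented numbers): each outcome behaves like a loaded coin flip, so every model stays far from perfect accuracy, yet model choice still matters.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 50000
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-0.5 * x[:, 0]))   # the feature shifts the odds only slightly
y = rng.binomial(1, p)                 # each single outcome is still a coin flip

X_tr, X_te, y_tr, y_te = train_test_split(x, y, random_state=0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logit = LogisticRegression().fit(X_tr, y_tr)
print("always predict the majority class:", baseline.score(X_te, y_te))
print("model that uses the weak feature :", logit.score(X_te, y_te))
# Both are far below 1.0 (the ceiling set by the data), but the second is better.
```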