Out-of-sample testing simulates how machine learning models are actually used. Think about how Amazon designs Alexa: the goal is to handle future speech, from sentences and speakers she has not necessarily heard before.
The goal of out-of-sample testing is to determine whether your model has overfit the data on which it was trained. In other words, we don't want a model that merely memorizes patterns in the training data or fits coincidences. We want something that will generalize to new data, much like Alexa should.
I do agree that out-of-sample testing is an excellent way to assess performance, since it is about the ultimate test of how well a model generalizes. However, in the regression setting (think OLS), there is adjusted $R^2$, which penalizes a model for having many parameters, the assumption being that throwing all kinds of parameters at a model will end up memorizing the data. Remember, if you have $N$ points in the plane with distinct $x$ coordinates, you can hit all of them with a polynomial of degree $N-1$ (e.g., a parabola passes through any $3$ such points) and achieve perfect accuracy on your training data, but that is little more than playing connect-the-dots.
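For reference, adjusted $R^2$ is $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the sample size and $p$ is the number of predictors. And here is a minimal sketch of the connect-the-dots problem, with simulated data invented purely for illustration: a degree-$(N-1)$ polynomial nails the $N$ training points but falls apart on fresh data, and out-of-sample $R^2$ catches it.

```python
# "Connect the dots" overfitting: with N training points, a degree-(N-1)
# polynomial fits them exactly, yet does poorly on new data.
# The data-generating process and sample sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    x = rng.uniform(-3, 3, n)
    y = 2 * x + rng.normal(scale=1.0, size=n)  # true relationship is linear
    return x, y

x_train, y_train = simulate(10)   # N = 10 training points
x_test, y_test = simulate(1000)   # fresh out-of-sample data

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

for degree in (1, len(x_train) - 1):  # honest model vs. interpolating polynomial
    coefs = np.polyfit(x_train, y_train, degree)
    r2_in = r_squared(y_train, np.polyval(coefs, x_train))
    r2_out = r_squared(y_test, np.polyval(coefs, x_test))
    print(f"degree {degree}: in-sample R^2 = {r2_in:.3f}, out-of-sample R^2 = {r2_out:.3f}")

# The degree-1 fit scores about the same in and out of sample; the degree-9
# interpolation reports (near-)perfect in-sample R^2 but a wildly worse
# out-of-sample R^2, which is exactly what adjusted R^2 tries to guard against.
```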
Biostatistics and epidemiology do have a variant of out-of-sample testing where the model is repeatedly trained on bootstrap samples and then tested on the original data set, if that counts as what you mean (and even such an approach does test the trained model on points on which the model was not fit). Splitting off a designated test set is itself not without its critics, though. Even Harrell has remarked that splitting the data to have a designated test set is reasonable only once the data set gets to be quite large (the number he tends to give, such as in the link, is $20000$ observations).
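If it helps to see the core loop, here is a rough sketch of that bootstrap idea, with made-up data and a plain OLS fit standing in for whatever model you would actually use. (Harrell's full procedure goes further and uses these refits to estimate and subtract the optimism of the apparent performance; this is just the resample-refit-rescore loop.)

```python
# Sketch of bootstrap validation: refit the model on bootstrap resamples
# and score each refit on the original data set. Data and model are placeholders.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 predictors
beta_true = np.array([1.0, 0.5, -0.3, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

apparent = r_squared(y, X @ fit_ols(X, y))  # performance on the data used to fit

boot_scores = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)              # bootstrap resample, with replacement
    beta_b = fit_ols(X[idx], y[idx])              # refit on the resample
    boot_scores.append(r_squared(y, X @ beta_b))  # score on the original data

print(f"apparent R^2: {apparent:.3f}")
print(f"mean R^2 of bootstrap refits on the original data: {np.mean(boot_scores):.3f}")
```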