I was reading the question Training on the full dataset after cross-validation? and found a corresponding literature reference, Mosier (1951, https://doi.org/10.1177/001316445101100101), which claims that good practice is to (potential misquotes are my own error here):
- obtain performance metrics using cross-validation: train a model on one half of the dataset and evaluate its performance on the corresponding held-out half (doing this for both halves $a$ and $b$). This yields two values, $R^2_a$ and $R^2_b$, so that our best guess of the goodness of fit is $R^2_{*}=0.5(R^2_a+R^2_b)$
- use the whole dataset to build a final model (for which we have no held-out test data), so we cannot provide a performance metric right away. However, it is claimed (without proof) that the $R^2_{big}$ of this model (trained on more data) is expected to be higher than $R^2_{*}$ when evaluated on newly gathered data from the same process.
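A minimal code sketch of that double cross-validation scheme as I understand it (the synthetic data and all names are my own illustration, using scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0*X[:, 0] + 1.0 + rng.normal(scale=2.0, size=200)

# Split the data into halves a and b
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Train on one half, evaluate on the other, held-out half (and vice versa)
r2_a = r2_score(y_b, LinearRegression().fit(X_a, y_a).predict(X_b))
r2_b = r2_score(y_a, LinearRegression().fit(X_b, y_b).predict(X_a))
r2_star = 0.5*(r2_a + r2_b)  # best guess of the out-of-sample goodness of fit

# Final model: refit on the whole dataset; no held-out metric is available,
# but its out-of-sample R^2 is claimed to be higher than r2_star on average.
final_model = LinearRegression().fit(X, y)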
His last words are:
> Possibly one of the other speakers knows a proof that the [coefficient of determination] R of the half sample's beta is always lower than of the full sample beta [where beta is the coefficients of a model]
Questions:
- Does somebody have a proof by now (in 2023)? I guess we need to work with expected $R^2$ values here (as was pointed out in the discussion)
- How does the expected $R^2$ of a linear model improve with the sample size $N$ when the model is used to estimate the conditional mean?
I am interested in simulations or a formal proof showing that a model trained on a bigger dataset (here, the combined dataset) has, on average, a higher $R^2_{big}$ on new, unseen data than a model trained on only a fraction (a half) of the dataset.
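A heuristic answer to the second question (a sketch under simplifying assumptions, not a proof): take OLS with $p$ predictors plus an intercept, noise variance $\sigma^2$, training size $N$, and new data drawn from the same process. The estimated coefficients fluctuate around the true ones with variance of order $\sigma^2/N$ (this is where the central-limit/law-of-large-numbers behaviour of the estimator enters), so the expected squared prediction error on new data is roughly
$$E\big[(Y_{new}-\hat{Y}_{new})^2\big] \approx \sigma^2\Big(1+\frac{p+1}{N}\Big).$$
Plugging this into the definition of $R^2$ on new data gives
$$E[R^2_{new}] \approx 1-\frac{\sigma^2\big(1+\frac{p+1}{N}\big)}{\operatorname{Var}(Y)} = R^2_{pop}-(1-R^2_{pop})\,\frac{p+1}{N},$$
where $R^2_{pop}=1-\sigma^2/\operatorname{Var}(Y)$ is the population value. Under these assumptions the expected out-of-sample $R^2$ increases monotonically in $N$ towards $R^2_{pop}$, which would give exactly the ordering $E[R^2_a]<E[R^2_{big}]$ for the half versus the full sample.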
I will try to set up a simulation study:
- drawing 10000 datasets ($Y = aX + b + \epsilon$), each with 100 samples for a maximal train set plus 100 samples which are "new" and held out until the end
- for each dataset, fitting a linear model on (a) a fraction of the train-set samples and (b) all samples of the train set
- evaluating how the average (over all datasets) $R^2$ on the held-out "new" data changes with the number of samples used for training
I expect the simulation to show that the average $R^2$ on the new data increases with the fraction of samples used from the train set (I will post the results in an update), so that $E[R^2_a]<E[R^2_{big}]$. I am wondering how one could explain this result analytically (via the central limit theorem?).
Why does this matter?
In a typical data science problem you train your model on a train/test split and evaluate its performance on the held-out test set. When deploying a model, should you go for the model trained on the smaller train set (for which the test set gives you some clue about its performance), or for the model trained on all the data (for which you have no performance metric at hand)?
Update with simulation result
[Figure: $R^2$ on new (unseen) data, as a function of the fraction of the training set used]
We see that the more data are used for training, the better the $R^2$ on new (unseen) test data. Error bars indicate the standard error of the estimated mean $R^2$ (where the mean is taken over the different regression datasets).
This would imply that a data scientist should always retrain the model on all available data to get optimal performance on new (unseen) data.
Notes:
- A very similar plot is obtained when using regression trees.
- Using a Huber model (which minimizes the Huber loss) still results in a similar plot, even though it is not the mean squared error that is minimized (see the sketch below).
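For reference, the Huber variant only requires swapping the estimator; a minimal self-contained sketch of a single iteration (my own synthetic data; sklearn's HuberRegressor minimizes the Huber loss):

import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
slope = 5.0
X = np.linspace(-slope/2, slope/2, num=200).reshape(-1, 1)
y = slope*X[:, 0] + rng.normal(scale=abs(slope), size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Same step as in the loop below, but minimizing the Huber loss instead of the squared error
model = HuberRegressor().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))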
PS: I observe in the plot that the "signal-to-noise ratio" (in an abstract sense) plays an important role for the achievable average $R^2$ (at the maximal fraction of the training set being used).
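This has a simple explanation: writing $\sigma^2$ for the noise variance, even a perfectly estimated conditional mean cannot, on average, exceed the population value
$$R^2_{\max}=\frac{\operatorname{Var}(\text{signal})}{\operatorname{Var}(\text{signal})+\sigma^2}=\frac{\mathrm{SNR}}{1+\mathrm{SNR}},$$
so the plateau reached when the full training set is used is determined by the signal-to-noise ratio of the generating process rather than by the sample size.
The simulation code: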
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import random
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
coefficients_of_determination = {0.1: [], 0.2: [], 0.3: [], 0.4: [], 0.5: [], 0.6: [], 0.7: [], 0.8: [], 0.9: [], 1.0: []}
for i in range(2000):
    # Generate a synthetic dataset
    n_samples = 200  # these get halved into a "new" test set and a train set (which is used to some degree)
    #X, y = make_regression(n_samples=n_samples, n_features=1, noise=std_err)
    random_slope = np.random.uniform(-20, 20)
    std_err = abs(random_slope)  # noise scale and X range both depend on the random slope, so the signal-to-noise ratio varies between datasets
    X = np.expand_dims(np.linspace(-random_slope/2, random_slope/2, num=n_samples), axis=1)
    y = random_slope*X + np.random.normal(loc=0, scale=std_err, size=(n_samples, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    for train_fraction in coefficients_of_determination.keys():
        # Subsample the train set without replacement
        random_indices = random.sample(range(len(X_train)), int(len(X_train)*train_fraction))
        X_train_fraction = X_train[random_indices]
        y_train_fraction = y_train[random_indices]
        model = LinearRegression()
        #model = DecisionTreeRegressor()
        # Fit the model to the training data
        model.fit(X_train_fraction, y_train_fraction)
        # Make predictions on the test set
        y_pred = model.predict(X_test)
        # Compute the R2 score on the test set
        r2 = r2_score(y_test, y_pred)
        coefficients_of_determination[train_fraction].append(r2)
# Plot the last generated dataset
plt.scatter(X_train, y_train, s=30, edgecolor="k", label="complete train data")
plt.scatter(X_test, y_test, s=30, edgecolor="k", label="test data")
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Synthetic Regression Dataset")
plt.legend()
plt.show()
av_coefficients_of_determination = {}
std_err_coefficients_of_determination ={}
for key in coefficients_of_determination:
    av_coefficients_of_determination[key] = np.average(coefficients_of_determination[key])
    std_err_coefficients_of_determination[key] = np.std(coefficients_of_determination[key], ddof=1)/np.sqrt(len(coefficients_of_determination[key]))
plt.errorbar(list(av_coefficients_of_determination.keys()), list(av_coefficients_of_determination.values()), yerr=list(std_err_coefficients_of_determination.values()))
plt.xlabel("fraction of the train dataset used")
plt.ylabel("R2 on new data");
Remaining questions:
- How can we derive analytically that $R^2$ increases with the number of training samples used?
- How does the $R^2$ on new, unseen data scale with the number of training samples used?
- What are the restrictions? Does it only hold for models which estimate the conditional mean (i.e. minimize the mean squared error)? No, this is not the case: the plot looks very similar for the Huber loss.
- Does it only hold for linear models? No, the increase in $R^2$ on new, unseen data also shows up for decision trees.
- Does it only work with datasets where the underlying generating process is linear? I guess not (a sketch of how one could check this is given below).
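A quick way to check the last point would be to repeat the comparison with a nonlinear generating process and a tree model. A sketch (the sine signal, noise level, and tree depth are arbitrary choices of mine; it only compares training on half versus all of the train set):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
r2_half, r2_full = [], []
for _ in range(500):
    # Nonlinear generating process: y = sin(x) + noise
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

    # Tree trained on half of the train set vs. on all of it
    half = len(X_train)//2
    tree_half = DecisionTreeRegressor(max_depth=4).fit(X_train[:half], y_train[:half])
    tree_full = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

    r2_half.append(r2_score(y_test, tree_half.predict(X_test)))
    r2_full.append(r2_score(y_test, tree_full.predict(X_test)))

# If the conjecture carries over, the second mean should be the larger one
print(np.mean(r2_half), np.mean(r2_full))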