I was reading the question Training on the full dataset after cross-validation? and found a corresponding literature reference, Mosier (1951, https://doi.org/10.1177/001316445101100101), which claims that good practice is to (potential misquotes are my own error here):
- obtain performance metrics using cross-validation: train a model on one half of the dataset and evaluate its performance on the corresponding held-out half (doing this for both halves $a$ and $b$). This yields two values, $R^2_a$ and $R^2_b$, so that our best guess of the goodness of fit is $R^2_{*}=0.5(R^2_a+R^2_b)$
- use the whole dataset to build a final model (for which we have no held-out test data), so we cannot provide a performance metric right away. However, it is claimed (without proof) that the $R^2_{big}$ of this model (trained on more data) is expected to be higher than $R^2_{*}$ when evaluated on newly gathered data from the same process.
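A minimal code sketch of that double cross-validation scheme as I understand it (the synthetic data and all names are my own illustration, using scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0*X[:, 0] + 1.0 + rng.normal(scale=2.0, size=200)

# Split the data into halves a and b
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# Train on one half, evaluate on the other, held-out half (and vice versa)
r2_a = r2_score(y_b, LinearRegression().fit(X_a, y_a).predict(X_b))
r2_b = r2_score(y_a, LinearRegression().fit(X_b, y_b).predict(X_a))
r2_star = 0.5*(r2_a + r2_b)  # best guess of the out-of-sample goodness of fit

# Final model: refit on the whole dataset; no held-out metric is available,
# but its out-of-sample R^2 is claimed to be higher than r2_star on average.
final_model = LinearRegression().fit(X, y)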
His last words are:
> Possibly one of the other speakers knows a proof that the [coefficient of determination] R of the half sample's beta is always lower than of the full sample beta [where beta is the coefficients of a model]
Questions:
- Does somebody have a proof by now (in 2023)? I guess we need to work with expected $R^2$ values here (as was pointed out in the discussion)
- How does the expected $R^2$ of a linear model improve with the sample size $N$ when the model is used to estimate the conditional mean?
I am interested in simulations or a formal proof showing that a model trained on a bigger dataset (here, the combined dataset) has, on average, a higher $R^2_{big}$ on new, unseen data than a model trained on only a fraction (a half) of the dataset.
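A heuristic answer to the second question (a sketch under simplifying assumptions, not a proof): take OLS with $p$ predictors plus an intercept, noise variance $\sigma^2$, training size $N$, and new data drawn from the same process. The estimated coefficients fluctuate around the true ones with variance of order $\sigma^2/N$ (this is where the central-limit/law-of-large-numbers behaviour of the estimator enters), so the expected squared prediction error on new data is roughly
$$E\big[(Y_{new}-\hat{Y}_{new})^2\big] \approx \sigma^2\Big(1+\frac{p+1}{N}\Big).$$
Plugging this into the definition of $R^2$ on new data gives
$$E[R^2_{new}] \approx 1-\frac{\sigma^2\big(1+\frac{p+1}{N}\big)}{\operatorname{Var}(Y)} = R^2_{pop}-(1-R^2_{pop})\,\frac{p+1}{N},$$
where $R^2_{pop}=1-\sigma^2/\operatorname{Var}(Y)$ is the population value. Under these assumptions the expected out-of-sample $R^2$ increases monotonically in $N$ towards $R^2_{pop}$, which would give exactly the ordering $E[R^2_a]<E[R^2_{big}]$ for the half versus the full sample.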
I will try to set up a simulation study:
- drawing 10000 datasets ($Y = aX + b + \epsilon$), each with 100 samples for a maximal train set plus 100 samples which are "new" and held out until the end
- for each dataset, fitting a linear model on (a) a fraction of the train-set samples and (b) all samples of the train set
- evaluating how the average (over all datasets) $R^2$ on the held-out "new" data changes with the number of samples used for training
I expect the simulation to show that the average $R^2$ on the new data increases with the fraction of samples used from the train set (I will post the results in an update), so that $E[R^2_a]<E[R^2_{big}]$. I am wondering how one could explain this result analytically (via the central limit theorem?).
Why does this matter?
In a typical data science problem you train your model on a train/test split and evaluate its performance on the held-out test set. When deploying a model, should you go for the model trained on the smaller train set (for which the test set gives you some clue about its performance), or for the model trained on all the data (for which you have no performance metric at hand)?
Update with simulation result
[Figure: $R^2$ on new (unseen) data, as a function of the fraction of the training set used]
We see that the more data are used for training, the better the $R^2$ on new (unseen) test data. Error bars indicate the standard error of the estimated mean $R^2$ (where the mean is taken over the different regression datasets).
This would imply that a data scientist should always retrain the model on all available data to get optimal performance on new (unseen) data.
Notes:
- A very similar plot is obtained when using regression trees.
- Using a Huber model (which minimizes the Huber loss) still results in a similar plot, even though it is not the mean squared error that is minimized (see the sketch below).
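For reference, the Huber variant only requires swapping the estimator; a minimal self-contained sketch of a single iteration (my own synthetic data; sklearn's HuberRegressor minimizes the Huber loss):

import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
slope = 5.0
X = np.linspace(-slope/2, slope/2, num=200).reshape(-1, 1)
y = slope*X[:, 0] + rng.normal(scale=abs(slope), size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Same step as in the loop below, but minimizing the Huber loss instead of the squared error
model = HuberRegressor().fit(X_train, y_train)
print(r2_score(y_test, model.predict(X_test)))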
PS: I observe in the plot that the "signal-to-noise ratio" (in an abstract sense) plays an important role for the achievable average $R^2$ (at the maximal fraction of the training set being used).
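This has a simple explanation: writing $\sigma^2$ for the noise variance, even a perfectly estimated conditional mean cannot, on average, exceed the population value
$$R^2_{\max}=\frac{\operatorname{Var}(\text{signal})}{\operatorname{Var}(\text{signal})+\sigma^2}=\frac{\mathrm{SNR}}{1+\mathrm{SNR}},$$
so the plateau reached when the full training set is used is determined by the signal-to-noise ratio of the generating process rather than by the sample size.
The simulation code: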
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import random
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
coefficients_of_determination = {0.1: [], 0.2: [], 0.3: [], 0.4: [], 0.5: [], 0.6: [], 0.7: [], 0.8: [], 0.9: [], 1.0: []}
for i in range(2000):
    # Generate a synthetic dataset
    n_samples = 200  # these get halved into a "new" test set and a train set (which is used to some degree)
    #X, y = make_regression(n_samples=n_samples, n_features=1, noise=std_err)
    random_slope = np.random.uniform(-20, 20)
    std_err = abs(random_slope)  # noise scale and X range both depend on the random slope, so the signal-to-noise ratio varies between datasets
    X = np.expand_dims(np.linspace(-random_slope/2, random_slope/2, num=n_samples), axis=1)
    y = random_slope*X + np.random.normal(loc=0, scale=std_err, size=(n_samples, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    for train_fraction in coefficients_of_determination.keys():
        # Subsample the train set without replacement
        random_indices = random.sample(range(len(X_train)), int(len(X_train)*train_fraction))
        X_train_fraction = X_train[random_indices]
        y_train_fraction = y_train[random_indices]
        model = LinearRegression()
        #model = DecisionTreeRegressor()
        # Fit the model to the training data
        model.fit(X_train_fraction, y_train_fraction)
        # Make predictions on the test set
        y_pred = model.predict(X_test)
        # Compute the R2 score on the test set
        r2 = r2_score(y_test, y_pred)
        coefficients_of_determination[train_fraction].append(r2)
# Plot the last generated dataset
plt.scatter(X_train, y_train, s=30, edgecolor="k", label="complete train data")
plt.scatter(X_test, y_test, s=30, edgecolor="k", label="test data")
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.title("Synthetic Regression Dataset")
plt.legend()
plt.show()
av_coefficients_of_determination = {}
std_err_coefficients_of_determination ={}
for key in coefficients_of_determination:
    av_coefficients_of_determination[key] = np.average(coefficients_of_determination[key])
    std_err_coefficients_of_determination[key] = np.std(coefficients_of_determination[key], ddof=1)/np.sqrt(len(coefficients_of_determination[key]))
plt.errorbar(list(av_coefficients_of_determination.keys()), list(av_coefficients_of_determination.values()), yerr=list(std_err_coefficients_of_determination.values()))
plt.xlabel("fraction of the train dataset used")
plt.ylabel("R2 on new data");
Remaining questions:
- How can we derive analytically that $R^2$ increases with the number of training samples used?
- How does the $R^2$ on new, unseen data scale with the number of training samples used?
- What are the restrictions? Does it only hold for models which estimate the conditional mean (i.e. minimize the mean squared error)? No, this is not the case: the plot looks very similar for the Huber loss.
- Does it only hold for linear models? No, the increase in $R^2$ on new, unseen data also shows up for decision trees.
- Does it only work with datasets where the underlying generating process is linear? I guess not (a sketch of how one could check this is given below).
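A quick way to check the last point would be to repeat the comparison with a nonlinear generating process and a tree model. A sketch (the sine signal, noise level, and tree depth are arbitrary choices of mine; it only compares training on half versus all of the train set):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
r2_half, r2_full = [], []
for _ in range(500):
    # Nonlinear generating process: y = sin(x) + noise
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

    # Tree trained on half of the train set vs. on all of it
    half = len(X_train)//2
    tree_half = DecisionTreeRegressor(max_depth=4).fit(X_train[:half], y_train[:half])
    tree_full = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

    r2_half.append(r2_score(y_test, tree_half.predict(X_test)))
    r2_full.append(r2_score(y_test, tree_full.predict(X_test)))

# If the conjecture carries over, the second mean should be the larger one
print(np.mean(r2_half), np.mean(r2_full))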