TLDR: seems mostly fine, especially if you have more test data.
This answer contains empirical experiments similar to my experiment with PCA here. We'll run a few experiments with BERT and GPT-2 on real classification tasks. BERT is trained on unlabeled text using masked language modeling loss (see section 3.1, Task #1, here [1]). GPT-2 is trained on unlabeled text using the usual "causal" loss (see equation (1) here [2]).
Experiment design
Code for this experiment is available here; the procedure is implemented in the function `experiment`, here. Written out, it is:
Split a sample into 3 subsamples:
train: 100 observations for supervised/classification training
test: $n$ ($200$ or $500$) observations which will be used to report accuracy
extra: $n$ observations which are optionally used for unsupervised training.
Compute the following 3 accuracies:
$\text{acc}_{\text{extra}}$: train BERT/GPT-2 on extra features using masked/causal language modeling loss, and then train this model on train features and labels using cross entropy loss by adding a linear layer whose output dimension is the number of classes. (For BERT, the linear layer transforms the [CLS] token embedding. For GPT-2, the linear layer transforms the last token's embedding.) Compute the accuracy of this model on test.
- This score is clearly an unbiased estimator of out-of-sample accuracy because it never trains on data from test.
$\text{acc}_{\text{test}}$: train BERT/GPT-2 on test features using masked/causal language modeling loss, and then train this model on train features and labels using cross entropy loss. Compute the accuracy of this model on test.
- This score represents what you'd see in an ML competition like RAFT. It's unclear whether the score is unbiased, because the model was pretrained and evaluated on the same set of test features. My hypothesis is that it's not unbiased; it's optimistic.
$\text{acc}_{\text{base}}$: don't train BERT/GPT-2 using masked/causal language modeling loss; just train it out-of-the-box using cross entropy loss on train features and labels. Compute the accuracy of this model on test.
- This score is a control. If $\text{E}[\text{acc}_{\text{extra}} - \text{acc}_{\text{base}}] = 0$, then we shouldn't be surprised that there's no effect going from $\text{acc}_{\text{extra}}$ to $\text{acc}_{\text{test}}$. Training on unlabeled text doesn't help for this data.
The 3 accuracy estimators are paired, as the supervised training and test sets are identical. The only difference is the source of unsupervised training data. For $\text{acc}_{\text{extra}}$, the source is independent of the test set texts. For $\text{acc}_{\text{test}}$, the source is exactly the test set texts. An important source of variation in this experiment is the particular subsample/sample splits, so the experiment procedure was repeated on 50 random subsamples per dataset for $n = 200$ and 20 for $n = 500$.
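To make the procedure concrete, here is a minimal sketch of one repetition. The helpers `pretrain_lm`, `finetune_classifier`, and `accuracy` are hypothetical stand-ins for the HuggingFace-based training and evaluation code in the linked repo, and the three subsamples are assumed to be passed in already split; this is not the repo's exact implementation.

```python
# Minimal sketch of one experiment repetition (not the repo's exact code).
# Hypothetical helpers, standing in for the real training/evaluation code:
#   pretrain_lm(lm, texts)                 -> LM further trained with masked/causal loss
#   finetune_classifier(lm, texts, labels) -> LM + linear head trained with cross entropy
#   accuracy(classifier, texts, labels)    -> accuracy on the given texts/labels
from copy import deepcopy

def run_one_repetition(base_lm, train, test, extra):
    # acc_extra: unsupervised training on texts independent of the test set
    lm = pretrain_lm(deepcopy(base_lm), extra.texts)
    clf = finetune_classifier(lm, train.texts, train.labels)
    acc_extra = accuracy(clf, test.texts, test.labels)

    # acc_test: unsupervised training on the test set's own (unlabeled) texts
    lm = pretrain_lm(deepcopy(base_lm), test.texts)
    clf = finetune_classifier(lm, train.texts, train.labels)
    acc_test = accuracy(clf, test.texts, test.labels)

    # acc_base: no unsupervised training; fine-tune the out-of-the-box LM directly
    clf = finetune_classifier(deepcopy(base_lm), train.texts, train.labels)
    acc_base = accuracy(clf, test.texts, test.labels)

    return acc_extra, acc_test, acc_base
```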
If The Elements of Statistical Learning [3] is right—
initial unsupervised screening steps can be done before samples are left out . . . Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.
—then $\text{E}[\text{acc}_{\text{test}} - \text{acc}_{\text{extra}}] = 0$, i.e., there is no overestimation of out-of-sample accuracy despite unsupervised training on test.
Results
The experiment was run on 25 text classification datasets. The dataset inclusion criteria were:
- I'm pretty sure BERT can do better than guessing
- Not a super high number of classes (since the design is limited to a couple hundred observations)
- Texts are not so long that too much useful signal gets truncated to fit in BERT's context window.
Raw accuracy scores are available here.
Analysis
For analyses of individual datasets, see this notebook. For each dataset, that notebook contains a plot like the one below (black line = majority accuracy):

(Note that it's probably ok if accuracies are worse than majority accuracy. The design stratify-samples the 100 classification training observations to ensure every class is represented, and to reduce variance across subsamples. It doesn't stratify-sample the test data. The only statistic we're interested in is the difference between accuracies, which is effectively the difference in model likelihoods when the priors are identical. TLDR: all we care about is the difference between accuracies.)
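For reference, here is a hedged sketch of the stratified train split described in that note, using scikit-learn; `texts` and `labels` are assumed arrays for one dataset, and the repo's actual splitting code may differ.

```python
# Hedged sketch of the stratified train split (assumed arrays `texts`, `labels`).
from sklearn.model_selection import train_test_split

train_texts, rest_texts, train_labels, rest_labels = train_test_split(
    texts,
    labels,
    train_size=100,    # the 100 supervised/classification training observations
    stratify=labels,   # ensure every class is represented in train
    random_state=0,
)
# `rest_*` is then split (without stratification) into the test and extra subsamples.
```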
Visualization of differences
Most of the mass of $\text{acc}_{\text{extra}} - \text{acc}_{\text{base}}$ is usually on the positive side, while $\text{acc}_{\text{test}} - \text{acc}_{\text{extra}}$ is typically centered around 0. An unintended and kinda inevitable source of variance is BERT's training instability.

Model
The model is a multilevel one which stratifies by the type of LM:
$$
\begin{align*}
Y_{ijkl} \sim \text{Binomial}(n, \lambda_{ijkl}) && \text{number of correct predictions} \\
\text{logit}(\lambda_{ijkl}) = \mu + \alpha z_i + U_j + V_{jk} + \beta x_{ijkl} && \text{additive effects} \\
\mu \sim \text{Normal}(0, 1) && \text{prior for intercept} \\
\alpha \sim \text{Normal}(0, 5) && \text{prior for LM type effect} \\
U_j \sim \text{Normal}(0, \sigma_{U}) && \text{effect of dataset} \\
V_{jk} \sim \text{Normal}(0, \sigma_{V}) && \text{(nested) effect of dataset subsample} \\
\beta \sim \text{Normal}(0, 1) && \text{prior for treatment effect} \\
\sigma_{U}, \sigma_{V} \sim \text{HalfNormal}(0, 1) && \text{prior for standard deviations}.
\end{align*}
$$
$n = 200$ or $n = 500$ depending on the dataset of scores you want to analyze. $n = 200$ isn't a strawman; most test splits are in the 100s. Other research has found that the bias from pretraining / unsupervised pre-processing converges to $0$ as $n \rightarrow \infty$ [4] (also seen in my answer for PCA). So we should also try $n = 500$.
$i = 1, 2$ for BERT and GPT-2, respectively.
$z_i = 0$ if $i = 1$ else it's $1$.
$j = 1, 2, \dots, 20$ for the dataset.
$k = 1, 2, \dots, 50$ (or $20$ for $n = 500$) for the subsample of dataset $j$.
$l = 1, 2$ for control and treatment, respectively.
$x_{ijkl} = 0$ if $l = 1$ else it's $1$.
Inference on $\beta$ is performed via MCMC.
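Here is a minimal PyMC sketch of this model, assuming long-format arrays with one row per (LM type, dataset, subsample, arm); the variable names and data layout are my assumptions, not necessarily how the analysis code is organized.

```python
# Minimal PyMC sketch of the multilevel model above (assumed long-format inputs):
#   y         - number of correct test predictions per row
#   z         - 0 for BERT, 1 for GPT-2
#   x         - 0 for control, 1 for treatment
#   dataset   - integer index j of the classification task
#   subsample - integer index of the (dataset, subsample) pair jk
#   n         - test set size (200 or 500)
import pymc as pm

def build_model(y, z, x, dataset, subsample, n):
    with pm.Model() as model:
        mu = pm.Normal("mu", 0, 1)                                  # intercept
        alpha = pm.Normal("alpha", 0, 5)                            # LM type effect
        beta = pm.Normal("beta", 0, 1)                              # treatment effect
        sigma_u = pm.HalfNormal("sigma_u", 1)
        sigma_v = pm.HalfNormal("sigma_v", 1)
        u = pm.Normal("u", 0, sigma_u, shape=dataset.max() + 1)     # dataset effects
        v = pm.Normal("v", 0, sigma_v, shape=subsample.max() + 1)   # nested subsample effects
        eta = mu + alpha * z + u[dataset] + v[subsample] + beta * x
        pm.Binomial("y_obs", n=n, p=pm.math.invlogit(eta), observed=y)
    return model

# Usage:
# with build_model(y, z, x, dataset, subsample, n=200):
#     idata = pm.sample()
```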
Note: I'm still learning how to do this type of analysis. I'd appreciate feedback.
Posterior predictive distributions
Below are the distributions of $\bar{\hat{Y}}_{\cdot \cdot \cdot 1} - \bar{\hat{Y}}_{\cdot \cdot \cdot 0}$, i.e., the difference between the treatment ($1$) and control ($0$) grand means. The mean is taken across LM types, classification tasks, and their subsamples. We could produce conditional plots for each of these groups, but right now I want to summarize the results.

The pretraining boost is akin to $\text{acc}_{\text{extra}} - \text{acc}_{\text{base}}$. The evaluation bias is akin to $\text{acc}_{\text{test}} - \text{acc}_{\text{extra}}$.
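As a sketch, the plotted quantity can be computed from posterior predictive draws of the model above; `model`, `idata`, `x`, and `n` follow the earlier sketch and are assumptions about how the data is laid out.

```python
# Hedged sketch: treatment-vs-control grand-mean difference from posterior
# predictive draws (names follow the build_model sketch above).
import pymc as pm

with model:
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# (chain, draw, observation) -> (sample, observation), then counts -> accuracies
acc = (
    idata.posterior_predictive["y_obs"]
    .stack(sample=("chain", "draw"))
    .transpose("sample", ...)
    .values
    / n
)
# one grand-mean difference per posterior predictive draw
diff = acc[:, x == 1].mean(axis=1) - acc[:, x == 0].mean(axis=1)
```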
There's strong evidence that pretraining helps. Despite this, the evaluation bias is negligible for $n=200$ and essentially zero for $n=500$.
Discussion
Training BERT and GPT-2 on unlabeled test set texts does not result in strongly optimistic estimates of out-of-sample classification performance.
TODO: can the results be explained by the ICM principle? Does the boost in pretraining or the bias from pretraining depend on whether the classification task is causal vs. anti-causal?
Meta-analysis
Above, I said:
An important source of variation in this experiment is the particular subsample/sample splits, so the experiment procedure was repeated on 50 random subsamples per dataset for $n = 200$ and 20 for $n = 500$.
Let's see if that's true. We'll run the analysis code on 500 slices of the $n=500$ dataset of accuracies such that exactly 1 subsample per classification task is included. The sliced version of the data is all you usually get from ML papers and benchmarks. What is the distribution of the posterior mean of $\beta$ for $n=500$?
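Here is a hedged sketch of that loop, assuming a pandas DataFrame `scores` of the $n=500$ results with columns matching the `build_model` sketch above; the column names and fitting details are assumptions, not the repo's exact code.

```python
# Hedged sketch of the meta-analysis loop (assumed DataFrame `scores` with
# columns y, z, x, dataset, subsample, where `subsample` uniquely identifies a
# (dataset, subsample) pair).
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(0)
beta_means = []
for _ in range(500):
    # keep exactly one randomly chosen subsample per classification task
    picks = {d: rng.choice(g["subsample"].unique()) for d, g in scores.groupby("dataset")}
    sliced = scores[scores.apply(lambda r: r["subsample"] == picks[r["dataset"]], axis=1)]
    # re-code indices so the random effects in build_model stay contiguous
    sliced = sliced.assign(
        dataset_idx=pd.factorize(sliced["dataset"])[0],
        subsample_idx=pd.factorize(sliced["subsample"])[0],
    )
    with build_model(
        y=sliced["y"].to_numpy(), z=sliced["z"].to_numpy(), x=sliced["x"].to_numpy(),
        dataset=sliced["dataset_idx"].to_numpy(), subsample=sliced["subsample_idx"].to_numpy(),
        n=500,
    ):
        idata = pm.sample(progressbar=False)
    beta_means.append(float(idata.posterior["beta"].mean()))
# `beta_means` approximates the distribution of the posterior mean of beta
# that non-replicated data would produce.
```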

The distribution of my conclusions from non-replicated data is pretty variable. There's a ~46% chance that I would've found a non-negligible positive or negative bias from pretraining on the test set.
The lesson learned is that this sort of technical replication / repeated subsampling can sometimes be important. (Collecting more classification tasks—the "biological" replicates in this experiment—is still vital of course. It just takes me more time to find and vet them.) Modern LM training procedures are highly unstable—we actually already saw that in the visualization for BERT accuracies. This instability is exacerbated when studying few-shot methods.
Instead of hiding behind random seeds and claims that conclusions are perfectly reproducible, it's almost always useful to expose the variance of a training/analysis procedure. A concrete recommendation for future few-shot learning studies/benchmarks is to include and score a model on multiple, independent subsamples for each task.
References
1. Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
2. Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
3. Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. New York: Springer, 2009.
4. Moscovich, Amit, and Saharon Rosset. "On the cross-validation bias due to unsupervised preprocessing." Journal of the Royal Statistical Society Series B: Statistical Methodology 84.4 (2022): 1474-1502.