There is a theoretical justification for selecting a model with the best test set performance while also guaranteeing something about the selection procedure's performance. You just need a lot of test data and/or a higher tolerance for poorly estimating error. As a reference, see the relevant slides (mainly slide 52) in this deck. If you're familiar w/ statistics, this analysis is similar to applying the Bonferroni correction to confidence intervals.
Selection procedure
- $m$ models are trained and tuned on the training set.
- Each of these models is evaluated on the test set.
- The model with the lowest test set error (i.e., the best test set performance) is selected.
Definitions
- $n$ is the number of test set examples.
- $m$ is the number of models trained on the training set.
- $Y_i$ is the response/label/outcome r.v. for test set example $i$.
- $\hat{Y}_{ij}$ is model $j$'s prediction r.v. for test set example $i$. Assume, as is usual, that model $j$ was trained, tuned, etc. on data independent of all $Y_i$'s.
- $\text{err}$ is an error function. For classification problems, $\text{err}(Y, \hat{Y}) = I(Y \neq \hat{Y})$ is the 0-1 (misclassification) error, and $\text{err}(Y, \hat{Y}) = (Y - \hat{Y})^2$ is squared error (which also works for classification). For this analysis, let's assume $\text{err}$'s range is $[0,1]$.
- $Z_{ij} = \text{err}(Y_i, \hat{Y}_{ij})$ is an r.v. for the error of model $j$ on test set example $i$. A consequence of training on data independent of the test data (whose examples are themselves drawn independently) is that, for each $j$ and all $i \neq i'$, $Z_{ij} \perp Z_{i'j}$.
- $\bar{Z}_j = \frac{1}{n} \sum_{i=1}^{n} Z_{ij}$ is an estimator of model $j$'s error.
- $\mu_j = \text{E}(\bar{Z}_j)$ is the expected or "true" error of model $j$.
- $\epsilon$ is the largest acceptable difference (the tolerance) between a model's estimated error and its true error. This quantity must be specified in advance for the analysis to have any value.
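To make the notation concrete, here's a minimal R sketch of how the $\bar{Z}_j$'s would be computed and the model selected. The objects y_test (the $n$ test labels) and preds (an $n \times m$ matrix whose column $j$ holds model $j$'s test set predictions) are hypothetical placeholders:
# y_test and preds are hypothetical placeholders (see above).
# 0/1 error: Z[i, j] = 1 if y_test[i] != preds[i, j], else 0.
Z = (y_test != preds) * 1
z_bar = colMeans(Z)        # test set error estimate for each model j
j_star = which.min(z_bar)  # selected model: lowest estimated error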
Performance guarantee
The procedure above is run once. The model $j^* = \text{argmin}_j \bar{Z}_j$ is selected. A quantity amenable to analysis is the probability that $j^*$'s error estimator misses $j^*$'s true error by more than $\epsilon$, i.e., $p := \Pr(|\bar{Z}_{j^*} - \mu_{j^*}| > \epsilon)$.
To break $p$ down, first note that the event $|\bar{Z}_{j^*} - \mu_{j^*}| > \epsilon$ is a subset of the event $\bigcup_{j=1}^{m} |\bar{Z}_{j} - \mu_{j}| > \epsilon$ (since $j^*$ is one of the $j$'s). Using this fact, we can bound $p$ by:
$$
\begin{align}
p &= \Pr(|\bar{Z}_{j^*} - \mu_{j^*} | > \epsilon) \\
&\leq \Pr \Bigg( \bigcup_{j=1}^{m} |\bar{Z}_{j} - \mu_{j} | > \epsilon \Bigg) && \text{monotonicity} \\
&\leq \sum_{j=1}^{m} \Pr(|\bar{Z}_{j} - \mu_{j} | > \epsilon) && \text{union bound} \\
&\leq \sum_{j=1}^{m} 2\exp(-2 \epsilon^2 n) && \text{Hoeffding's inequality} \\
&= 2m\exp(-2 \epsilon^2 n).
\end{align}
$$
From now on, I'll refer to this upper bound on $p$ as $p'$.
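In code, the bound is a one-liner (the function name hoeffding_bound is mine):
# p' = 2*m*exp(-2*eps^2*n): upper bound on the probability that the selected
# model's test set error estimate misses its true error by more than eps.
hoeffding_bound = function(m, eps, n) {
  2 * m * exp(-2 * eps^2 * n)
}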
Interpretation
Wrong: I have a dataset. I ran the procedure and computed the minimum test error estimate (0.69 in your case). The probability that this estimate misses my selected model's expected error by more than $\epsilon$ is (at most) $p'$.
Right: I ran this procedure on $100$ randomly sampled, independent datasets. Across these datasets, I expect the test set error estimate to miss the selected model's expected error by more than $\epsilon$ in (at most) $p' \cdot 100$ of them. That expectation is the only thing I can back up; I don't know for which datasets the estimate missed, or exactly how many times it missed. And I guarantee nothing about the individual performance of any of the 100 selected models.
Here are things that this procedure does not allow for:
- An unbiased estimator of model $j^*$'s performance. If you want that, get more independent data and evaluate model $j^*$ on it.
- (As with almost all model evaluation procedures) observing the test set error estimate, training a new/corrected model, and then repeating this procedure while hoping for the same guarantee. That's b/c once a model's predictions become dependent on the test labels, Hoeffding's inequality no longer applies. It may go w/o saying, but in general, avoid inducing any dependence between predictions and labels when evaluating a model or a selection procedure. So don't re-fit or re-tune models and then re-evaluate on the test set.
- Unbounded $\text{err}$ functions. We need $\text{err}$ to be bounded to apply Hoeffding's inequality, so regression problems with unbounded errors are excluded.
Example
I've decided I'm going to train at most $m = 10$ models. I've split my data so that $n = 10,000$. For my problem, I'll accept $\epsilon = 0.02$. (Note that the performance guarantee is 2-sided. There's a 1-sided version of this analysis if you're unconcerned with overestimating error, which could be ok in some applications.) If I ran the procedure 1000 times, then I'd expect the test set error estimate to be more than $\epsilon$ away from the selected model's true error in at most 7 cases. That's because $p' \approx 0.0067$.
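As a sanity check on that arithmetic, plug the example's settings into the hoeffding_bound helper sketched above:
hoeffding_bound(m = 10, eps = 0.02, n = 10000)  # 2*10*exp(-8), about 0.0067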
You can also invert the bound to calculate the test set's sample size, assuming you've pre-specified $m$, $\epsilon$, and the target $p'$. Here's some R code for that:
# Smallest (possibly fractional) test set size n such that
# 2*m*exp(-2*eps^2*n) <= p; round up to an integer in practice.
hoeffding_sample_size = function(m, eps, p) {
  (log(2) + log(m) - log(p)) / (2*eps^2)
}
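For example, feeding back the target $p'$ from the example above roughly recovers its test set size:
hoeffding_sample_size(m = 10, eps = 0.02, p = 0.0067)  # about 10,002 examples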
Simulation
Here's a link to a simulation which checks that the performance guarantee is valid for $\epsilon = 0.01, m = 2$.

Note that the bound is useless (for $\epsilon = 0.01, m = 2$) until the test set size gets past ~7k examples.
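I'm not reproducing the linked simulation here, but a minimal sketch of that kind of check might look like the following. The true model errors (0.30 and 0.32), the number of simulated datasets, and the function name are illustrative assumptions; each test example's 0/1 error is simulated directly as a Bernoulli draw.
# Estimate Pr(|Z_bar_{j*} - mu_{j*}| > eps) by simulation and compare to p'.
set.seed(1)
check_guarantee = function(n, eps = 0.01, mu = c(0.30, 0.32), n_sims = 10000) {
  misses = replicate(n_sims, {
    z_bar = sapply(mu, function(p) mean(rbinom(n, size = 1, prob = p)))
    j_star = which.min(z_bar)
    abs(z_bar[j_star] - mu[j_star]) > eps
  })
  c(empirical = mean(misses), bound = 2 * length(mu) * exp(-2 * eps^2 * n))
}
check_guarantee(n = 10000)  # the empirical miss rate should sit below the bound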
Should this procedure be used?
If you'd prefer a performance guarantee about a procedure b/c you're solving a whole bunch of prediction problems, and you have a lot of test data for each, then maybe. But practitioners will almost always want an unbiased estimate of the selected model's out-of-sample error rather than a guarantee about the procedure used to select it. That's why we prefer reserving an independent test set for a single evaluation of a single model.
Adapting the bound for adaptive procedures
Everything above only applies to this procedure: pre-specify $m$ models, train them, evaluate them, and pick the best one. This procedure is less realistic than an adaptive one: train a model, observe its test set error, train an adjusted model, observe its test set error, and repeat until you're satisfied or tired. A performance guarantee can still be made in that case, as shown in:
Blum, Avrim, and Moritz Hardt. "The ladder: A reliable leaderboard for machine learning competitions." International Conference on Machine Learning. PMLR, 2015.
A core idea behind the Ladder Mechanism, in contrast to the plain bound $p'$, is that a model must significantly outperform the previous best before you trust (and release) its test set error estimate.
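For flavor, here's a rough R sketch of the Ladder's update rule as I understand it from the paper (the function name and the default step size eta are my choices; see the paper for the exact algorithm and its guarantee):
# Release a new (rounded) test set error only when a submission improves on the
# best released score by more than eta; otherwise re-release the previous score.
ladder_update = function(best_so_far, new_loss, eta = 0.01) {
  if (new_loss < best_so_far - eta) {
    round(new_loss / eta) * eta  # round the released estimate to a multiple of eta
  } else {
    best_so_far
  }
}
Starting from best_so_far = Inf and applying ladder_update after every submission yields the sequence of released scores; the released estimate only changes when a submission clears the previous best by more than eta.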