In short
Assume there is a normally distributed population, from which I take a sample of $N_1=5$ elements. I then obtain a separate number $x$ from somewhere. I want to test whether $x$ came from the same distribution.
Is a two-sample $t$-test with the assumption of common variance, and sample sizes 5 and 1 respectively, appropriate here? Can one of the samples in a $t$-test have size 1? I've read about "extremely" small sample sizes, but never $N=1$. Having both samples of size 1 would obviously be silly, but I feel that the variability observed among the 5 elements of the first sample makes it possible to assess whether $x$ "fits in well" among those values.
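If I write out the pooled statistic with $N_2=1$, the single observation contributes nothing to the pooled variance (it has zero degrees of freedom), so, unless I'm mistaken, the test reduces to $t = (x - \bar{x}) / \left(s\sqrt{1 + 1/N_1}\right)$ with $N_1 - 1$ degrees of freedom, i.e. to checking whether $x$ lies within a $t$-based prediction interval for a new observation. Here is a minimal sketch of what I have in mind (the function name is my own invention; I compute the statistic by hand because, as far as I can tell, `scipy.stats.ttest_ind` does not cope with a size-1 sample):

```python
import numpy as np
from scipy import stats

def single_obs_t_test(sample, x):
    """Pooled two-sample t-test where the second "sample" is the single
    observation x; equivalent to checking whether x falls inside the
    t-based prediction interval of the first sample."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    m = sample.mean()
    s = sample.std(ddof=1)                  # sample standard deviation
    t = (x - m) / (s * np.sqrt(1 + 1 / n))  # pooled statistic with N2 = 1
    df = n - 1                              # the lone observation adds no df
    p = 2 * stats.t.sf(abs(t), df)          # two-sided p-value
    return t, p
```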
Background
I am analyzing a machine learning process that, at its end, spits out an evaluation metric (describing how well the trained model performs on the test set). The training process is stochastic due to random initialization, random shuffling of examples before each epoch, and random data augmentation.
I am performing ablations to compare how changing hyperparameters affects the evaluation metric. Sometimes the difference is small, sometimes larger. To get an idea of whether these differences are meaningful or merely variation induced by the randomness of the training process, I first need to measure how noisy the training process is: I have repeated the main experiment with a fixed configuration 5 times, so the differences in the final evaluation metric result only from noise.
Then I make a change to the configuration, train and evaluate the model, and obtain a single value. I want to be able to say something quantitative about how meaningfully different this value is from the 5 others.
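For example (with made-up numbers, not my real metrics), the comparison would look like this, reusing the sketch above:

```python
baseline = [0.712, 0.705, 0.718, 0.709, 0.714]  # 5 repeats of the fixed config (hypothetical)
changed = 0.731                                 # single run of the modified config (hypothetical)

t, p = single_obs_t_test(baseline, changed)
print(f"t = {t:.2f} on {len(baseline) - 1} df, p = {p:.3f}")
```

With only 4 degrees of freedom, the two-sided 5% critical value is about 2.78, so presumably only fairly large deviations would register as significant.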
Is there any value in attempting a formal hypothesis-testing approach, or should I just rely on informally eyeballing the numbers?
Do note that obtaining one value costs about a day of computing, which can be put to better use than repeating the same training job over and over. It is therefore not realistic to obtain more than 5 values for the baseline configuration, nor more than a single value for any other configuration. Out of necessity, I am willing to assume that the training procedure yields evaluation metrics with the same variance for all configurations.