Is a t-test between a dataset and its subset meaningful

Question

I have two data sets A and B. B is a subset of A. Is it meaningful to test if A and B are significantly different with t-test?

No. You should want to compare B with its complement within A. The overlap between B as subset and A as containing set is fatal here. If you attempt to write down the implied generating process, that will become clear. — Nick Cox, Nov 15 '15 at 14:30
Since B is already the same as itself, if B doesn't differ from the rest, it doesn't differ from the whole. That is, comparing B to {A}-{B} is the same as comparing it to A. The comparison between non-overlapping subsets is the standard way to do it. — Glen_b, Nov 15 '15 at 15:10

score 2 · Accepted Answer · answered Oct 31 '18 at 23:20

Contrary to the sloppy way it is sometimes expressed, there is no such thing in statistics as two known numbers being "significantly different". There is only such a thing as observed data giving "significant evidence of a difference" between unknown quantities. (This is what we test in a classical hypothesis test for mean differences.) So if you are using a subset of a dataset in a t-test, this suggests you are treating the subset as a sample used to estimate the unknown mean of some larger group, presumably the dataset from which it was taken. If that is the case, then you are apparently testing whether the mean of the full dataset differs from itself. (It doesn't.)

If you have some other meaning in mind for what you are actually testing (i.e., what mean is being compared to what mean?), you will need to be clearer about the actual hypotheses for your test. However, it is worth bearing in mind that there are established probabilistic rules for how a random subset of a population, sampled via simple random sampling, relates to the larger population. (In your case the "population" is your full dataset and the "data" is the subset.)

Relationship between population and simple random sample: Let $X_1,...,X_N$ be the full data set (your population) and define its empirical distribution $F_X : \mathbb{R} \rightarrow [0,1]$ by:

$$F_X(x) = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(X_i \leqslant x).$$

If the subset was chosen via simple random sampling, then this is equivalent to assuming that the original data set is exchangeable and the subset is the first $n$ values $X_1,...,X_n$. Under simple random sampling each sampled value satisfies $X_i \sim F_X$, which means it has mean:

$$\mathbb{E}(X_i|F_X) = \mathbb{E}(\bar{X}_n|F_X) = \bar{X}_N.$$

Thus, under simple random sampling, the expected value of a single sampled value and the expected value of the subset's sample mean are both equal to the mean of the larger set (the population). This is guaranteed by the sampling mechanism, so it does not need to be established via hypothesis testing.
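As a small check of this result, here is a sketch that enumerates every possible subset of size $n$ from a tiny population and averages the subset means. The population values and the subset size are made up for illustration; only the Python standard library is used, with exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations


def mean(values):
    """Exact arithmetic mean of a sequence of Fractions."""
    return sum(values, Fraction(0)) / len(values)


# Hypothetical "population" (the full data set) and subset size.
population = [Fraction(v) for v in (3, 7, 8, 12, 15, 21)]
n = 3

# Under simple random sampling every subset of size n is equally likely,
# so the expected subset mean is the plain average over all such subsets.
subset_means = [mean(s) for s in combinations(population, n)]
expected_subset_mean = mean(subset_means)
population_mean = mean(population)

print(expected_subset_mean == population_mean)  # True
```

Individual subset means vary, but their average over all equally likely subsets matches the full-data mean exactly, which is why no test is needed to establish it.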

score 1 · Answer 2 · answered Oct 31 '18 at 22:37

Answered in comments: No. You should want to compare B with its complement within A. The overlap between B as subset and A as containing set is fatal here. If you attempt to write down the implied generating process, that will become clear. – Nick Cox

Since B is already the same as itself, if B doesn't differ from the rest, it doesn't differ from the whole. That is, comparing B to {A}-{B} is the same as comparing it to A. The comparison between non-overlapping subsets is the standard way to do it. – Glen_b
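A minimal sketch of the comparison suggested in these comments, splitting A into B and its complement and computing Welch's t statistic between the two non-overlapping parts. The data values, the membership flags, and the helper name `welch_t` are all hypothetical, and Welch's unequal-variance form is one reasonable choice rather than the only one. The last lines also check the identity behind Glen_b's remark: the gap between B and the whole of A is just the gap between B and the rest, shrunk by the factor $1 - n/N$.

```python
import math
import statistics


def welch_t(x, y):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Hypothetical full data set A and flags marking which values belong to B.
A = [4.1, 5.0, 5.6, 6.2, 4.8, 5.9, 6.5, 5.1, 4.4, 6.0]
in_B = [True, False, True, False, False, True, False, True, False, False]

B = [a for a, flag in zip(A, in_B) if flag]
rest = [a for a, flag in zip(A, in_B) if not flag]  # complement of B within A

# Compare the two non-overlapping groups, not B against all of A.
t, df = welch_t(B, rest)
print(f"t = {t:.3f}, df = {df:.1f}")

# Identity: mean(B) - mean(A) = (1 - n/N) * (mean(B) - mean(rest))
gap_vs_whole = statistics.mean(B) - statistics.mean(A)
gap_vs_rest = statistics.mean(B) - statistics.mean(rest)
print(math.isclose(gap_vs_whole, (1 - len(B) / len(A)) * gap_vs_rest))
```

Turning `t` and `df` into a p-value needs the t distribution's CDF, which the standard library does not provide; a statistics package would be used for that step.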

See also: "How to compare sub-sample mean with the sample mean?"
