I am new to statistics and find the following procedure unnatural and arbitrary. Could someone point out what I am missing, and how I should be thinking about it?
Assumptions
I have two vectors of real numbers $x_{i}$ and $x'_{i}$, each of length $N$, and I assume they are sampled from the one-dimensional normal distributions $N(0,\sigma^2)$ and $N(0,\sigma'^2)$, respectively. The goal is to tell how different $\sigma$ and $\sigma'$ are. A standard approach seems to be the F-test, described as follows:
F-test
Define the statistic $T(\{X_i\}, \{X'_i\})$ to be the ratio of the sample variances: $$\frac{\frac{1}{N}\sum_i X_i^2}{\frac{1}{N}\sum_i X_i'^2}.$$ If $\sigma$ and $\sigma'$ are close, the statistic $T$ should be close to $1$. With the hypotheses
H0: $\sigma = \sigma'$
H1: $\sigma \neq \sigma'$
we assume that H0 is true. Then $T$ follows an $F$-distribution with $(N, N)$ degrees of freedom, whose cdf is a regularized incomplete beta function; let's just call it $F = F_{N,N}$. Now, we use the observed statistic $t = T(\{x_i\}, \{x'_i\})$ (which is a number, assumed without loss of generality to be larger than $1$) to construct the p-value as the probability of the "extremes":
$$p = P(T > t) + P(T < \frac{1}{t})$$
Since $t$ is just a number, we can use $F$ to compute this $p$ from the observed data $x_i$. The idea is that the more $t$ deviates from $1$ (the answer we expect under H0), the smaller $p$ is. Therefore, if $p$ is small enough (commonly $< 0.05$), we can reject H0 in favor of H1.
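For concreteness, here is a minimal sketch of the procedure in Python, on hypothetical data (scipy's `f` plays the role of the $F_{N,N}$ above; since the means are known to be zero, each side has $N$ degrees of freedom):

```python
import numpy as np
from scipy.stats import f  # F-distribution; its cdf is the F = F_{N,N} above

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, N)   # hypothetical draw from N(0, sigma^2)
xp = rng.normal(0.0, 1.3, N)  # hypothetical draw from N(0, sigma'^2)

# T: ratio of the mean-zero sample variances.
t = np.mean(x**2) / np.mean(xp**2)
t = max(t, 1.0 / t)  # orient so that t >= 1, as assumed above

# p = P(T > t) + P(T < 1/t) under H0.
p = f.sf(t, N, N) + f.cdf(1.0 / t, N, N)
print(f"t = {t:.3f}, p = {p:.4f}")
```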
Questions
I get that $T$ is a good enough statistic for this task. But is it universal in any sense? Couldn't one pick some other $T$ to do the job? For example, $$\tilde{T} = \frac{1}{N}\sum_i X_i^2 - \frac{1}{N}\sum_i X_i'^2.$$ This is just an example: my point is not to argue for this as an alternative, but to stress the arbitrariness.
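To make the alternative concrete: $\tilde{T}$ has no standard tabulated null distribution, but one could simulate it. A minimal Monte Carlo sketch, where the common $\sigma$ under H0 is an assumption one has to plug in:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma0 = 50, 1.0  # sigma0: assumed common sigma under H0 (hypothetical choice)

def T_tilde(x, xp):
    # Difference, rather than ratio, of the mean-zero sample variances.
    return np.mean(x**2) - np.mean(xp**2)

# Simulated null distribution of T_tilde; note that, unlike T's, it depends
# on the plugged-in value of sigma0, not only on H0 being true.
null = np.array([T_tilde(rng.normal(0, sigma0, N), rng.normal(0, sigma0, N))
                 for _ in range(10_000)])

# Two-sided Monte Carlo p-value for an observed value t_tilde:
# p = np.mean(np.abs(null) >= abs(t_tilde))
```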
Why should we construct $p$ in that way, when there are many other ways to "measure" how far $t$ deviates from $1$? For example, $$\tilde{p} = P(T > t) + P(T < 2-t).$$ I get that it's more natural to use a multiplicative scale here, but it still feels arbitrary. Again, this is just an example.
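Computationally, nothing stops you: this alternative $\tilde{p}$ is just as easy to evaluate from the same cdf $F$. A sketch, continuing the setup above:

```python
from scipy.stats import f

def p_two_sided(t, N):
    # The F-test construction: P(T > t) + P(T < 1/t).
    return f.sf(t, N, N) + f.cdf(1.0 / t, N, N)

def p_tilde(t, N):
    # The additive alternative: P(T > t) + P(T < 2 - t).
    # (f.cdf returns 0 for non-positive arguments, so t >= 2 is safe.)
    return f.sf(t, N, N) + f.cdf(2.0 - t, N, N)
```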
My true incentive is really to tell how much the $\sigma$'s differ. But the test as stated only allows me to tell whether they are different, even if they differ by only $10^{-9}$. How can one build a hypothesis test that quantifies this difference? For example,
H0: $|\sigma - \sigma'| < 10$
H1: $|\sigma - \sigma'| \geq 10$
Can one even build a test where the threshold $10$ is left abstract?
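One way I can imagine making the threshold concrete, sketched here under the same model assumptions: since $T \cdot (\sigma'^2/\sigma^2) \sim F_{N,N}$, the ratio $\sigma^2/\sigma'^2$ admits a confidence interval, and any threshold becomes a parameter to compare the interval against (this quantifies a ratio rather than the absolute difference $|\sigma - \sigma'|$):

```python
from scipy.stats import f

def variance_ratio_ci(t, N, alpha=0.05):
    # (1 - alpha) confidence interval for sigma^2 / sigma'^2, from
    # T * (sigma'^2 / sigma^2) ~ F(N, N) under the normal model.
    lo = t / f.ppf(1.0 - alpha / 2, N, N)
    hi = t / f.ppf(alpha / 2, N, N)
    return lo, hi
```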
Conclusion
I believe I have pointed out some arbitrariness in the F-test. However, every statistical test seems to be arbitrary in the same way: a different test statistic and a different way of constructing $p$ will change the final result. So how should I understand our scientific results nowadays?