I am new to statistics and find the following procedure unnatural and arbitrary. Could someone point out what I am missing, and how I should be thinking about it?
Assumptions
I have two vectors of real numbers $x_{i}$ and $x'_{i}$, each of length $N$, and I assume they are sampled from the one-dimensional normal distributions $N(0,\sigma^2)$ and $N(0,\sigma'^2)$, respectively. The goal is to tell how different $\sigma$ and $\sigma'$ are. A standard approach seems to be the F-test, described as follows:
F-test
Define the statistic $T(\{X_i\}, \{X'_i\})$ to be the ratio of the sample variances: $$\frac{\frac{1}{N}\sum_i X_i^2}{\frac{1}{N}\sum_i X_i'^2}.$$ If $\sigma$ and $\sigma'$ are close, the statistic $T$ should be close to $1$. With the hypotheses
H0: $\sigma = \sigma'$
H1: $\sigma \neq \sigma'$
we assume that H0 is true. Then $T$ follows an $F$-distribution with $(N, N)$ degrees of freedom, whose cdf is a regularized incomplete beta function; let's just call it $F = F_{N,N}$. Now, we use the observed statistic $t = T(\{x_i\}, \{x'_i\})$ (which is a number, assumed without loss of generality to be larger than $1$) to construct the p-value as the probability of the "extremes":
$$p = P(T > t) + P(T < \frac{1}{t})$$
Since $t$ is just a number, we can use $F$ to compute this $p$ from the observed data $x_i$. The idea is that the more $t$ deviates from $1$ (the answer we expect under H0), the smaller $p$ is. Therefore, if $p$ is small enough (commonly $< 0.05$), we can reject H0 in favor of H1.
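For concreteness, here is a minimal sketch of the procedure in Python, on hypothetical data (scipy's `f` plays the role of the $F_{N,N}$ above; since the means are known to be zero, each side has $N$ degrees of freedom):

```python
import numpy as np
from scipy.stats import f  # F-distribution; its cdf is the F = F_{N,N} above

rng = np.random.default_rng(0)
N = 50
x = rng.normal(0.0, 1.0, N)   # hypothetical draw from N(0, sigma^2)
xp = rng.normal(0.0, 1.3, N)  # hypothetical draw from N(0, sigma'^2)

# T: ratio of the mean-zero sample variances.
t = np.mean(x**2) / np.mean(xp**2)
t = max(t, 1.0 / t)  # orient so that t >= 1, as assumed above

# p = P(T > t) + P(T < 1/t) under H0.
p = f.sf(t, N, N) + f.cdf(1.0 / t, N, N)
print(f"t = {t:.3f}, p = {p:.4f}")
```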
Questions
I get that $T$ is a good enough statistic for this task. But is it universal in any sense? Couldn't one pick some other $T$ to do the job? For example, $$\tilde{T} = \frac{1}{N}\sum_i X_i^2 - \frac{1}{N}\sum_i X_i'^2.$$ This is just an example: my point is not to argue for this as an alternative, but to stress the arbitrariness.
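To make the alternative concrete: $\tilde{T}$ has no standard tabulated null distribution, but one could simulate it. A minimal Monte Carlo sketch, where the common $\sigma$ under H0 is an assumption one has to plug in:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma0 = 50, 1.0  # sigma0: assumed common sigma under H0 (hypothetical choice)

def T_tilde(x, xp):
    # Difference, rather than ratio, of the mean-zero sample variances.
    return np.mean(x**2) - np.mean(xp**2)

# Simulated null distribution of T_tilde; note that, unlike T's, it depends
# on the plugged-in value of sigma0, not only on H0 being true.
null = np.array([T_tilde(rng.normal(0, sigma0, N), rng.normal(0, sigma0, N))
                 for _ in range(10_000)])

# Two-sided Monte Carlo p-value for an observed value t_tilde:
# p = np.mean(np.abs(null) >= abs(t_tilde))
```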
Why should we construct $p$ in that way, when there are many other ways to "measure" how far $t$ deviates from $1$? For example, $$\tilde{p} = P(T > t) + P(T < 2-t).$$ I get that it's more natural to use a multiplicative scale here, but it still feels arbitrary. Again, this is just an example.
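Computationally, nothing stops you: this alternative $\tilde{p}$ is just as easy to evaluate from the same cdf $F$. A sketch, continuing the setup above:

```python
from scipy.stats import f

def p_two_sided(t, N):
    # The F-test construction: P(T > t) + P(T < 1/t).
    return f.sf(t, N, N) + f.cdf(1.0 / t, N, N)

def p_tilde(t, N):
    # The additive alternative: P(T > t) + P(T < 2 - t).
    # (f.cdf returns 0 for non-positive arguments, so t >= 2 is safe.)
    return f.sf(t, N, N) + f.cdf(2.0 - t, N, N)
```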
My true incentive is really to tell how much the $\sigma$'s differ. But the test as stated only allows me to tell whether they are different, even if they differ by only $10^{-9}$. How can one build a hypothesis test that quantifies this difference? For example,
H0: $|\sigma - \sigma'| < 10$
H1: $|\sigma - \sigma'| \geq 10$
Can one even build a test where the threshold $10$ is left abstract?
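One way I can imagine making the threshold concrete, sketched here under the same model assumptions: since $T \cdot (\sigma'^2/\sigma^2) \sim F_{N,N}$, the ratio $\sigma^2/\sigma'^2$ admits a confidence interval, and any threshold becomes a parameter to compare the interval against (this quantifies a ratio rather than the absolute difference $|\sigma - \sigma'|$):

```python
from scipy.stats import f

def variance_ratio_ci(t, N, alpha=0.05):
    # (1 - alpha) confidence interval for sigma^2 / sigma'^2, from
    # T * (sigma'^2 / sigma^2) ~ F(N, N) under the normal model.
    lo = t / f.ppf(1.0 - alpha / 2, N, N)
    hi = t / f.ppf(alpha / 2, N, N)
    return lo, hi
```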
Conclusion
I believe I have pointed out some arbitrariness in the F-test. However, every statistical test seems to be arbitrary in the same way: a different test statistic and a different way of constructing $p$ will change the final result. So how should I understand our scientific results nowadays?