To see if the test is "good" you need to analyse the properties of the power function
As you point out, it is possible to derive a test statistic ---and thereby derive the "evidentiary ordering" defining what is "extreme" in the test--- using formal methods like the likelihood-ratio method. It is also possible to formulate a test statistic on a more intuitive basis. Both of these are allowed, but ultimately the quality of the test is checked by looking at its properties in relation to correct inference under different parameter values. In particular, this generally involves an analysis of the frequentist properties of the power function.
For context, this explanation builds on my explanation of a hypothesis test in this related answer.
Suppose your test has an unknown parameter $\theta$ and disjoint hypothesis spaces $\Theta_0$ and $\Theta_1$ corresponding to the two hypotheses. Given a stipulated evidentiary ordering $\succeq$, there is a resulting p-value function $p$, and the corresponding power function (allowing for a variable significance level and sample size) is:
$$\text{Power}(\theta, \alpha, n)
\equiv \mathbb{P}(\text{Reject } H_0 | \theta)
= \mathbb{P}(p(\mathbf{X}_n) \leqslant \alpha | \theta).$$
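As a rough sketch of how you would compute this in practice (using a one-sided z-test for a normal mean with known $\sigma$ purely as a stand-in example), you can estimate the power function by simulation: generate many samples under a given $\theta$, compute the p-value for each, and record the proportion that fall at or below $\alpha$.

```python
import numpy as np
from scipy import stats

def simulated_power(theta, alpha, n, n_sims=10_000, theta0=0.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of Power(theta, alpha, n) = P(p(X_n) <= alpha | theta)
    for a one-sided z-test of H0: theta <= theta0 against H1: theta > theta0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, sigma, size=(n_sims, n))        # n_sims samples of size n under theta
    z = (x.mean(axis=1) - theta0) / (sigma / np.sqrt(n))  # test statistic for each sample
    p_values = stats.norm.sf(z)                           # upper-tail p-values
    return np.mean(p_values <= alpha)                     # proportion of rejections

# Near the null boundary the power should sit near alpha; under the alternative it should be higher.
print(simulated_power(theta=0.0, alpha=0.05, n=50))   # roughly 0.05
print(simulated_power(theta=0.3, alpha=0.05, n=50))   # clearly above 0.05
```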
The power function fully determines the probabilities of Type I/Type II error in the test under each possible parameter value. Consequently, the way we find out whether the hypothesis test is any good is by analysing the properties of the power function to see if it gives us suitably low probabilities of error. In particular, we examine the frequentist properties of the test by seeing what happens to the power function over all possible values of $\theta$. One of the most important things we would look at is the consistency of the test, which is a property that depends on what happens to the power as $n \rightarrow \infty$. At a minimum, we want to see the following property:$^\dagger$
$$\begin{align}
\lim_{n \rightarrow \infty} \text{Power}(\theta, \alpha, n) &\leqslant \alpha
\quad \quad \quad \text{for } \theta \in \Theta_0, \\[12pt]
\lim_{n \rightarrow \infty} \text{Power}(\theta, \alpha, n) &= 1
\quad \quad \quad \text{for } \theta \in \Theta_1.
\end{align}$$
This property says that, with enough data, the test becomes very good at rejecting a false null hypothesis (the probability of Type II error goes to zero in the limit) while still respecting the significance level as the limiting size of the test. There are other valuable properties of the power function that we would also examine. We may even compare two hypothesis tests that use different definitions of "extreme" and find that one has much better properties than the other (e.g., one test "dominates" the other in terms of having lower probabilities of Type I and Type II error).
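To see these limits in a concrete case, consider the familiar one-sided z-test for a normal mean with known $\sigma$ (used here just as a simple illustration), where the power function has a closed form:

$$\text{Power}(\theta, \alpha, n) = 1 - \Phi \bigg( z_{1-\alpha} - \frac{\sqrt{n} (\theta - \theta_0)}{\sigma} \bigg),$$

where $\Phi$ is the standard normal distribution function and $z_{1-\alpha}$ is its $1-\alpha$ quantile. At $\theta = \theta_0$ this equals $\alpha$ for every $n$, for $\theta < \theta_0$ it converges to zero, and for $\theta > \theta_0$ the argument of $\Phi$ diverges to $-\infty$, so the power converges to one, which is exactly the limiting behaviour required above.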
Example of power analysis for a bespoke test: In case it interests you, you can find an example of power analysis for a bespoke test in O'Neill (2020). This paper puts forward a new kind of hypothesis test (to test for periodic signals in data/residuals) where the test statistic is formulated "intuitively" (but based on some related work) and the p-value function is approximated using permutation sampling. This test is sufficiently complex that its power function is difficult to compute exactly, and so it is computed using simulation methods over a set of points of interest.
Section 3 of this paper (pp. 9-13) shows a power analysis of the test to check that the test actually "works" --- i.e., that it does indeed detect periodic signals in data (with a high enough sample size) and it doesn't say that they're there when they're not (at least, not beyond the expected rates of Type I error). As you will see from that section, the analysis involves showing the power of the test over a range of sample sizes and parameter values in both the null and alternative regions (under one or more stipulated significance levels) to see if it is doing what it should be doing. There is also some deeper simulation analysis showing the distribution of the p-value in these cases. The probability of Type I error in the test is held at its appropriate rate by construction, and the probability of Type II error is analysed by simulation, by computing the power function at a set of points of interest. What we are looking for in the latter case is to confirm that the power of the test tends towards one under every parameter value in the alternative space as we get more and more data.
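To give a rough idea of what that kind of tabulation looks like in code (again using the simple z-test stand-in from earlier rather than the actual test in the paper), you can compute the simulated power over a grid of parameter values and sample sizes:

```python
import numpy as np
from scipy import stats

def power_grid(thetas, ns, alpha=0.05, n_sims=5_000, theta0=0.0, sigma=1.0, seed=0):
    """Simulated power of the one-sided z-test stand-in over a grid of
    parameter values (rows) and sample sizes (columns)."""
    rng = np.random.default_rng(seed)
    grid = np.empty((len(thetas), len(ns)))
    for i, theta in enumerate(thetas):
        for j, n in enumerate(ns):
            x = rng.normal(theta, sigma, size=(n_sims, n))
            z = (x.mean(axis=1) - theta0) / (sigma / np.sqrt(n))
            grid[i, j] = np.mean(stats.norm.sf(z) <= alpha)
    return grid

# Parameter values spanning the null point (0.0) and the alternative region,
# against increasing sample sizes; power should stay near alpha in the first
# row and rise towards one along the other rows as n grows.
print(np.round(power_grid(thetas=[0.0, 0.1, 0.2, 0.4], ns=[20, 50, 100, 500]), 3))
```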
In this particular paper, the power function exhibits the type of properties you would want it to have, which gives us confidence that the test is "good". In particular, in Figure 5 (p. 13) you can see that the power of the test increases towards one as the sample size increases under each parameter value in the alternative space. Moreover, as should be expected, the rate of increase of the power is much higher when the parameter is far away from the null value. Now, that gives us a basic "sense check" of the test, but it doesn't guarantee that there isn't some other test that will dominate the present test. If someone else were to formulate an alternative test for periodic signals in data, it would be possible to compare the power functions of the two tests to see if one of them is unambiguously better than the other (or if they are each better/worse at certain parameter values in the alternative space).
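As a simple sketch of what such a comparison can look like (using a one-sample t-test and a sign test as stand-in competitors for a location shift under normal data, not tests from the paper), you can simulate both power functions at the same settings and see whether one of them dominates:

```python
import numpy as np
from scipy import stats

def compare_power(theta, n, alpha=0.05, n_sims=5_000, theta0=0.0, seed=0):
    """Simulated power of a one-sided one-sample t-test versus a one-sided
    sign test of H0: location <= theta0, under normal data with location theta."""
    rng = np.random.default_rng(seed)
    reject_t = reject_sign = 0
    for _ in range(n_sims):
        x = rng.normal(theta, 1.0, size=n)
        p_t = stats.ttest_1samp(x, popmean=theta0, alternative="greater").pvalue
        p_sign = stats.binomtest(int(np.sum(x > theta0)), n, p=0.5,
                                 alternative="greater").pvalue
        reject_t += (p_t <= alpha)
        reject_sign += (p_sign <= alpha)
    return reject_t / n_sims, reject_sign / n_sims

# Under normal data the t-test typically achieves higher power at each point in
# the alternative space, so it is the better of these two tests in this setting.
print(compare_power(theta=0.3, n=40))
```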
(Another simulated power analysis for a different bespoke test can be found in O'Neill (2023) (or arXiv version, pp. 23-28). I know I'm bombarding you with my own papers, so excuse the self-indulgence; these are just the examples that I'm most familiar with.)
$^\dagger$ It is also worth noting that the power of a test can be computed using a generating mechanism for the data that is outside the stipulated model form for the test. In this case there might not be a clear parameter $\theta$, but there should still be some way to decide whether the null hypothesis is true or false under the alternative model. Here you would formulate some alternative model $\mathscr{M}$ and compute the power $\text{Power}(\mathscr{M}, \alpha, n)$ by simulating data from this model. You can then compare the computed power to what you would want it to be (depending on whether the null hypothesis is true or false under your new model) to see if the test has good properties when the true model $\mathscr{M}$ falls outside the scope of what was stipulated as the model form when you created the test. This broader kind of power analysis provides information about the "robustness" of the hypothesis test against a failure of its model assumptions. It is just as easy to do as regular power analysis within the model; the only difference is that we simulate the data from a different model instead.
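As a rough sketch of this broader kind of power analysis (using a one-sample t-test that assumes normality, with data actually generated from a heavy-tailed t-distribution as the alternative model $\mathscr{M}$), the only change from the earlier simulation sketches is the model the data are drawn from:

```python
import numpy as np
from scipy import stats

def power_under_model(sample_model, n, alpha=0.05, n_sims=10_000, theta0=0.0, seed=0):
    """Simulated rejection rate of a one-sided one-sample t-test of
    H0: mean <= theta0 when the data come from an arbitrary generating model."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = sample_model(rng, n)   # data simulated from the (possibly misspecified) model M
        p = stats.ttest_1samp(x, popmean=theta0, alternative="greater").pvalue
        rejections += (p <= alpha)
    return rejections / n_sims

# Heavy-tailed model with mean zero: the null hypothesis is true under this model,
# so the rejection rate should stay near alpha if the test is robust here.
heavy_tailed_null = lambda rng, n: rng.standard_t(df=3, size=n)
print(power_under_model(heavy_tailed_null, n=50))

# The same model shifted into the alternative region: the power should be well above alpha.
heavy_tailed_alt = lambda rng, n: rng.standard_t(df=3, size=n) + 0.5
print(power_under_model(heavy_tailed_alt, n=50))
```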