
This question is closely related to an earlier question, but I realized my case was actually a lot more specific than the way I formulated it there, in ways that I think merit a separate question.

I have two non-linear models, one nested inside another. My data violates parametric assumptions (specifically independence), so I want to do a non-parametric model comparison. The nesting of these models is such that I think I have a simple solution for this. Specifically, if the simple model is this: $$ M_0:Y_i=f(X_i,\theta)+\varepsilon_i $$ $$ \theta_K=0 $$ where $i$ indexes observations, $Y_i$ is the dependent variable, $X_i$ are the independent variables, $\theta$ are parameters, $f$ is a nonlinear function and $\varepsilon_i$ is noise, then the more complex model is obtained as follows: $$ M_1:Y_i=f(X_i,Z_i,\theta)+\varepsilon_i $$

$$ \theta_K= \begin{cases} -\alpha, & \text{if } Z_i=0 \\ \alpha, & \text{if } Z_i=1 \end{cases} $$ In other words, the complex model has $K$ parameters, and reduces to the simple model by setting $\theta_K=0$. The complex model says that the value of this parameter $\theta_K$ depends on the value of an additional binary variable $Z$, which plays no role in the simple model. Thus, under the null hypothesis that $M_0$ is true (i.e. that $\theta_K=0$), the values of $Z$ are exchangeable.

This suggests to me a simple permutation test for testing whether $M_1$ fits the data significantly better:

  1. Fit both models to the observed data $\{X,Y,Z\}$ and compute a goodness-of-fit statistic (let's say $R^2$)
  2. For each of (say) 10,000 iterations, randomly shuffle the observations of $Z$, refit the models and compute their $R^2$
  3. Compare the fit-improvement of $M_1$ w.r.t. $M_0$ on the observed data, to the thus-obtained null distribution of $R^2$-improvements. Calculate the p-value for the test as the fraction of $R^2$ improvements in the null-distribution that exceed the observed improvement.
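The three numbered steps above can be sketched in code. Everything in this sketch is made up for illustration: the exponential form of $f$, the functions `f0`/`f1`, and the simulated data are all assumptions, not part of the question; the question's $f$ could be any nonlinear function.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical nonlinear form for illustration: f is an exponential decay,
# and under M1 the binary variable Z (coded +/-1) shifts the decay rate by
# alpha * Z, so alpha = 0 recovers M0.
def f0(X, a, b):                       # simple model M0 (theta_K = 0)
    return a * np.exp(-b * X)

def f1(XZ, a, b, alpha):               # complex model M1
    X, Z = XZ
    return a * np.exp(-(b + alpha * Z) * X)

def r_squared(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Simulated data with a true effect (alpha = 0.5)
n = 200
X = rng.uniform(0.1, 3.0, n)
Z = rng.choice([-1.0, 1.0], n)
Y = 2.0 * np.exp(-(1.0 + 0.5 * Z) * X) + rng.normal(0.0, 0.05, n)

# Step 1: fit both models to the observed data. M0 ignores Z entirely,
# so its R^2 is the same for every permutation and is computed only once.
p0, _ = curve_fit(f0, X, Y, p0=[1.0, 1.0])
r2_0 = r_squared(Y, f0(X, *p0))

def r2_complex(Z_vec):
    p1, _ = curve_fit(f1, (X, Z_vec), Y, p0=[1.0, 1.0, 0.0])
    return r_squared(Y, f1((X, Z_vec), *p1))

observed = r2_complex(Z) - r2_0

# Step 2: refit M1 under random shuffles of Z to build the null distribution
n_perm = 500   # use ~10,000 in practice
null = np.array([r2_complex(rng.permutation(Z)) - r2_0
                 for _ in range(n_perm)])

# Step 3: p-value = fraction of null improvements exceeding the observed one
p_value = np.mean(null >= observed)
```

Note that because $R_0^2$ does not depend on the permuted $Z$, subtracting it shifts the observed and permuted statistics identically and does not change the p-value.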

Is this procedure correct (I'm 99% sure it is, but I'd like to verify this)? And is there a reference that I could cite for such an approach?

    Yes, having observations of Z is a key difference from your previous question. I think this situation might call for a regular old permutation test, depending on what exactly your null hypothesis says. I need to brush up on the details but I think that's even simpler than what you are describing. I will think about how to rephrase your situation to make the resemblance to a permutation test more obvious. – eric_kernfeld Apr 26 '19 at 15:02
  • @eric_kernfeld I look forward to hearing your thoughts! – Ruben van Bergen May 01 '19 at 19:28
  • Can your problem be cast as a test of $\theta=0$ versus $\theta=\alpha$ where $Z_i$ is encoded as $\pm 1$, the models say

    $$Y_i = g(X_i, \theta Z_i) + \varepsilon_i $$

    and $X_i$, $Y_i$, and $Z_i$ are all observed?

    – eric_kernfeld May 01 '19 at 23:32
  • Yes, you could certainly rephrase it like that (and that would probably be clearer - maybe I'll edit the question that way). – Ruben van Bergen May 02 '19 at 14:07
  • Shoot - I just realized I omitted an important analysis step that might change the answer to my question, but it also makes things more complicated if you don't know the wider context of the problem. I'll see if I can rephrase my question again in a way that's still comprehensible. Sorry about that. – Ruben van Bergen May 02 '19 at 14:35

2 Answers


Can your problem be cast as a test of $\theta=0$ versus $\theta=\alpha$ where 1) $X_i$, $Y_i$, and $Z_i$ are all observed, 2) $Z_i$ is binary and encoded as $\pm 1$, and 3) the models say

$$Y_i = g(X_i, \theta Z_i) + \varepsilon_i $$

? Furthermore, is the distribution of $\varepsilon$ independent of $Z$ given $g(X, \theta Z)$? (For instance, this would happen if the $Y_i$'s were independent Poisson draws with mean $g(X_i, \theta Z_i)$, or if the $Y_i$'s were jointly multivariate Gaussian with all pairwise correlations 0.01, means $g(X, \theta Z)$, and standard deviations $g(X, \theta Z)$. In other words, arbitrary dependence is allowed as long as $Z_i$ is not sneaking information into $Y_i$ through $\varepsilon_i$.)

If so, then $\theta=0$ implies that $Y_i$ is independent of $Z_i$ given $X_i$, and for any permutation $\sigma$, the conditional distribution $Y_i | X_i, Z_i$ is the same as the conditional distribution $Y_i | X_i, Z_{\sigma(i)}$. A typical permutation test applies in this scenario. The general procedure is to pick a statistic, such as $T_{\sigma} = \sum_i |\hat Y_{i, \sigma} - Y_i|$, and compute it across many random permutations of $Z$. Compute the p-value as you say: the fraction of $T_\sigma$'s below $T_{noperm}$.
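A minimal sketch of this procedure, using the absolute-error statistic $T_\sigma$ above. The specific form of $g$ and the simulated data are invented for illustration (nothing in the question pins them down); only the permutation logic is the point.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Hypothetical g for illustration: theta * Z enters the regression
# function as in the model above, with Z coded +/-1.
def g(XZ, a, theta):
    X, Z = XZ
    return a * np.exp(theta * Z * X)

def T_stat(Z_vec, X, Y):
    """T = sum_i |Yhat_i - Y_i| after refitting with this assignment of Z."""
    params, _ = curve_fit(g, (X, Z_vec), Y, p0=[1.0, 0.0])
    return np.sum(np.abs(g((X, Z_vec), *params) - Y))

# Simulated data with a true effect (theta = 0.3)
n = 150
X = rng.uniform(0.5, 2.0, n)
Z = rng.choice([-1.0, 1.0], n)
Y = 1.5 * np.exp(0.3 * Z * X) + rng.normal(0.0, 0.05, n)

T_noperm = T_stat(Z, X, Y)
T_perm = np.array([T_stat(rng.permutation(Z), X, Y) for _ in range(500)])

# Small T means a good fit, so the p-value is the fraction of permuted
# statistics that fall below the unpermuted one.
p_value = np.mean(T_perm <= T_noperm)
```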

If I understand your question right, you chose $T_{\sigma} = \frac{\operatorname{cor}(\hat Y_{\sigma}, Y)^2}{T_0}$, where $T_0$ is the goodness of fit with $\theta=0$. Dividing by $T_0$ affects the permuted and actual values the same way, so the test would come out the same if you simply omitted $T_0$. The same is true for any monotone transformation that does not depend on $\sigma$; if you meant to subtract $T_0$ rather than divide by it, the same advice applies.

It's still a nice idea to fit the model once with $\theta=0$ and see how it compares to the full model. But it does not fit tidily into the permutation testing framework.


Stating your model and hypotheses in their proper form: Firstly, let me note that your own formulation of your model is problematic, insofar as the "parameter" $\theta_K$ depends on $z_i$ and thus depends on the index $i$. It therefore cannot be a single value, and must instead be a set of values $\theta_{K1},...,\theta_{Kn}$, which then feeds into the vector $\theta$ to create a similar set of vector "parameters". This is quite a horrible way to write the model, and it can be greatly simplified by substituting each $\theta_{Ki}$ with its underlying expression in terms of $z_i$ and $\alpha$. If we let $\phi = \theta_{-K}$ be the remaining parameters, then we can write the regression function as:

$$g(x_i, z_i, \phi, \alpha) = f(x_i, z_i, \phi, \theta_{Ki}(z_i,\alpha)).$$

You can then write your more general model in standard nonlinear regression form as:

$$Y_i = g(x_i, z_i, \phi, \alpha) + \varepsilon_i \quad \quad \quad \quad \quad \varepsilon_i \sim \text{IID Noise Dist},$$

and your hypotheses of interest are:

$$H_0: \alpha = 0 \quad \quad \quad \quad \quad H_A: \alpha \neq 0.$$

Since $z_i$ affects the regression function only through $\theta_{Ki}$, according to your specified expression, your null hypothesis is that the response variable $Y_i$ is unrelated to the explanatory variable $z_i$ conditional on the other explanatory variable $x_i$, which is a standard regression hypothesis. To test this hypothesis, all you need to do is to formulate a test statistic that measures how conducive the data is to the alternative hypothesis, and then compute the p-value of the test from the null distribution of that test statistic.


Attempted goodness-of-fit tests via permutation simulation: What you propose in your question appears to be an attempt to conduct a kind of goodness-of-fit test, using the "fit improvement" in the coefficients of determination as your test statistic. (Your question does not specify the test statistic explicitly, but I assume it is $R_1^2 - R_0^2$, with larger values more conducive to the alternative hypothesis.) You propose to simulate the null distribution of this test statistic using random permutations of the vector $\mathbf{z}$.

This type of simulation method is set out in various papers, but I would recommend reading the summary paper Anderson and Robinson (2001) for an example of how it applies in multiple linear models (nonlinear models are a simple extension). This paper gives a good explanation of the requirements of a permutation test in multivariate regression. Unfortunately, the procedure you have proposed does not appear to me to work correctly, since it does not take account of the relationship between $\mathbf{y}$ and $\mathbf{x}$. Merely permuting the elements of the vector $\mathbf{z}$ does not simulate the uncertainty in $\mathbf{Y}$ arising from differences in $\mathbf{x}$, so I don't think this will work.

If you want to develop this method and check that it works, one thing you will definitely have to do is to spell out the method more clearly, including specifying the test statistic, p-value function, and your simulated estimator of the p-value. To advance this process, I will attempt to do that here. Given your model and the proposed test statistic (which I am assuming, since it was not specified clearly), we can write the true p-value function for your test as:

$$\begin{equation} \begin{aligned} p \equiv p(\mathbf{y}, \mathbf{x}, \mathbf{z}) &= \mathbb{P} \Big( R_1^2(\mathbf{Y}, \mathbf{x}, \mathbf{z}) - R_0^2(\mathbf{Y}, \mathbf{x}) \geqslant R_1^2(\mathbf{y}, \mathbf{x}, \mathbf{z}) - R_0^2(\mathbf{y}, \mathbf{x}) \Big| H_0 \Big) \\[6pt] &= \mathbb{P} \Big( R_1^2(\mathbf{Y}, \mathbf{x}, \mathbf{z}) - R_0^2(\mathbf{Y}, \mathbf{x}) \geqslant R_1^2(\mathbf{y}, \mathbf{x}, \mathbf{z}) - R_0^2(\mathbf{y}, \mathbf{x}) \Big| \alpha = 0 \Big). \\[6pt] \end{aligned} \end{equation}$$

Your proposed procedure generates random permutations $\stackrel\frown{\mathbf{z}}_1,...,\stackrel\frown{\mathbf{z}}_M \sim \pi(\mathbf{z})$ and then estimates the true p-value function as:

$$\begin{equation} \begin{aligned} \hat{p} \equiv \hat{p}(\mathbf{y}, \mathbf{x}, \mathbf{z}) &= \frac{1}{M} \sum_{k=1}^M \mathbb{I} \Big( R_1^2(\mathbf{y}, \mathbf{x}, \stackrel\frown{\mathbf{z}}_k) - R_0^2(\mathbf{y}, \mathbf{x}) \geqslant R_1^2(\mathbf{y}, \mathbf{x}, \mathbf{z}) - R_0^2(\mathbf{y}, \mathbf{x}) \Big) \\[6pt] &= \frac{1}{M} \sum_{k=1}^M \mathbb{I} \Big( R_1^2(\mathbf{y}, \mathbf{x}, \stackrel\frown{\mathbf{z}}_k) \geqslant R_1^2(\mathbf{y}, \mathbf{x}, \mathbf{z}) \Big). \\[6pt] \end{aligned} \end{equation}$$

This does not seem to me to be a valid approximation of the true p-value function, and I see no reason that it would even possess basic consistency properties for $M \rightarrow \infty$. It does not appear to appeal to any known pivotal quantity, and the fact of averaging over permutations of $\mathbf{z}$ does not appear to me to incorporate the effect of $\mathbf{x}$.

Ben
  • Thanks for the detailed answer. I don't understand what you mean by $\theta_K$ not being a single value. Would you say that a linear regression coefficient is not a single parameter, because it multiplies an independent variable and thus becomes a set of values? Because the same is true here. If it helps, Z could also be thought of as taking values of -1 or 1, and then $\theta_K$ simply equals $\alpha$ (under the alternative hypothesis). The two models only differ in this one parameter being 0 or $\alpha$, there isn't a vector of additional parameters in the larger model. – Ruben van Bergen May 02 '19 at 14:06
  • (Note that $Z$ is observed, in case that wasn't clear.) – Ruben van Bergen May 02 '19 at 14:37
  • At the core, it's just an issue of how the model is represented. I had the same objection as Ben ($\theta$ having multiple values). That's why I asked the question in my second comment. – eric_kernfeld May 02 '19 at 18:46
  • @Ben, it seems the disagreement between our answers is fundamental. Have you considered papers that discuss permutation of independent variables? Or only permutation of residuals? Also, for your expression for the true p-value, you put the same things on either side of the inequality. This might seem like a small quibble but I think the form and justification for the true p-value is actually key to our disagreement. – eric_kernfeld May 02 '19 at 19:15
  • @Ruben: You have given a formula for $\theta_K$ that makes it a function of $z_i$. There is not one single value $z_i$; there are a whole bunch of them for $i=1,...,n$. Thus, by the formula you have given, you are going to get different values of $\theta_K$ for the different binary indicators. – Ben May 02 '19 at 23:34
  • @Ben: I can see how the piece-wise definition could be confusing, but you can easily restate the problem such that $Z$ takes values of -1 or 1, and then $\theta_K$ has a constant value (because the sign of its effect in the model is now specified by multiplying with $Z$ - see also eric_kernfeld's comment & answer). Or, in the piece-wise definition you can also see that there is only one additional degree of freedom in the complex model: the value of $\alpha$. – Ruben van Bergen May 03 '19 at 00:12
  • @Ruben: Yes, you can (and should) restate the problem in a different way that does not cause problems --- that is what I have done in the answer, where $\alpha$ is the relevant model coefficient that replaces $\theta_K$. – Ben May 03 '19 at 00:18