
I have a data set to which I fit a model and examine the residuals. In one case (a) I maximize likelihood, and in another case (b) I impose one of the model parameters. In case (a) the rms of the residuals is smaller than for (b), but case (b) gives a higher Kolmogorov-Smirnov probability for the residuals being normal (after multiple resampling of the data and keeping the median KS p-value).

This surprised me, but it got me thinking. Is there a fitting procedure that gives model parameters such that the residuals are the ones most consistent with a normal distribution?

  • Consider the simplest non-trivial, generally applied situation of estimating the mean from a set of data: "most consistent with a normal distribution" has every possible real number as an equally valid solution! – whuber Mar 11 '24 at 21:12
  • So you're telling me that looking for most-normal model doesn't make sense? – Isambard Kingdom Mar 11 '24 at 21:14
  • Yes, that's what @whuber is telling you. Think carefully about the situation he describes; you'll realize that the distribution of the residuals is exactly the same regardless of what number you choose for the mean, with the exception of a location shift. So... any choice of an estimate gives you residuals that are "most consistent" with any distribution; after all, a Normal(12,2) distribution is just as much a Normal distribution as a Normal(0,2) distribution. – jbowman Mar 11 '24 at 22:38
  • Other than a location shift? Isn't that what a mean is all about? Perhaps I'm missing something, but the argument seems circular. – Isambard Kingdom Mar 12 '24 at 01:35
  • I think the "fitting procedure" you're looking for is a Box-Cox regression. – Durden Mar 13 '24 at 22:56
  • @whuber, consider the basic situation of estimating parameters in a linear model $Y=AX+B+\epsilon$ — we can ask which parameters give the most normal residuals, and that question probably has a unique answer. – Matt F. Mar 14 '24 at 04:58
  • @MattF. No, it would be invariant with respect to $B$, since we have not specified that $E[\epsilon] = 0$. Also, if $X$ is normally distributed then $Y-AX$ is normal for any value of $A$, and the best fit would be purely random. – Lukas Lohse Mar 14 '24 at 10:06
  • @LukasLohse, that doesn't describe the basic situation of a linear regression, where $X$ is not normally distributed because $X$ (like $Y$) is a finite set of observations, and there are a continuum of possible values for the parameters $A$ and $B$ even under the constraint $E[\epsilon]=0$. – Matt F. Mar 14 '24 at 12:35
  • @Matt At a minimum you would need to standardize the residuals to a fixed sum (such as zero), for otherwise you could add an arbitrary constant to $B$ in any solution without changing the normality of the residuals. There are many other ways in which residuals can be made to look normal, too, indicating that this is not a great way to conceive of or solve an estimation problem. – whuber Mar 14 '24 at 13:41
  • @MattF. I'm not sure what your 2nd comment is trying to say, but check out the simulations in my answer. Yes, in a finite sample you will often have a "unique" value with maximal normality (the search could also diverge), but it has little to do with the true value of $A$. – Lukas Lohse Mar 14 '24 at 15:07

1 Answer


Since there's a good amount of discussion in the comments, I will demonstrate why the normality of the residuals is a poor fitting criterion. I will use the Shapiro-Wilk test instead of KS, since the KS test has to be supplied with $\mu$ and $\sigma$, which would themselves have to be estimated.
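
(As an aside, a minimal sketch of why supplying estimated $\mu$ and $\sigma$ to ks.test is problematic: under the null the p-values should be uniform, but with parameters estimated from the same data they pile up near 1, i.e. the test is miscalibrated.)

# sketch: KS p-values with mu and sigma estimated from the same data
# are not uniform under the null, so the test is miscalibrated
p_ks <- replicate(2000, {
  z <- rnorm(300)
  ks.test(z, "pnorm", mean(z), sd(z))$p.value
})
hist(p_ks) # piles up near 1 instead of being flat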

Let's consider $Y = aX + b + \epsilon$, with $a = 1, b = 0, \epsilon\sim\mathcal N (0,1)$, as well as $X \sim \mathcal N (0,1)$ and simulate with $n = 300$.

set.seed("112")
x <- rnorm(300)
y <- x + rnorm(300)

First, whuber's point about the mean of the residuals not mattering for normality: let us test three values of $b$ (0, 1, 99999) while using the correct value for $a$. It doesn't make a difference.

> shapiro.test(y - x)

        Shapiro-Wilk normality test

data:  y - x
W = 0.99404, p-value = 0.2873

> shapiro.test(y - x - 1)

        Shapiro-Wilk normality test

data:  y - x - 1
W = 0.99404, p-value = 0.2873

> shapiro.test(y - x - 99999) # if you use too many 9s things get rounded

        Shapiro-Wilk normality test

data:  y - x - 99999
W = 0.99404, p-value = 0.2873
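
The same invariance can be checked in one line (a minimal sketch, using the x and y simulated above):

# Shapiro-Wilk is location invariant, so all three p-values are identical
sapply(c(0, 1, 99999), function(b) shapiro.test(y - x - b)$p.value)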

Now for my point: if $X$ is normal, then in a finite sample there will be some value of $a$ that maximizes the normality of the residuals, but that value is essentially random. I'll test values from $-10$ to $10$ and plot the p-values (biggest is best), with the true value as a solid black line and the slope of the fitted regression as a dashed red line:

a_s <- seq(-10, 10, by = 0.1)   # candidate slopes
res <- sapply(a_s, FUN = function(a){shapiro.test(y - a*x)})
res <- as.data.frame(t(res))    # one row of test results per candidate slope

plot(a_s, res$p.value, ylim = c(0, 1))
abline(v = 1)                                      # true value a = 1
abline(v = coef(lm(y ~ x))[2], lty = 2, col = 2)   # OLS slope, dashed red

[Figure: Shapiro-Wilk p-value of the residuals as a function of the candidate slope a, with the true value a = 1 as a solid black line and the fitted regression slope as a dashed red line]

The highest p-value lies around $a = -1$! If you repeat the simulation you will see a wide variety of shapes, some of them peaking around $a = 1$ but others dipping, which is the opposite of what you want.

Just run this code snippet over and over again and you will see how ill-defined this approach is:

x <- rnorm(300)
y <- x + rnorm(300)
res <- sapply(a_s, FUN = function(a){shapiro.test(y - a*x)})
res <- as.data.frame(t(res))
plot(a_s, res$p.value, ylim = c(0, 1))
abline(v = 1)
abline(v = coef(lm(y ~ x))[2], lty = 2, col = 2)

If you replace x <- rnorm(300) with x <- runif(300) you will get better-behaved results, but it will still be incredibly weird. Using a KS test only makes things worse.
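
To quantify how unstable this is, a minimal sketch that repeats the simulation and, in each replicate, searches for the slope whose residuals look most normal via optimize(); its spread around the true value $a = 1$ should dwarf that of the least-squares slope:

# compare the "most normal residuals" slope to the OLS slope over many replicates
sim_once <- function(n = 300) {
  x <- rnorm(n)
  y <- x + rnorm(n)
  # the p-value surface is jagged, so optimize() may only find a local
  # maximum -- which itself shows how ill-defined this criterion is
  a_norm <- optimize(function(a) shapiro.test(y - a * x)$p.value,
                     interval = c(-10, 10), maximum = TRUE)$maximum
  c(most_normal = a_norm, ols = unname(coef(lm(y ~ x))[2]))
}
ests <- replicate(200, sim_once())
apply(ests, 1, sd)  # spread of each estimator around the truth (a = 1)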

Lukas Lohse
  • I’d be more convinced if I saw this with W values plotted instead of p values – Matt F. Mar 14 '24 at 19:35
  • @MattF. plot(a_s, res$statistic) looks identical, except that the y axis goes from 0.992 to 0.998. As you can see from the p-values, this is the normal range you'd expect from pure randomness in an $n = 300$ i.i.d. normal sample. – Lukas Lohse Mar 14 '24 at 21:03
  • Thanks! I have upvoted now. – Matt F. Mar 15 '24 at 01:53
  • The graph of results is curious. The p-value is very high at about -1, but otherwise still moderately high. Shouldn't we interpret p-values near 1 as significant? If any value of a is permissible, then why is there any peak at all? Why does it rapidly change at about 0? – Isambard Kingdom Mar 16 '24 at 00:50
  • @IsambardKingdom "Shouldn't we interpret p-values near 1 as significant?" No, p = 0.99 doesn't "make the null hypothesis more true" than 0.5 or even 0.1; these are just the variations you'd expect. "Why does it rapidly change at about 0?" For large values of $|a|$ the difference is dominated by $X$, so the test statistic approaches the value for $X$ alone. For $a$ around 1 the mixture of the two samples is more even, so there is more variation in the shapes. – Lukas Lohse Mar 18 '24 at 20:46
  • @IsambardKingdom on the meaning of p-values I recommend this short and readable book chapter by Michael Lew: https://link.springer.com/chapter/10.1007/164_2019_286 – Lukas Lohse Mar 18 '24 at 20:51
  • Lukas, I appreciate your input on this question. I will have a look at the Michael Lew article. Thank you very much. – Isambard Kingdom Mar 19 '24 at 01:55