Which transformation in a linear regression should I use when variance is larger on the small end?

Question

Note: this is part of Exercise 5.6 in Design and Analysis of Experiments, 2nd Ed., by Dean, Voss, and Draguljic.

If you go here and download the bicycle.txt dataset, then run some R commands such as

# Fit crank rate against speed and plot the residuals by treatment level
df = read.table("bicycle.txt", header = TRUE)
lmod = lm(rate ~ trtmt, df)
plot(df$trtmt, residuals(lmod))

you will see that the variance is larger for smaller values of the treatment variable trtmt:

Note that trtmt is the speed in mph, and rate is the crank rate required to keep the bicycle moving at that speed on level ground. This heteroskedasticity was expected, or at least explained, because the experimenter said this:

Note the larger spread of the data at lower speeds. This is due to the fact that in such a high gear, to maintain such a low speed consistently for a long period of time is not only bad for the bike, it is rather difficult to do.

All the transformations I've worked with so far have tended to handle larger variances for larger values of the treatment variable. Box-Cox fails because it suggests $\lambda=1.$ I tried the multiplicative inverse:

lmod_inv = lm(1/rate ~ trtmt, df)

and it does make the variance larger on the larger end. The problem then is that it's still heteroskedastic and additional transforms such as the logarithm don't appear to fix it.

What would be a good transform to use in this situation?

I looked at this thread, which helped before, but my distributions are already nearly uniform - certainly reasonably symmetric.

Many thanks!

I wouldn't bother here; the small difference in apparent range may well be largely due to the fact that there are three observations on the left rather than two (at least that's how it looks; if there are coincident points you should use a display that shows them better) -- the range of three independent random normal values is on average about 1.5 times as big as the range of two of them (residuals aren't independent, but the exact value doesn't matter). Even without that effect, the impact of such a modest difference in spread would be too small to worry about. ... ctd - Glen_b, Feb 24 '23 at 07:38
... What would worry me more is how very even the spread in those values is; you should be seeing more variation. It suggests that there's something odd going on -- like that the data are invented, perhaps, or there's an unmodelled but important covariate (a blocking factor perhaps) or maybe that trtmt was fitted as a factor somehow... or maybe the data could be discrete? Something that should be known before giving advice is obscure, anyway - Glen_b, Feb 24 '23 at 07:38
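
The factor of about 1.5 mentioned in the comment can be checked with a quick simulation. This is only a minimal sketch and not part of the original thread: the seed, the number of replications, and the helper name `sample_range` are arbitrary choices made for illustration.

```r
# Monte Carlo check: expected range of 3 vs. 2 independent standard normals.
set.seed(1)        # arbitrary seed, for reproducibility only
n_sim <- 100000    # number of simulated samples per sample size

# Range (max - min) of one sample of n standard normal draws
sample_range <- function(n) diff(range(rnorm(n)))

mean_range_2 <- mean(replicate(n_sim, sample_range(2)))
mean_range_3 <- mean(replicate(n_sim, sample_range(3)))

# Closed forms for comparison: E[range] is 2/sqrt(pi) for n = 2 and
# 3/sqrt(pi) for n = 3, so the theoretical ratio is 1.5.
ratio <- mean_range_3 / mean_range_2
print(c(n2 = mean_range_2, n3 = mean_range_3, ratio = ratio))
```

As the comment notes, residuals are not independent, so this only indicates the rough size of the effect.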

score 2 · Accepted Answer · answered Feb 23 '23 at 23:43

Ah, I found something that works. It helps if you read the book carefully! Anyway, on page 112, the book outlines a procedure whereby you can sometimes determine a transformation $h(Y_{it})$ if there is a power relationship between the variances and the means: $\sigma_i^2=k(\overline{y}_i)^q.$ The recommendation is:

$$h(y_{it})=
\begin{cases}
(y_{it})^{1-(q/2)} &\text{if $q\neq 2$}\\
\ln(y_{it}) &\text{if $q=2$ and all $y_{it}$'s are positive}\\
\ln(y_{it}+1) &\text{if $q=2$ and some $y_{it}$'s are zero.}
\end{cases}$$

We can compute the necessary pieces:

mean_rate = by(df$rate, df$trtmt, mean)
var_rate = by(df$rate, df$trtmt, var)
lnMean = log(mean_rate)
lnVar = log(var_rate)
plot(lnVar~lnMean)

yields the suggestive

A further

summary(lm(lnVar~lnMean))

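shows that the slope is $-1.919,$ which we'll take to be $q=-2$ for simplicity. The exponent is then $1-(q/2)=1-(-2/2)=2,$ so the suggested transformation is $h(y_{it})=y_{it}^2.$ Finally, the new model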

lmod_sq = lm(rate^2 ~ trtmt, df)

shows a reasonable homoskedasticity. Problem solved!
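
For reuse, the steps in this answer (group means and variances, a log-log regression, then the power $1-(q/2)$) could be wrapped in a small helper. The sketch below is an illustration under stated assumptions, not the book's code: `suggest_power` is a hypothetical name, `tapply` stands in for the `by` calls used in this answer, and the data are simulated with variance proportional to mean$^{-2}$ rather than taken from bicycle.txt.

```r
# Hypothetical helper: estimate q in sigma_i^2 = k * mean_i^q from the slope
# of log(variance) on log(mean), and return the suggested power 1 - q/2.
# (A q of exactly 2 would correspond to the log transform instead.)
suggest_power <- function(y, group) {
  group_mean <- tapply(y, group, mean)
  group_var  <- tapply(y, group, var)
  fit <- lm(log(group_var) ~ log(group_mean))
  q <- unname(coef(fit)[2])
  c(q = q, power = 1 - q / 2)
}

# Simulated stand-in data (NOT the bicycle data): five treatment levels whose
# standard deviation is 5 / mean, so the variance is 25 * mean^(-2), i.e. q = -2.
set.seed(42)       # arbitrary seed, for reproducibility only
level_means <- c(10, 15, 20, 25, 30)
n_per_level <- 500
sim <- data.frame(
  trtmt = rep(seq_along(level_means), each = n_per_level),
  rate  = unlist(lapply(level_means,
                        function(m) rnorm(n_per_level, mean = m, sd = 5 / m)))
)

est <- suggest_power(sim$rate, sim$trtmt)
print(round(est, 2))
```

With the real data, the call would presumably be `suggest_power(df$rate, df$trtmt)`, and the resulting power would then be rounded to a convenient value before refitting, as was done above with `rate^2`.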
