Suppose we perform a Box-Cox transformation in R on the following data:

library(MASS)
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 + rnorm(n)

Fit a linear model:

m <- lm(y ~ x)

Run the Box-Cox procedure and take the $\lambda$ with the highest profile log-likelihood:

bc <- boxcox(y ~ x)

lambda <- bc$x[which.max(bc$y)]

What I'd like to do now is fit a new model:

m2 <- lm(y^lambda ~ x)

The thing is that $\lambda$ is chosen for data centered at $1$, so should I divide all $y$ observations in the sample by the sample median?

m2 <- lm((y/median(y))^lambda ~ x)

I know that this operation only rescales $y$, so it doesn't change the shape of the distribution. Is shifting the sample median to $1$ actually necessary?
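
One way to check the scaling claim empirically is sketched below. It is an illustration under assumptions that are not part of the example above: the noise is made multiplicative so that every $y$ is strictly positive, and $\lambda$ is fixed at $1/3$ instead of being estimated with boxcox.

```r
# Sketch: dividing a positive response by a constant before taking a power
# only rescales the fitted coefficients.
set.seed(1)
n <- 100
x <- runif(n, 1, 5)
y <- x^3 * exp(rnorm(n, sd = 0.2))  # assumption: multiplicative noise, so y > 0
lambda <- 1/3                        # assumption: fixed for illustration only

m_raw    <- lm(y^lambda ~ x)
m_scaled <- lm((y / median(y))^lambda ~ x)

# Coefficients differ only by the constant factor median(y)^lambda
all.equal(coef(m_raw), coef(m_scaled) * median(y)^lambda)

# Scale-free summaries such as R^2 agree between the two fits
all.equal(summary(m_raw)$r.squared, summary(m_scaled)$r.squared)
```

Both comparisons should come back TRUE, which would indicate that the two fits describe the same model in different units.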

  • Why are you applying Box-Cox? It was useful in the old days, before robust methods were practical, but now, why not just go to robust regression or quantile reg.? – Peter Flom Nov 14 '23 at 12:40
  • @PeterFlom Just for the sake of curiosity. It's not a practical application or anything like that. – Adam Bogdański Nov 14 '23 at 12:42
  • (1) The data you generate are not well suited for a Box-Cox transformation. Indeed, many will be negative. (2) There's no difference between the two models you propose because they merely change the units in which the response is expressed, which will be absorbed into the estimated coefficients. (3) Ignore the man behind the curtain who attempts to imply this is somehow old-fashioned or superseded. You are wise to learn about techniques like transformations because they are powerful, flexible, universal, and can provide insight not afforded by other methods. – whuber Nov 14 '23 at 14:10
  • @whuber Thank you! So in practice it doesn't matter much whether I use y^lambda ~ x or ((y/median(y))^lambda - 1)/lambda ~ x, right? However, I guess that the effect (of using y^lambda instead of the full transformation) for tails of y could be a lot more significant, wouldn't you agree? BTW I saw your excellent answer to this post the other day. A huge thank you for all the great work that you do! – Adam Bogdański Nov 14 '23 at 14:24
  • There's no effect on the model results. The main reason for using the more complex expression is for interpretability and comparison to fits with other values of $\lambda.$ I tend to use the simpler (first) formula in scientific applications that already suggest a power law might be applicable. – whuber Nov 14 '23 at 16:32
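
The point made in the comments can be written out as a short derivation. This is a sketch in notation that is not in the original thread: $c > 0$ is any positive constant (for example $c = \operatorname{median}(y)$), $b = c^{-\lambda}/\lambda$, and $\lambda \neq 0$ is assumed.

$$\frac{(y/c)^{\lambda} - 1}{\lambda} = -\frac{1}{\lambda} + \frac{c^{-\lambda}}{\lambda}\, y^{\lambda} = -\frac{1}{\lambda} + b\, y^{\lambda}.$$

The full Box-Cox response on the rescaled data is therefore an affine function of $y^{\lambda}$. Ordinary least squares is equivariant under affine changes of the response, so if regressing $y^{\lambda}$ on $x$ gives $\hat\beta_0 + \hat\beta_1 x$, regressing the left-hand side on $x$ gives

$$\left(-\frac{1}{\lambda} + b\,\hat\beta_0\right) + b\,\hat\beta_1\, x.$$

The residuals are multiplied by $b$, so $R^2$ is unchanged and the slope's $t$ statistic is unchanged up to sign (the sign flips when $\lambda < 0$, because then $b < 0$).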
