I am running a least squares linear regression on nonlinear data. I considered linearising by taking the log of both the IVs and the DV (log-log), and also by taking the log of the DV only (semilog).

Both linearise the data fairly well ($R^2 > 0.95$), with log-log performing slightly better. However, when I check for heteroskedasticity using the White test, the semilog transformation shows less heteroskedasticity.
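
For readers unfamiliar with the check mentioned above, here is a minimal sketch of White's test for a model with one regressor, written with the standard library only. The data are synthetic and made up for illustration (they are not the data from the question), and the helper names `solve`, `ols`, and `white_test` are hypothetical. With one regressor the auxiliary regression has two slope terms, so the LM statistic is compared with a chi-square distribution on 2 degrees of freedom, whose upper tail is simply $e^{-LM/2}$.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(X, y):
    """Least squares of y on the rows of X (each row starts with 1.0).

    Returns (residuals, R^2)."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    sse = sum(e * e for e in resid)
    return resid, 1.0 - sse / sst

def white_test(x, y):
    """White's LM test for y regressed on a single x; returns (LM, p-value)."""
    resid, _ = ols([[1.0, xi] for xi in x], y)
    squared = [e * e for e in resid]
    # Auxiliary regression: squared residuals on 1, x, x^2.
    _, r2_aux = ols([[1.0, xi, xi * xi] for xi in x], squared)
    lm = max(0.0, len(x) * r2_aux)
    return lm, math.exp(-lm / 2.0)  # chi-square(2) upper tail

# Synthetic illustration: same trend, two different error patterns.
x = [float(i) for i in range(1, 41)]
signs = [1.0 if i % 2 == 0 else -1.0 for i in range(1, 41)]
y_const = [1.0 + 2.0 * xi + 0.5 * s for xi, s in zip(x, signs)]      # constant spread
y_grow = [1.0 + 2.0 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]  # spread grows with x

lm_const, p_const = white_test(x, y_const)
lm_grow, p_grow = white_test(x, y_grow)
print(f"constant spread: LM = {lm_const:.2f}, p = {p_const:.3g}")
print(f"growing spread:  LM = {lm_grow:.2f}, p = {p_grow:.3g}")
```

In the setting of the question, `y` would be the logged DV and `x` either the IV or its log, so the test can be run once per transformation and the two p-values compared. With several IVs the auxiliary regression also needs cross-product terms, which this sketch does not cover.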

What is this saying about the underlying structure of my data, and which transform should I use? I know that for log-log the relationship should be $y = ax^k$ and for semilog it should be $y = ba^{cx}$, but what does it mean in this case, where both are acceptable?
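
Taking logs of the two forms in the question makes the comparison explicit. This is a worked restatement using the question's own symbols (note that $a$ plays a different role in each form), and any log base works:

$$\log y = \log a + k \log x \qquad \text{(log-log)}$$

$$\log y = \log b + (c \log a)\, x \qquad \text{(semilog)}$$

Both regress $\log y$ on a single term; they differ only in whether that term is $\log x$ or $x$.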

Many thanks!

Tom Waits
  • There is a principled way to go about identifying transformations in a regression, as detailed in Tukey's EDA. See https://stats.stackexchange.com/questions/35711 for a simple example. – whuber Aug 02 '23 at 13:48
  • If $x$ has a relatively narrow distribution away from $0$ then taking its logarithm may not make much apparent difference to the regression model within the range of the data. For example, consider $(10, 2)$, $(11, 5)$, $(12, 12)$ – Henry Aug 02 '23 at 13:50
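
Henry's three points can be checked numerically. The sketch below uses only those points; the helper name `r_squared` is made up for illustration. It computes the squared correlation for the semilog fit ($\log y$ on $x$), for the log-log fit ($\log y$ on $\log x$), and between $x$ and $\log x$ themselves.

```python
import math

def r_squared(u, v):
    """Squared Pearson correlation, i.e. R^2 of a simple regression of v on u."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv * suv / (suu * svv)

# The three (x, y) points from the comment above.
x = [10, 11, 12]
y = [2, 5, 12]
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

r2_semilog = r_squared(x, log_y)      # log(y) regressed on x
r2_loglog = r_squared(log_x, log_y)   # log(y) regressed on log(x)
r2_x_vs_logx = r_squared(x, log_x)    # how close log(x) is to linear in x here

print(f"semilog R^2:      {r2_semilog:.5f}")
print(f"log-log R^2:      {r2_loglog:.5f}")
print(f"x vs log(x) R^2:  {r2_x_vs_logx:.5f}")
```

Because $\log x$ is almost a linear function of $x$ over such a short range, both fits come out nearly perfect, which is the point of the comment.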

1 Answer

Welcome to the site. A few points:

  1. Data can't be linear. Only relationships can.

  2. It's hard to believe that both the log-log and semilog transformations make a nonlinear relationship linear. Can you show plots? Also, what is the $R^2$ measure you give? What does it relate to?

  3. I wouldn't transform at all, unless it makes substantive sense (which seems unlikely, given that you are doing different things). For instance, it often makes sense to take logs of variables related to money, because we think about most money variables multiplicatively rather than additively. If you make \$25,000 per year then a \$5,000 raise is huge. If you make \$250,000, it is small. And if you make \$2,500,000 then it is a rounding error.
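
The salary example in point 3 can be put in numbers. This is a small sketch using only the figures from the answer; the function name `log_change` is made up for illustration. It shows that the same additive raise corresponds to very different changes on the log scale.

```python
import math

def log_change(salary, raise_amount):
    """Change in log(salary) produced by an additive raise."""
    return math.log(salary + raise_amount) - math.log(salary)

for salary in (25_000, 250_000, 2_500_000):
    share = 5_000 / salary
    print(f"{salary:>9,}: +5,000 is {share:.1%} of salary, "
          f"log change = {log_change(salary, 5_000):.4f}")
```

On the log scale, equal steps mean equal proportional changes, which is why logs suit variables that people reason about multiplicatively.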

Instead of transforming the data to fit the model, it is better to use a model that is appropriate for the data. If the relationship between Y and X is nonlinear, you could use a spline of X if the form of the nonlinearity is unclear, or add an X^2 term if it is quadratic, or something else (I've even seen trig functions used, but very rarely).
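
As a minimal sketch of the X^2 option, assuming a single predictor: the model stays linear in its coefficients, so ordinary least squares still applies with Y left on its original scale. The data are synthetic and noise-free, and the helper names `det3` and `fit_quadratic` are made up for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(x, y):
    """Least squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]                   # sums of x^0 .. x^4
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]  # X'y
    A = [[s[i + j] for j in range(3)] for i in range(3)]               # X'X
    d = det3(A)
    coefs = []
    for col in range(3):  # Cramer's rule: swap one column of X'X for X'y
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][col] = t[r]
        coefs.append(det3(Ac) / d)
    return coefs

# Synthetic data generated from y = 1 + 2x + 0.5x^2 (illustration only).
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 for xi in x]

b0, b1, b2 = fit_quadratic(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Cramer's rule keeps the sketch short for three coefficients; for real data a regression routine from a statistics package is the more sensible choice, and a spline basis would be fitted the same way with different columns.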

Peter Flom
  • You can fix the broken formatting by putting a \ in front of each dollar sign. – Nobody Aug 02 '23 at 13:05
  • (2) is easy to believe, because it merely suggests the relative range of the explanatory variable is short. Using $R^2$ might be misleading in light of the concerns about heteroscedasticity and nonlinearity. Finally, "doing different things" misconstrues the two models: they differ only in whether an explanatory variable is expressed directly or in terms of its logarithm. – whuber Aug 02 '23 at 13:50