
Assume that I have a log transformed model as follows:

Model 1: $Y = a + b\ln(X)$. Interpretation: a 1% increase in $X$ is associated with an average $b/100$ units increase in $Y$.

If I add $1$ to $X$ to avoid having $0$ values and get:

Model 2: $Y = c + d\ln(X+1)$

Should I interpret the model as "a 1% increase in $(X+1)$ is associated with an average $d/100$ units increase in $Y$"? Or are there better ways to interpret the model? Thanks.

  • A set of more specific, quantitative answers to this question is available at https://stats.stackexchange.com/questions/576504/. – whuber May 25 '22 at 12:21

2 Answers


You could, but it's not a very intuitive thing.

For simplicity, consider starting with $c=0$ and $d=1$.

If $x$ is 0.01, then $y = \ln(1+x) \approx 0.01$, and a 1% increase in $(1+x)$ roughly doubles $\ln(1+x)$ (and hence $y$) to $y\approx 0.02$, which is about a 100% increase in $y$. Meanwhile, if $x$ is 2, the same 1% increase in $(1+x)$ gives a little less than a 1% increase in $\ln(1+x)$ (and hence $y$), about 0.9%, and if $x=100$, it gives close to a 0.2% increase in $\ln(1+x)$.
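These figures can be checked numerically. The following is a minimal sketch in Python, assuming $c=0$ and $d=1$ as above; the function name `relative_change_in_log1p` is made up for illustration.

```python
import math


def relative_change_in_log1p(x, pct=1.0):
    """Relative change in ln(1 + x) when (1 + x) grows by pct percent.

    Assumes c = 0 and d = 1, so y is simply ln(1 + x).
    """
    old = math.log1p(x)
    # Multiply (1 + x), not x, by the growth factor.
    new = math.log((1.0 + x) * (1.0 + pct / 100.0))
    return (new - old) / old


for x in (0.01, 2.0, 100.0):
    change = relative_change_in_log1p(x)
    print(f"x = {x:>6}: relative change in ln(1+x) = {change:.2%}")
```

The three relative changes come out at roughly 100%, 0.91% and 0.22%, matching the statements above.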

Changing $d$ from $1$ to something else (while $c$ remains at $0$) doesn't change this effect on $y$, because the relative increase in $y$, i.e. $(y_{\text{new}}-y_{\text{old}})/y_{\text{old}}$, is unaffected by the value of $d$. However, changing $c$ does affect it, because the old value is in the denominator, and once $c$ is non-zero, $d$ also matters.
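To make the cancellation explicit, here is a short derivation for Model 2 under a 1% increase in $(1+x)$, i.e. multiplying $(1+x)$ by $1.01$. The subscripts "old" and "new" are just labels for the value of $y$ before and after the increase.

$$y_{\text{old}} = c + d\ln(1+x), \qquad y_{\text{new}} = c + d\ln\bigl(1.01\,(1+x)\bigr) = y_{\text{old}} + d\ln(1.01)$$

$$\frac{y_{\text{new}}-y_{\text{old}}}{y_{\text{old}}} = \frac{d\ln(1.01)}{c + d\ln(1+x)}$$

With $c=0$ this reduces to $\ln(1.01)/\ln(1+x)$, which is free of $d$. With $c\neq 0$ (and $d \neq 0$), dividing numerator and denominator by $d$ gives $\ln(1.01)/\bigl(c/d + \ln(1+x)\bigr)$, so the relative change depends on the ratio $c/d$.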

Glen_b
  • Naturally I agree that this transformation doesn't allow interpretation in terms of percent change, but if there are zeros present, that is true anyway. – Nick Cox Jan 05 '23 at 12:26

The interpretation of the model should depend partly on the range of values of $x$ and on how it is to be applied. If most of the $x$ values are large, say more than 100, and if it is to be used to predict $y$ corresponding to such large values of $x$, then a good approximate interpretation would be: a 1% increase in $x$ is associated with an average $d/100$ units increase in $y$. For $x > 100$ the proportionate difference between $x+1$ and $x$ is small.
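How good the large-$x$ approximation is can be checked directly. The sketch below (Python; the function names are hypothetical, and $d=1$ is an arbitrary default since the ratio does not depend on $d$ as long as $d \neq 0$) compares the exact change in $y$ under Model 2 from a 1% increase in $x$ with the rule-of-thumb value $d/100$.

```python
import math


def exact_change_in_y(x, d=1.0, pct=1.0):
    """Exact change in y = c + d*ln(x + 1) when x (not x + 1) grows by pct percent.

    The intercept c cancels in the difference, so it is not needed here.
    """
    new_x = x * (1.0 + pct / 100.0)
    return d * (math.log1p(new_x) - math.log1p(x))


def ratio_to_rule_of_thumb(x, d=1.0, pct=1.0):
    """Exact change divided by the approximation d * pct / 100 (requires d != 0)."""
    return exact_change_in_y(x, d, pct) / (d * pct / 100.0)


for x in (1.0, 10.0, 100.0, 1000.0):
    print(f"x = {x:>7}: exact change / (d/100) = {ratio_to_rule_of_thumb(x):.2f}")
```

The ratio is roughly 0.50 at $x=1$, 0.90 at $x=10$ and 0.99 at $x=100$, which is consistent with the approximation being reasonable only for large $x$. It can never quite reach 1, because $d/100$ is itself a first-order approximation to $d\ln(1.01)$.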

If, however, the model is to be used to predict $y$ corresponding to small values of $x$, then this approximation would be unhelpful and your interpretation would be more appropriate, although as Glen_b says it's not very intuitive.

If most of the $x$ values are small, then a better way to avoid the zeroes might be to add a different constant, much less than 1.

Adam Bailey
  • I agree on disliking log(x + 1) as a transformation on various grounds, but the last suggestion here is dubious, if not dangerous. ln(zero + epsilon) may sound more conservative than ln(zero + 1), but the smaller epsilon is, the larger in magnitude the negative logarithm created. Using base-10 logs so the numbers are easy, log10(1/1000) = -3, log10(1/1000000) = -6. log(x + epsilon) thus is all too likely to create outliers out of zeros, and outliers whose values depend crucially on an arbitrary choice of epsilon. This is why 1 was suggested in the first place, presumably. – Nick Cox Mar 15 '13 at 14:07
  • @NickCox I only said 'might'! However you are right to highlight that the choice of a constant to add to avoid zeroes is not straightforward. As you show, it isn't a case of the smaller the better. The constant should be chosen with regard to the range of the $x$ values and there is nothing special about 1. If, say, the range was between 0 and 1 then adding 0.1 would be better than adding 1. A previous question which addresses this issue in detail is http://stats.stackexchange.com/questions/30728/how-small-a-quantity-should-be-added-to-x-to-avoid-taking-the-log-of-zero. – Adam Bailey Mar 15 '13 at 20:00
  • As this thread is still active, I note briefly that my line on log(x + 1), or log(y + 1) for that matter, has softened: sometimes it does seem a transformation that should be considered. Pragmatically, often neglected steps are (1) plotting log(x + 1) against x, (2) plotting the distribution of log(x + 1), and (3) scatter plots of log(x + 1) and other relevant variables. More generally, log(x + 1) is a choice of c in log(x + c), and sensitivity to this choice can be important. In a nutshell, sometimes log(x + 1) introduces massive outliers (a big deal if true). Sometimes it has only minor effect. – Nick Cox Jan 05 '23 at 12:33
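The sensitivity to the added constant discussed in these comments can be illustrated with a small sketch. The data values and the function name `log10_shifted` are hypothetical, chosen only so the numbers are easy to read; base-10 logs are used, as in the comment above.

```python
import math

# Hypothetical data containing one zero; not taken from the thread.
xs = [0.0, 1.0, 10.0, 100.0]


def log10_shifted(values, c):
    """Return log10(v + c) for each v; c > 0 is the constant added to avoid log(0)."""
    return [math.log10(v + c) for v in values]


for c in (1.0, 1e-3, 1e-6):
    transformed = log10_shifted(xs, c)
    # Gap between the transformed zero and the next smallest observation.
    gap = transformed[1] - transformed[0]
    rounded = [round(t, 3) for t in transformed]
    print(f"c = {c:g}: transformed = {rounded}, gap below next value = {gap:.3f}")
```

With $c=1$ the zero maps to $0$ and sits about $0.3$ below the next value; with $c=10^{-3}$ and $c=10^{-6}$ it maps to about $-3$ and $-6$, so the gap grows to roughly 3 and 6 while the other transformed values barely move. Whether that matters for a given analysis depends on the data, which is why plotting the transformed variable is suggested above.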