Say that I have a variable with lots of 0 values that needs log-transforming so I do log(variable+1) to transform it. How do I write that in my methods section as opposed to just 'the data was log-transformed to increase normality' etc. Say that I transformed height and weight. How would I write it (I know I can't really have a 0 value for those but just pretend).

And then on a graph: how would I write the axis labels? Pretend I have a box plot showing height (cm) on y axis and male/female on x axis, would it be log(height+1) (cm)?

Edit: okay fine, not height. Any variable where it would work. I've just been told to write up the results and don't have control over the actual analysis, can people please just put their extensive knowledge away for a second and help me learn how to report it. Pretend that all the methods make sense from a data analysis pov

Nick Cox
  • Welcome to Cross Validated! Why don't you write your justification for doing the $\log(x + 1)$ transformation as your actual justification? It isn't obvious that this transformation is useful. I find the first sentence of the linked answer to be a powerful statement: This is one reason why the automatic reflex to "just add 1 to values that might be zero before taking the log" is difficult to justify. – Dave Nov 06 '23 at 13:30
  • Strictly it would be something like log(height (cm) + 1) or log(height + 1 (cm)). The units of 1 should be at least tacit. – Nick Cox Nov 06 '23 at 14:48
  • I would always prefer axis labels on the original scale. Then you can explain use of log (y + 1) or log (x + 1). – Nick Cox Nov 06 '23 at 14:53
  • 1: A plot of your data distribution would probably help with some of the considerations. 2: Are the zeros real or are they a placeholder? Putting in zero in place of, for example, too low to detect in an assay can lead to problems. 3: You can use a transformation for analysis but display the data on the original scale as long as you explain. 4: You can use a non-linear scaling on the graph that nonetheless has the labels on the original scale. – Michael Lew Nov 06 '23 at 20:40
  • You really want to pick better motivating examples than height and weight; those are positive, never zero, and people already have an intuition for what their typical distribution looks like (contiguous, typically with a single peak, and not very long-tailed). But the intent of your question seems to be to ask whether this is still legitimate when we might be dealing with (a lot of) zero, near-zero or even slightly negative values, or when the distribution is more unusual. – smci Nov 08 '23 at 22:37
  • Your edit isn't conducive to our taking you seriously. "Pretend that all the methods make sense from a data analysis pov". Not a good stance. The central point at issue is whether and when this transformation works, is defensible and is a better idea than alternatives. If your presumption is that it is unproblematic, you are left with one question, what to call it, to which the answer is that some variation of log (y + 1) is the simplest name, which is where you came in. The stance that you are under orders just implies that your boss needs to read this. – Nick Cox Nov 10 '23 at 00:56
  • Please do not vandalize your question. When you posted on SE, you gave up ownership of the content under CC BY-SA 4.0. If there are no answers, you may delete your own question (see here): just click the faint gray 'delete' at lower left (your account needs to be registered for this). Otherwise, the thread will remain according to SE's rules. – Sycorax Nov 11 '23 at 15:52

2 Answers

I am not quite so negative on $\log\ (y + 1)$ or more generally $\log \ (y + c)$ for some constant $c$ as some colleagues. For $y$ read also $x$ according to taste or circumstance.

But three points seem general:

(1) $c$ is arbitrary beyond needing to be large enough to ensure that $y +c > 0$ (but not ipso facto utterly pointless)

(2) Do you have a good reason for your choice of $c$ therefore (which might involve some sensitivity analysis)?

(3) You need to show that a transformation works in the sense of achieving or getting closer to at least one specific and plausible goal.
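One way to approach the sensitivity analysis in point (2) is simply to try several values of $c$ and see how much the result moves. The following is a minimal sketch, not anyone's recommended procedure: the data values, the candidate constants and the helper name `zero_to_one_share` are all made up for illustration. It reports how much of the transformed range is taken up by the step from 0 to 1.

```python
import math

# Hypothetical data with zeros; the values are illustrative only.
y = [0, 1, 2, 10, 100]

def zero_to_one_share(values, c):
    """Share of the transformed range occupied by the step from 0 to 1.

    Uses log(value + c); assumes min(values) == 0 and c > 0.
    """
    if min(values) + c <= 0:
        raise ValueError("c must be large enough that every y + c > 0")
    full_range = math.log(max(values) + c) - math.log(min(values) + c)
    step = math.log(1 + c) - math.log(0 + c)
    return step / full_range

# Candidate constants are arbitrary choices for the comparison.
for c in (0.1, 0.5, 1.0):
    print(f"c = {c}: step 0 -> 1 takes {zero_to_one_share(y, c):.2f} of the range")
```

The smaller the constant, the more the zeros are pulled away from the rest of the data, so conclusions that flip between these choices deserve suspicion.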

Concretely, $\log\ ({\rm count} + 1)$ works sometimes for visualization of counts when zeros are present but otherwise the data seem to deserve something like a log scale. But that doesn't imply modelling in those terms, particularly as Poisson regression (or the same rose under another name) is in effect use of a logarithmic link function compatible with some observed zeros whenever conditional means are positive.
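For the visualization use, one option is to draw the axis on the $\log\ ({\rm count} + 1)$ scale while labelling the ticks with the original counts. A rough sketch of the bookkeeping, with tick values chosen arbitrarily for illustration:

```python
import math

# Tick labels chosen on the original (count) scale; illustrative values.
tick_labels = [0, 1, 10, 100, 1000]

# Where each label sits on an axis drawn in log(count + 1) units.
tick_positions = [math.log1p(t) for t in tick_labels]

for label, pos in zip(tick_labels, tick_positions):
    print(f"label {label:>4} is drawn at position {pos:.3f}")
```

In a plotting library you would plot the transformed values, place ticks at `tick_positions` and print `tick_labels` next to them. The axis title can then stay as the variable name with its original units, with a note in the caption or methods that the axis is on a log(count + 1) scale.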

Height of people may be a flippant example, but it is highly unconvincing in the senses that heights are often roughly symmetrically distributed; zeros are never observed; and adding 1 cm is on the face of it just as arbitrary as adding 1 inch (or its equivalent) or 1 mm would be.

I've sometimes found cube roots useful, as accommodating zeros (and indeed negative values) as easily as positive values. It is a weaker transformation than logarithms, but has other virtues, being about right for gamma-like distributions and sometimes being appropriate generally on dimensional grounds. Hydrologists and meteorologists often use cube roots for precipitation (especially daily precipitation) where zeros may be observed, and often observed frequently.
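A small sketch of a cube root that handles zero and negative values follows. The function name `signed_cube_root` and the sample values are hypothetical; the negative value is there only to show the sign handling.

```python
import math

def signed_cube_root(x):
    """Real cube root, defined for negative, zero and positive x."""
    # abs(x) ** (1/3) is the root of the magnitude; copysign restores the sign.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

# Illustrative values, including zeros and one negative number.
values = [0.0, 0.0, 1.0, 8.0, 27.0, -8.0]
print([round(signed_cube_root(v), 6) for v in values])
```

Taking the root of the magnitude and then restoring the sign avoids the complex result that Python returns for a negative float raised to a fractional power.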

Some would want to raise a flag for asinh, or inverse hyperbolic sine.
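For readers who have not met it, here is a quick look at how asinh behaves: it is defined at zero and for negative values, and for large positive $x$ it is close to $\log (2x)$. The test values are arbitrary.

```python
import math

for x in (-100.0, -1.0, 0.0, 1.0, 100.0, 10000.0):
    a = math.asinh(x)
    # The log(2x) comparison only makes sense for positive x.
    approx = math.log(2 * x) if x > 0 else float("nan")
    print(f"x = {x:>8}: asinh(x) = {a:.4f}, log(2x) = {approx:.4f}")
```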

In general, transforming to get closer to a linear relationship with even scatter is often a much bigger deal than being obsessed with the shape of marginal distributions. Also, as already hinted, transforming to get a better visualization can be ad hoc (optimistic translation: producing something fit for purpose) -- but if it makes things easier to see or think about, that is the whole point.

Many languages or environments now include a function log1p() on other grounds. I doubt that many people (e.g. natural or social scientists) are aware of that name. If you use it, it would be best to explain it.
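The "other grounds" are numerical: log1p(x) evaluates $\log (1 + x)$ accurately when $x$ is tiny, where forming $1 + x$ first loses the information in floating point. A short illustration using Python's standard library (the value of `x` is chosen only to make the effect visible):

```python
import math

x = 1e-20  # far smaller than double-precision machine epsilon

naive = math.log(1 + x)   # 1 + x rounds to exactly 1.0 before the log is taken
stable = math.log1p(x)    # computed without forming 1 + x first

print(naive, stable)

# For ordinary data values the two agree to rounding error.
print(math.log(1 + 5), math.log1p(5))
```

For counts and similar data the two give the same answer for practical purposes, so in a write-up "log(x + 1)" describes what was done regardless of which function was called.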

Note. I will mention a misunderstanding I have seen many times, the idea that $\log\ (x + {\rm smidgen})$ -- where ${\rm smidgen}$ is very small -- is a neat solution if $x$ is ever $0$. For any tiny ${\rm smidgen}$ the result is close to $\log x$ for $x \gg 0$, but it can inadvertently create massive outliers for $x$ at or very near $0$. A quick numerical example uses log base 10 for convenience, but the point applies with any base. Suppose counts vary $0, 1, 2, \dots$ and we use as ${\rm smidgen}$ the value $0.000001 = 10^{-6}$ or one millionth. Then $0$ gets mapped to $-6$, $1$ to almost $0$, $2$ to about $0.3010$ and so forth. As said, you've created massive outliers.
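The arithmetic in the note is easy to reproduce. The counts below are illustrative; the comparison with adding 1 is included only for contrast.

```python
import math

smidgen = 1e-6
counts = [0, 1, 2, 3, 10]

with_smidgen = [math.log10(c + smidgen) for c in counts]
with_one = [math.log10(c + 1) for c in counts]

for c, s, o in zip(counts, with_smidgen, with_one):
    print(f"count {c:>2}: log10(count + 1e-6) = {s:8.4f}   log10(count + 1) = {o:.4f}")
```

With the smidgen, the zero sits about six units below everything else, while the whole of the remaining data spans about one unit.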

COOLSerdash
Nick Cox

You wrote:

"Say that I have a variable with lots of 0 values that needs log-transforming so I do log(variable+1) to transform it"

No, I won't say this. Taking log(x + 1) is not a good idea. The +1 is arbitrary (and can give quite different results from, say, +0.1).

You haven't said why you want to "increase normality" but most of the reasons people think this is necessary are not really reasons. And if you have a lot of 0s, then log(x+1) won't be normal either.
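To see why the zeros remain a problem, note that a monotone transformation moves the zeros but keeps them together. In this made-up sample (the values and the helper name `share_at_minimum` are hypothetical), 60% of the observations are tied at the minimum both before and after the transformation:

```python
import math

# Hypothetical zero-heavy sample: 6 zeros out of 10 values.
x = [0, 0, 0, 0, 0, 0, 1, 3, 7, 20]
t = [math.log1p(v) for v in x]  # log(x + 1)

def share_at_minimum(values):
    """Fraction of observations tied at the smallest value."""
    lowest = min(values)
    return sum(v == lowest for v in values) / len(values)

print(share_at_minimum(x), share_at_minimum(t))
```

A spike like that cannot be part of a normal distribution, whatever the constant.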

If your reason is that you are going to do regression, then 1) OLS regression does NOT require normally distributed data; it assumes normally distributed errors (which you assess through the residuals), but 2) that assumption is relatively weak, and 3) there are other methods of regression (e.g. robust regression and quantile regression) that don't make any distributional assumptions about the residuals at all.

If you want normality for other reasons, please say what those are.

Peter Flom