17

Assume that I have a variable whose distribution is skewed positively to a very high degree, such that taking the log will not be sufficient in order to bring it within the range of skewness for a normal distribution. What are my options at this point? What can I do to transform the variable to a normal distribution?

Nick Cox
histelheim
  • "Normalize" doesn't usually mean "transform to normality". It usually means to scale to (0,1) or (sometimes) to standardize. Using the term in this context is likely to cause confusion. – Glen_b Feb 28 '14 at 01:18
  • 2
    Just to make sure, "skewed negative" means the long tail pointing to left or right? If it's really skewed negatively (long tail left), log-transformation wouldn't work very well. – Penguin_Knight Feb 28 '14 at 01:28
  • 1
    If it's really skewed left (has negative skewness), then log will tend to make it substantially worse. Can you show a picture, such as a QQ plot or a histogram or something? Why are you transforming? – Glen_b Feb 28 '14 at 01:32
  • Sorry got +/- screwed up, this is positive and right skew, as in the righthand panel: http://kwiznet.com/px/homes/i/math/G10/Statistics/Skew.png – histelheim Feb 28 '14 at 01:45
  • take another log, like $z=\ln(\ln(x))$ – Aksakal Feb 28 '14 at 01:52
  • @Aksakal what are the consequences of taking the log 2x? Under what conditions is this admissible? – histelheim Feb 28 '14 at 01:54
  • log transform is not a tool to cure the skewness. It's used to even out the variance in economic time series, which often are exponentially growing, such as prices or GDP. If you're dealing with such series, log will fix the skewness too. So, I thought that if your series are growing faster than exponentially, then you could temper it with one more log, but it could be something else like a Box-Cox transform. It is not impermissible; as long as it suits your data you're free to use any transform you wish. – Aksakal Feb 28 '14 at 02:04
  • Perhaps you could provide more information like why you want to transform it and what you know about the distribution besides the skew. As it is, your question is impossible to answer well – John Feb 28 '14 at 02:39
  • 7
    Reciprocal transformation is stronger than logarithmic and often preserves meaning, as units of measurement are just inverted. For example, the reciprocal of the time to do something is a kind of speed, and vice versa. The reciprocal of miles per gallon or km per litre makes sense. Reciprocals invert the order and can be negated if that is preferable. They are naturally part of the Box-Cox scheme with that extra detail. All values should be positive for this to work well. (In principle, it would work with all values negative, but I've yet to see an example in practice.) – Nick Cox Feb 28 '14 at 07:39
  • 3
    @Aksakal $\ln(\ln())$ I can't see as a good idea. The result is statistically meaningful only for values $>1$. If values are counts, it is artificial that a transform be undefined for 0s or 1s, regardless of whether those values occur in the data. If values are measurements the restriction means that the validity of a transformation depends on the choice of units of measurement, which is absurd, as if $\ln(\ln(0.7))$ can't be done because I use cm, but $\ln(\ln(7))$ can be done because I use mm. (That logarithms yield complex results for negative arguments I don't think helps statistically.) – Nick Cox Feb 28 '14 at 10:13
  • 3
    @Aksakal Too strong to say "log transform is not a tool to cure the skewness": if skewness is the only issue, logs often work very well. If your point is that skewness of marginal distributions need not be a major problem, I tend to agree. – Nick Cox Feb 28 '14 at 10:22
  • @NickCox: Do you have a reference for the claim that reciprocal transformations preserve meaning? – histelheim Oct 19 '15 at 00:57
  • 1
    Hmm: why do you seek a reference here? The argument is just as stated and elementary. Sometimes you have to fight convention. For example, population density (people/area) is the conventional measure, but (zeros aside) area/people has as much or more meaning and is used (although less common in my experience). – Nick Cox Oct 19 '15 at 12:25
  • @NickCox: for academic work references are helpful, that's all. – histelheim Oct 19 '15 at 14:43
  • 4
    I naturally agree, but if I used squares or logarithms, I wouldn't feel obliged to offer references, and similarly here. But the usefulness of reciprocals, particularly times and speeds, was stressed by (e.g.) Tukey, J.W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley and in several of his papers. Miles per gallon and gallons per mile (or conversely litres per km and km per litre) are commonplace in discussions of car performance data. Densities and their reciprocals are fairly standard examples in geography and demography. – Nick Cox Oct 19 '15 at 14:55
  • I found this is a really good article as it explains which transformation to apply for different data presentations and levels of skew. A log transform isn’t always the most optimal way to go and can sometimes make the situation worse… https://www.anatomisebiostats.com/biostatistics-blog/transforming-skewed-data – Sarah Baker Aug 26 '19 at 10:00

3 Answers

14

Try a plain Box-Cox transformation, as in Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B, 26, 211--234. SAS documents the log-likelihood function under Normalizing Transformations; you can maximize it to find the optimal $\lambda$ parameter, as described in Atkinson, A. C. (1985), Plots, Transformations, and Regression, New York: Oxford University Press.

Once you have the log-likelihood function, it is very easy to implement yourself. Alternatively, if you have a statistics package such as SAS or MATLAB, use its built-in command: boxcox in MATLAB and PROC TRANSREG in SAS.

Also, in R this is in the MASS package, function boxcox().
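As a rough illustration of how little code the log-likelihood approach needs, here is a minimal Python sketch using only the standard library. It is not taken from any of the packages named above: the function names, the grid search over $\lambda$ and the simulated data are all assumptions made for the example. The profile log-likelihood used is the usual one for a sample without covariates, $-\tfrac{n}{2}\ln\hat\sigma^2(\lambda) + (\lambda-1)\sum\ln x_i$, with additive constants dropped.

```python
import math
import random


def boxcox(x, lam):
    """Box-Cox power transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]


def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda (additive constants dropped)."""
    n = len(x)
    y = boxcox(x, lam)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # maximum-likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in x)


def best_lambda(x, lo=-3.0, hi=3.0, steps=601):
    """Crude grid search for the lambda that maximizes the log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: boxcox_loglik(x, lam))


def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: the reciprocal of a positive, roughly normal variable.
# It is right-skewed, and taking logs leaves noticeable skewness behind.
random.seed(0)
data = [1.0 / max(random.gauss(1.0, 0.2), 0.05) for _ in range(2000)]

lam = best_lambda(data)
skew_raw = skewness(data)
skew_log = skewness([math.log(v) for v in data])
skew_bc = skewness(boxcox(data, lam))

print(f"estimated lambda: {lam:.2f}")
print(f"skewness raw / log / Box-Cox: {skew_raw:.3f} / {skew_log:.3f} / {skew_bc:.3f}")
```

A grid search is used only to keep the sketch dependency-free; a numerical optimizer or one of the package routines mentioned above would normally be preferable. All values must be strictly positive for the transform to be defined.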

Aksakal
  • 61,310
6

For positive skew (tail is on the positive end of the x axis), there are the square root transformation, the log transformation, and the inverse/reciprocal transformation (in order of increasing severity). Thus, if the log transformation is not sufficient, you can use the next level of transformation. Box-Cox covers all of these as special cases of a single power family indexed by $\lambda$ ($\lambda = 1/2$ is the square root, $\lambda = 0$ the log, $\lambda = -1$ the reciprocal), so estimating $\lambda$ lets the data choose among them.
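The ordering by severity can be checked numerically. The sketch below is only an illustration on simulated data (the data-generating choice and all names are assumptions): it applies each rung of the ladder to a right-skewed sample for which the log is not enough, and prints the sample skewness. The reciprocal is negated so that the original ordering of the values is preserved.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: reciprocal of a positive, roughly normal variable,
# so the reciprocal (not the log) is the transformation that symmetrizes it.
random.seed(1)
z = [max(random.gauss(1.0, 0.2), 0.05) for _ in range(2000)]
x = [1.0 / v for v in z]

ladder = {
    "raw": x,
    "sqrt": [math.sqrt(v) for v in x],
    "log": [math.log(v) for v in x],
    "neg_reciprocal": [-1.0 / v for v in x],  # negated to keep the order
}

skews = {name: skewness(vals) for name, vals in ladder.items()}
for name, s in skews.items():
    print(f"{name:>15s}: skewness = {s:+.3f}")
```

On a sample like this the skewness falls at every step down the ladder, and only the reciprocal brings it close to zero. On data that the log already symmetrizes, the reciprocal would overshoot and produce negative skew, which is why it is worth checking each step rather than jumping to the strongest one.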

-5

Most software suites will use Euler's number as the default log base, AKA: natural log. You can use a higher base number to rein in excessively right-skewed data. How you do it syntax-wise depends on the software you are using.

If you need to back-transform your values once estimation is done, this method might be a little easier, because all you have to do is raise your log base to the power of the transformed variable.

  • 9
    This makes no sense at all. The logarithms to two different bases differ only by a multiplicative constant and the skewness reduction by either is thus the same. Thus 1 10 100 1000 10000 is symmetrical after transforming log base 10 and it would be just as symmetrical after log base $e$ or log base 2. The only difference is a scaling factor. – Nick Cox Jun 09 '16 at 21:46
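The point made in the comment above is easy to verify. This short Python sketch (names and the second data set are assumptions made for the example) computes the skewness of the same values after taking logs to base 10, base $e$ and base 2. Because $\log_b x = \ln x / \ln b$, changing the base only rescales the values, and skewness does not change under rescaling.

```python
import math


def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


bases = {"10": 10.0, "e": math.e, "2": 2.0}

# The example from the comment: symmetric after a log transform in any base.
symmetric = [1, 10, 100, 1000, 10000]
skew_symmetric = {
    name: skewness([math.log(v, b) for v in symmetric]) for name, b in bases.items()
}

# A hypothetical sample that stays skewed after logs: the base still has no effect.
lopsided = [1, 2, 5, 40, 900]
skew_lopsided = {
    name: skewness([math.log(v, b) for v in lopsided]) for name, b in bases.items()
}

print(skew_symmetric)
print(skew_lopsided)
```

Up to floating-point rounding, the three skewness values agree within each data set, so a larger base cannot rein in skewness any more than the natural log does.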