5

I have a highly right-skewed data set with a large range of values (roughly 1 to 10^6); I can't share the actual data for work-related reasons.

When I plot the log of the data instead, the distribution looks a lot more like a normal distribution.

Have I stumbled on a meaningful insight in the data set, or is it just a general property of the log transform that it brings a distribution closer to normal?

  • 2
    I always naively assumed that the log transform works well if your data can be thought of as some constant, times many (more or less) independent factors close to 1. E.g. A guy's salary is 10% above the mean if he has a degree, 5% higher if he's living in a large town, 5% lower if he has health issues... A log transform turns that into a sum of independent small numbers, so you get a normal distribution. – nikie Mar 23 '19 at 12:49
  • @Akaikes See here, here and particularly here & here which indicate that the log-transform won't always make even a right-skewed variate less skew (in absolute terms) than it was. A simple counterexample is the Maxwell(-Boltzmann) distribution, which is mildly right skew but the log of a Maxwell-variate is more strongly (left) skew. – Glen_b Mar 24 '19 at 02:20
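Both comments can be checked with a quick simulation. The sketch below is only illustrative: the base constant, the 50 factors, the ±10% range, the sample sizes and the `skewness` helper are assumptions chosen for the demo, not anything taken from the question's data. It uses the moment-based skewness coefficient and only the Python standard library.

```python
import math
import random


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the run is repeatable


def product_of_factors(n_factors=50):
    """A constant times many independent factors close to 1.

    The constant 1000, the 50 factors and the +/-10% range are
    illustrative assumptions.
    """
    value = 1000.0
    for _ in range(n_factors):
        value *= rng.uniform(0.9, 1.1)
    return value


def maxwell():
    """Maxwell variate: length of a 3-D standard normal vector."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))


# Case 1: multiplicative data. The log turns the product into a sum,
# so the logged values should look far more symmetric than the raw ones.
products = [product_of_factors() for _ in range(20000)]
skew_products = skewness(products)
skew_log_products = skewness([math.log(v) for v in products])

# Case 2: the Maxwell counterexample. The raw variate is mildly right
# skewed; its log should come out more strongly skewed, to the left.
speeds = [maxwell() for _ in range(50000)]
skew_maxwell = skewness(speeds)
skew_log_maxwell = skewness([math.log(s) for s in speeds])

print("products:", skew_products, "log(products):", skew_log_products)
print("maxwell: ", skew_maxwell, "log(maxwell): ", skew_log_maxwell)
```

If the log of your own data looks close to normal, the first case suggests one possible reading (a roughly multiplicative generating process), while the second shows that the log does not produce that shape automatically.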

1 Answer

9

For strictly positive quantities, a log transformation is indeed the standard first transformation to try, and it is very frequently used. It is also used in regression when you want a multiplicative interpretation of the coefficients (e.g. the effect of doubling or halving blood cholesterol).

Of course, it will not always make a distribution more normal. For example, take samples from an N(1000, 1) distribution: they are already normal, so no transformation can make them more normal, and a nonlinear one can only make them less so.
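A small sketch of the N(1000, 1) example, assuming a simulated sample of 50,000 draws and a hand-rolled `skewness` helper (both are choices made for this illustration). Over the narrow range these values occupy, the log is almost a straight line, so the shape barely moves and there is nothing for the transform to improve.

```python
import math
import random


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)  # fixed seed so the run is repeatable

# Already-normal data sitting far from zero (sample size is an assumption).
sample = [rng.gauss(1000.0, 1.0) for _ in range(50000)]
logged = [math.log(x) for x in sample]

skew_raw = skewness(sample)
skew_logged = skewness(logged)

print("skewness before log:", skew_raw)
print("skewness after log: ", skew_logged)
```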

Nick Cox
Björn
  • 7
    Similarly, a distribution that is symmetric or left-skewed will have its skewness made worse by a logarithmic transformation. Consider the not very magnificent seven 1 2 3 4 5 6 7: their square roots are left-skewed, and their logarithms are even more left-skewed. – Nick Cox Mar 23 '19 at 09:04
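The seven-value example is small enough to check directly. This is a minimal sketch using the moment-based skewness coefficient; the `skewness` helper is written out here for illustration and is not from the thread.

```python
import math


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


values = [1, 2, 3, 4, 5, 6, 7]

skew_raw = skewness(values)                           # symmetric data
skew_sqrt = skewness([math.sqrt(v) for v in values])  # square roots
skew_log = skewness([math.log(v) for v in values])    # logarithms

print("raw: ", skew_raw)
print("sqrt:", skew_sqrt)
print("log: ", skew_log)
```

The raw values are symmetric, so their skewness is zero; both transforms compress the upper end more than the lower end, and the log does so more strongly than the square root.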