
Just before I start the question I would like you all to know that I have checked the other threads on taking the log of variables but I still think I have a question that hasn't been touched on yet. I would also like to thank whuber for his lengthy answer to another log question here.

This question specifically relates to one of the reasons why we take logs, namely transforming the distribution of the data. When we take the log of a variable, it is usually because the distribution of the variable is skewed and we want to give it a normal distribution. A common example of this in OLS regressions in economics is a variable denoting wages, income, GDP, etc. However, no one ever seems to mention the central limit theorem (CLT). The CLT says that sum of many random variables will be normally distributed even if their underlying distributions are not normally distributed. If the error is the sum of the random variables $X$ and $Y$, $\epsilon = Y - X\beta$, then surely the error will be normally distributed regardless of the distribution of $X$ and $Y$. If this holds (and the CLT seems to hold under pretty weak conditions), then why would we need to transform the variable?

EconStats
  • Whuber's exemplary answer does mention "making the distribution more symmetric". Not necessarily normal. But even if we did it in an attempt to induce normality, have you ever estimated a sample of infinite dimensions? No. Are you certain that "a few hundred" or "a few thousand" or even "a few tens of thousands of observations" are enough so that the effects of the CLT will emerge? If you are, you shouldn't be. The one condition in the CLT that is not weak at all is the requirement that the sample size goes to infinity. – Alecos Papadopoulos Nov 09 '13 at 02:38
  • OK, well, we'll deal with symmetric then. Why would we want to make the distribution of one of the variables symmetric if the CLT says the shape of the underlying distributions doesn't matter? With regard to $n$, I always thought that the remarkable thing about the CLT was that its effects are apparent even as you move from, say, $n = 5$ to $n = 20$. Obviously we can't estimate over infinite samples, but isn't that a problem for any type of analysis that involves a probability limit? – EconStats Nov 09 '13 at 02:59
  • People take logs in response to observing that the distribution is not normal. If the empirical distribution isn't normal looking, you can't really invoke the CLT to say that it is. As to why this isn't the case, one reason is that the underlying population might be heterogeneous. – quasi Nov 09 '13 at 03:48
  • @EconStats I can see that you indeed have great faith in the CLT taking effect "very quickly". Experience has shown that this is not the case as often as we would all like, especially when we move to more sophisticated estimators. – Alecos Papadopoulos Nov 09 '13 at 06:27
  • "The CLT says that sum of many random variables will be normally distributed even if their underlying distributions are not normally distributed." --- well, not quite. It's not actually a theorem (or rather a collection of theorems) about sums, but about standardized averages (such as $\frac{\bar x-\mu}{\sigma/\sqrt n}$), and it doesn't apply to every possible distribution. – Glen_b Nov 09 '13 at 09:05
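
The point about standardized averages and finite samples can be checked numerically. What follows is a minimal sketch, not something posted in the thread: it assumes a lognormal(0, 1) parent distribution, 5000 replications, and a hand-rolled `skewness` helper, all chosen here for illustration. It measures how skewed the sampling distribution of $\frac{\bar x-\mu}{\sigma/\sqrt n}$ still is at a few sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

# Lognormal(0, 1) parent (an assumption): strongly right-skewed, all moments finite.
mu = np.exp(0.5)                       # population mean
sigma = np.sqrt((np.e - 1.0) * np.e)   # population standard deviation

reps = 5000                            # number of simulated samples per n
skew_by_n = {}
for n in (5, 20, 1000):
    draws = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n))
    # Standardized average for each simulated sample of size n.
    z = (draws.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    skew_by_n[n] = skewness(z)
    print(f"n = {n:4d}: skewness of standardized mean = {skew_by_n[n]:.2f}")
```

For i.i.d. draws the skewness of the mean is the population skewness divided by $\sqrt n$, which for this parent is roughly $6.2/\sqrt n$. The approach to normality is therefore gradual: going from $n = 5$ to $n = 20$ only halves the skewness, and some asymmetry is still present at $n = 1000$.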

2 Answers


You might find this display interesting:

These are residuals from a linear regression with ten x-variables (IVs), a skewed error distribution (but one with all moments finite, to which the CLT definitely applies!), and 1000 observations (i.e. the data was simulated).

It's a normal qqplot, which if the residuals are close to normal should look reasonably close to a straight line.

[Figure: normal Q-Q plot for a skewed error distribution]

Clearly, it's not remotely normal looking! The residuals are still pretty skewed.

Okay, maybe I didn't have enough variables. Here's one for 100 x-variables:

[Figure: lognormal Q-Q plot, p=100]

The plot is very similar - and still very skew.

So with n=1000 and p=100, we're not seeing anything like what you say we should be seeing.
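
The simulation code was not included in the post, so the following is only a rough sketch of the kind of experiment described, under assumptions made here: standard-normal regressors, unit coefficients, and centered lognormal(0, 1) errors. Instead of drawing a Q-Q plot it reports the skewness of the OLS residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 10   # observations and x-variables, as in the first display

# Design matrix: intercept plus p standard-normal regressors (an assumption).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.ones(p + 1)   # arbitrary true coefficients (an assumption)

# Centered lognormal errors: skewed, mean zero, all moments finite.
eps = rng.lognormal(mean=0.0, sigma=1.0, size=n) - np.exp(0.5)
y = X @ beta + eps

# Ordinary least squares fit and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Sample skewness of the residuals (zero for a symmetric distribution).
d = resid - resid.mean()
resid_skew = (d**3).mean() / (d**2).mean() ** 1.5
print(f"skewness of residuals: {resid_skew:.2f}")
```

Each residual is essentially one draw from the error distribution, not an average of many draws, so adding observations or regressors does not pull the residuals toward normality. Changing `p` to 100 gives the analogue of the second display.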

Glen_b
  • You're right, I meant to say that the sum will tend to be normally distributed, so I was sure (as you have detailed) there will be situations when it won't work. To bring this back to the original question, if you had taken the logs of those variables, would that have made a difference to the Q-Q plot? Or are these situations where the CLT fails so dire that no transformation can save them? – EconStats Nov 11 '13 at 19:41
  • I'm puzzled; this question seems to contradict the sense of the original, which was 'why transform when we have the CLT?'. Note that my example isn't some extreme case - I deliberately chose one where the CLT applies. That's what you get when the CLT works. You seem to be currently asking "well if we transform, does the CLT do what I thought?". With continuous variables, monotonic transformation can give you normality directly, without relying on the CLT; for $Y|X \sim F_{Y|X}(y)$, $Y^* = \Phi^{-1}[F_{Y|X}(Y)]$ will be normal. ... (ctd) – Glen_b Nov 11 '13 at 21:38
  • (ctd)... without any averaging at all. On the other hand, if you have discrete distributions, it's possible to have situations where no transformation will give you even approximate normality. – Glen_b Nov 11 '13 at 21:41
  • So my original answer responds to the question you asked, fairly clearly in the negative: there are non-extreme cases to which the CLT applies, yet where large samples, or even large samples combined with large numbers of variables, don't do what you asked about. The answer to this new question is ... if we can use transformation, then we don't need the CLT. The big problem is that if you're going to assume linearity and homoskedasticity and additive errors for $Y$ on the original scale (you were going to fit a regression after all), none of those things will be true after you transform. – Glen_b Nov 11 '13 at 21:48
  • There are numerous other alternatives to transformation, of course (one might, for example, fit a GLM). – Glen_b Nov 11 '13 at 21:49
  • Ah OK, so the original answer referred to a situation where even with a large $n$, large $K$ and $e$ with a defined mean and variance (a setup where the CLT should work), the CLT didn't do what I said it should? And for the final point, are you saying that if we had an OLS regression where the residuals were not normally distributed, transforming the $Y$ variable (via log etc.) wouldn't make a difference, since we would have to use a different method (GLM) anyway? – EconStats Nov 12 '13 at 00:26
  • Pretty much - it would make a difference to the distribution (without needing CLT), but if regression were appropriate on the original scale, the transformation would break several of the other assumptions, yes. In practice any of the assumptions hold only approximately, but when transforming you have to consider the impact on linearity, conditional variance and the shape of the conditional distribution. It's often the case that none of them are satisfied and something that improves one may sometimes also improve (though often not quite fix) another. – Glen_b Nov 12 '13 at 00:31
  • Sometimes you might help one or two of them (say distributional shape and variance both get somewhat better), but at the expense of the third (making the relationship further from a straight line) – Glen_b Nov 12 '13 at 00:33
  • Thanks for that, Glen_b, you have no idea how much that cleared up what I was thinking. If it doesn't matter that the unconditional distribution of $Y$ is non-normal and we only need the conditional distribution of $Y$ to be normal, why do you think people so often take the log of a dependent variable the second they see it is slightly skewed, without even checking whether it could be perfectly normally distributed conditionally? – EconStats Nov 12 '13 at 01:08
  • I see it all the time ... I think sometimes it happens because people don't know (or don't fully absorb) the distinction. I've just been teaching from a text (not my choice of text) which described the distinction at length, and then went ahead later in the book and committed exactly the error it cautioned against in the early chapters, and did it over and over. [However, if you understand well how the $x$'s are distributed, you can sometimes get some idea of the conditional distribution even from an unconditional display, so it's not automatically a mistake -- but it very often is.] – Glen_b Nov 12 '13 at 02:51
  • I think that we should treat it as a mistake even when it isn't, because it's so often misunderstood; it generates many questions here from people who transform things that are fine - and they often try to transform their $x$'s to normality too. For some discussion of the distinction, see here or the second example here. – Glen_b Nov 12 '13 at 02:56
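
The transformation $Y^* = \Phi^{-1}[F_{Y|X}(Y)]$ mentioned in the comments above can be illustrated with a small sketch. The setup is an assumption made here for simplicity: there are no regressors, $Y$ is exponential with rate 1 so that $F$ is known exactly, and the `skewness` and `exp_cdf` helpers are hypothetical names. Only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
rate = 1.0
# Exponential draws: continuous and strongly right-skewed (an assumed example).
y = [random.expovariate(rate) for _ in range(20000)]

def exp_cdf(v, lam=rate):
    """CDF of the exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * v)

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
tiny = 1e-12               # keep probabilities strictly inside (0, 1) for inv_cdf

# Probability integral transform followed by the normal quantile function.
y_star = [std_normal.inv_cdf(min(max(exp_cdf(v), tiny), 1.0 - tiny)) for v in y]

print(f"skewness before: {skewness(y):.2f}")
print(f"skewness after:  {skewness(y_star):.2f}")
```

No averaging is involved, which is why the CLT is not needed for this route to normality. The caveat from the comments still applies: if linearity, constant variance, and additive errors held on the original scale, they will generally not hold for the transformed variable.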

Per your comment, the Lindeberg-Feller CLT requires independence (but not identical distributions), along with finite means and variances. Are you sure that the "Y can't be [independent] by definition but this is the case for all regressions" part doesn't kill your argument? Just because it's true by definition doesn't mean that it's not true (or applicable).

Wayne
  • Are you sure it's not applicable? This is a quote from Greene's Econometric Analysis: "Since nearly all the estimators we construct in econometrics fall under the purview of the CLT, it is obviously an important result." As for iid, I know that for most econometric models it is the LF CLT which provides the definition I gave above. So because that works for unequal variances, I don't know if they have to be identically distributed. They do have to be independent, which for OLS the $X$s have to be (the $Y$ can't be by definition, but this is the case for all regressions). – EconStats Nov 09 '13 at 02:49
  • You're right, and I'll edit my answer. – Wayne Nov 09 '13 at 03:17
  • They don't have to be independent. CLTs exist for dependent sequences, like martingales and mixing processes. (cc @EconStats) – Alecos Papadopoulos Nov 09 '13 at 06:30
  • @AlecosPapadopoulos: Yes, but you can't mix-n-match requirements. EconStats is appealing to the Lindeberg-Feller CLT, which does require independence. – Wayne Nov 11 '13 at 21:56
  • I didn't see a reference to L-F CLT in the OP's question. – Alecos Papadopoulos Nov 11 '13 at 22:32
  • @AlecosPapadopoulos: It's not in the original posting, but in the OP's comment, the first one in this answer. – Wayne Nov 11 '13 at 23:26
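
For reference, the Lindeberg-Feller result debated in these comments can be stated as follows. This is the standard textbook formulation, with notation chosen here and not taken from the thread. Let $X_1, \dots, X_n$ be independent (not necessarily identically distributed) with finite means $\mu_i$ and variances $\sigma_i^2$, and let $s_n^2 = \sum_{i=1}^n \sigma_i^2$. If the Lindeberg condition holds, that is, for every $\varepsilon > 0$

$$
\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{i=1}^n \operatorname{E}\!\left[(X_i - \mu_i)^2 \, \mathbf{1}\{|X_i - \mu_i| > \varepsilon s_n\}\right] = 0,
$$

then

$$
\frac{1}{s_n} \sum_{i=1}^n (X_i - \mu_i) \xrightarrow{d} N(0, 1).
$$

Independence is part of the hypothesis here, and the conclusion is a limit as $n \to \infty$; it says nothing about how close to normal the standardized sum is at any particular finite $n$.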