
I'm trying to understand why there tends to be correlation (as measured by the standard Pearson correlation coefficient) between $x$ and $x^2$ (for instance if $x$ is uniformly distributed).

It's my understanding that the Pearson correlation coefficient only measures linear relationships. $x$ and $x^2$ are not linearly related.

user25308
  • Have you ever noticed that $x^2$ actually has a linear part, depending on how you look at it? After all, $x^2 = (x+1)^2 - (2x + 1)$, for instance. If neither $x^2$ is to be "linearly related" to $x$ nor $(x+1)^2$ is to be "linearly related" to $x+1$ (which implies $(x+1)^2$ must be "linearly related" to $x$), then it seems like you're in a lot of trouble, for then $x$ could not be "linearly related" to $2x+1$, could it? – whuber May 06 '13 at 18:44
  • If I simulate some data using R and calculate the correlation coefficient between x and x^2, I tend to get high values (close to 1). This is also the case if I use, say, the natural numbers from 1 to 100 and their squares. Interestingly enough, it seems to depend on the distribution of x. If x is uniformly distributed, I tend to get very high correlation. If x is normally distributed, it tends to be close to 0. Why is that? – user25308 May 06 '13 at 18:45
  • Simulate some data with $x$ symmetrically spaced around $0$ and try again :-). – whuber May 06 '13 at 18:50
  • That may be true for a uniform bounded by 0 and 1, but not for any uniform and not for standard normal variables. It tends to depend on the formula for the covariance between the two variables. – John May 06 '13 at 18:52
  • Interestingly enough, I now get very small values. Is there a common element across distributions that determines whether x and x^2 are highly correlated? – user25308 May 06 '13 at 18:53
  • If you look at the scatterplot, or use the equation, then knowing the value of X tells you exactly the value of X^2. But the correlation (or R^2) is less than one, so the linear prediction gives you some information, but not perfect information, about the relationship. – Jeremy Miles May 06 '13 at 18:53
  • If you plot $x^2$ versus $x$ you obtain, of course, a portion of a parabola. When the distribution of $x$ is away from zero, the correlation is measuring the linearity of one arm of that parabola: it looks more and more linear the further from zero you get compared to the range of the values. When the distribution of $x$ is symmetric around zero, the parabola is clearly curved. That's all that's going on. – whuber May 06 '13 at 18:57
  • Just reinforcing some ideas others (@whuber) have mentioned: `cor((-10:10), (-10:10)^2)` gives a correlation of 0. – rbatt May 06 '13 at 19:11

1 Answer


The Pearson correlation measures the strength of the linear relationship; it doesn't ignore variables whose relationship is not perfectly linear. If two variables increase and decrease together, some portion of their relationship is explainable as a linear relationship (and some of it isn't).

For example, if $X$ is positive, then both $X$ and $X^2$ will increase or decrease together, and so be somewhat positively correlated. On the other hand, if $X$ is negative, then $X^2$ will increase as $X$ decreases (becomes more negative), and so the two will be negatively correlated.
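
As a worked example of the positive case (assuming $X \sim U(0,1)$, which is my choice for illustration and not necessarily the distribution used elsewhere in this answer), the moments are $E[X^k] = 1/(k+1)$, so

$$\operatorname{Cov}(X, X^2) = E[X^3] - E[X]\,E[X^2] = \tfrac{1}{4} - \tfrac{1}{2}\cdot\tfrac{1}{3} = \tfrac{1}{12},$$

$$\operatorname{Var}(X) = \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}, \qquad \operatorname{Var}(X^2) = \tfrac{1}{5} - \tfrac{1}{9} = \tfrac{4}{45},$$

$$\rho = \frac{1/12}{\sqrt{(1/12)(4/45)}} = \frac{\sqrt{15}}{4} \approx 0.968.$$

So the correlation is high but below 1, even though $X^2$ is an exact function of $X$.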

Here's a case where the population mean of $X$ is large compared to its spread, and so $X$ and $X^2$ have a high Pearson correlation:

[Figure: plot of $X^2$ vs $X$, showing high positive correlation]

In this case the population correlation is about 0.99867 and the sample correlation was about 0.99868.
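
The distribution behind that plot isn't stated, so the following is only a rough sketch under an assumed setup: an evenly spaced grid of 21 integer points whose centre is moved further from zero while the spread stays fixed. The helper names `pearson` and `corr_with_square` are made up for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corr_with_square(center, half_width=10):
    """Correlation between x and x^2 on an integer grid around `center`.

    The grid (center - half_width, ..., center + half_width) is an
    assumption made for illustration, not the data from the plot.
    """
    xs = [center + i for i in range(-half_width, half_width + 1)]
    return pearson(xs, [x * x for x in xs])


# Same spread each time; only the distance of the centre from zero changes.
for center in (10, 20, 50, 100):
    print(center, round(corr_with_square(center), 5))
```

The printed correlations should rise toward 1 as the centre moves away from zero, since the sampled piece of the parabola looks more and more like a straight line.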

If $X$ is both positive and negative then there are parts where $X^2$ increases as $X$ increases and parts where $X^2$ decreases as $X$ increases. This may result in an overall positive, negative or zero correlation (depending on the extent to which they cancel out).
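
The cancellation can be made precise with a short derivation, assuming $X$ has a finite fourth moment (so that the correlation with $X^2$ is defined):

$$\operatorname{Cov}(X, X^2) = E[X \cdot X^2] - E[X]\,E[X^2] = E[X^3] - E[X]\,E[X^2].$$

If the distribution of $X$ is symmetric about zero, then $E[X] = 0$ and $E[X^3] = 0$, so the covariance, and hence the Pearson correlation, is exactly zero. That is the situation in the `cor((-10:10), (-10:10)^2)` example from the comments. In general, the sign of the correlation is the sign of $E[X^3] - E[X]\,E[X^2]$.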

Glen_b