5

Disclaimer: I'm not a mathematician or a statistician, but I'm studying stats now and I have only a college algebra background.

I'm truly impressed by how the correlation coefficient formula generates values between -1 and 1 and am wondering how the equation was derived. How did someone sit down and say "I only want values between -1 and 1" and come up with this equation (or what train of mathematical logic would make the final form evident)?

I understand each individual component of the equation, and I have an instinct that the answer lies within the order of operations between the terms of the numerator and the denominator (e.g. multiplication of terms happening before their summation in the numerator) but am truly stumped.

User1865345
  • 8,202
A L
  • 153
  • 4
    If you're looking for historical background, see the website Earliest Known Uses of Some of the Words of Mathematics (C), and search for correlation, it gives a brief history with references. While Galton originates the term 'co-relation' and measures it, and while Edgeworth subsequently invented the term correlation coefficient, the usual product-moment form comes from Pearson in 1896; however it's not always immediately obvious what's going on in old papers. – Glen_b Dec 26 '23 at 06:41
  • 4
    At https://stats.stackexchange.com/a/513608/919 I invoke five basic principles to derive and justify the Pearson correlation coefficient; and at https://stats.stackexchange.com/a/180809/919 I show that this correlation coefficient has unique properties concerning how one quantifies "linearity." Neither follows the historical development, but both answer your question about "what train of mathematical thought." Pearson was no mathematician, however, and followed a different set of ideas related to ordinary least squares regression. – whuber Dec 26 '23 at 15:27
  • @Glen_b - That book link is awesome man, ty!! – A L Dec 27 '23 at 06:12
  • @whuber - These two links are thick reads for me with my current level of math knowledge, but I've saved them for when I have more time to delve into them - tysm! – A L Dec 27 '23 at 06:13

2 Answers

13

To keep things as simple as possible, assume that $\mathbf{x} = (1,2)$ and $\mathbf{y} = (3,-5)$. Think of $\mathbf{x},\mathbf{y}$ as "vectors", i.e. as arrows that start at the origin $(0,0)$ and end at those points.

Here is a picture:

[Figure: the vectors $\mathbf{x}$ and $\mathbf{y}$ drawn as arrows from the origin $(0,0)$.]

How do we find the angle $\theta$ between these vectors? There is a formula in geometry that tells us how to do so: $$ \cos \theta = \frac{ \mathbf{x}\cdot \mathbf{y} }{ ||\mathbf{x}|| ~ ~ ||\mathbf{y} || } $$

Now we need to explain each piece of the right-hand-side. Let us begin with $||\mathbf{x}||$. This is called the (Euclidean) norm of $\mathbf{x}$. This represents the length of the vector. The way you find it is by basically using the Pythagorean theorem, in this case, $$ ||\mathbf{x}|| = \sqrt{ 1^2 + 2^2 } = \sqrt{5} $$ In a similar way, $$ ||\mathbf{y}|| = \sqrt{ 3^2 + (-5)^2 } = \sqrt{ 34 } $$

But how do we find $\mathbf{x}\cdot \mathbf{y}$? This is the "dot product". Here is how you calculate it: $$ \mathbf{x}\cdot \mathbf{y} = (1)(3) + (2)(-5) = 3 - 10 = -7 $$ Since $||\mathbf{x}|| ~ ||\mathbf{y}|| = \sqrt{5} ~ \sqrt{34} = \sqrt{170}$, this means that $$ \cos \theta = \frac{-7}{\sqrt{170}} $$
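As a quick check of the arithmetic above, here is a minimal Python sketch. The helper names `dot`, `norm`, and `cos_angle` are made up for this illustration and are not from any library.

```python
import math

def dot(u, v):
    """Dot product: multiply matching components, then add them up."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector, via the Pythagorean theorem."""
    return math.sqrt(dot(u, u))

def cos_angle(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return dot(u, v) / (norm(u) * norm(v))

x = (1, 2)
y = (3, -5)

print(dot(x, y))        # -7
print(cos_angle(x, y))  # -7 / sqrt(170), roughly -0.537
```

The result is negative because the two arrows point more than 90 degrees apart.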


In what follows you do not really care about the angle $\theta$ itself but about $\cos \theta$. Maybe you remember from high school that $\cos \theta$ is always a number between $-1$ and $1$.


Now suppose we have two sets of data $\mathbf{x} = (x_1,x_2,...,x_n)$ and $\mathbf{y} = (y_1,y_2,...,y_n)$. Let us say, for simplicity, that the averages are $\overline{\mathbf{x}} = 0$ and $\overline{\mathbf{y}} = 0$. As before, let us calculate $\cos \theta$ for these two data sets, where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. We would get, $$ \cos \theta = \frac{\mathbf{x}\cdot \mathbf{y}}{||\mathbf{x}|| ~~ ||\mathbf{y}||} = \frac{x_1y_1 + x_2y_2 + ... + x_ny_n}{\sqrt{x_1^2 + ... + x_n^2} ~ ~ \sqrt{y_1^2 + ... + y_n^2}} $$

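Now make the profound observation that the numerator: $$ x_1y_1 + ... + x_ny_n $$ is, apart from a factor of $1/n$, exactly how you compute the covariance $\text{Cov}(\mathbf{x},\mathbf{y})$ when both averages are zero! And each factor in the denominator, such as $\sqrt{x_1^2 + ... + x_n^2}$, is, apart from a factor of $1/\sqrt{n}$, exactly how you compute the standard deviation $\text{Std}(\mathbf{x})$! The factors of $n$ cancel in the fraction, so $\cos \theta$ equals $\text{Cov}(\mathbf{x},\mathbf{y})$ divided by $\text{Std}(\mathbf{x}) ~ \text{Std}(\mathbf{y})$, which is the correlation coefficient. (If the averages are not zero, subtract the average from each data point first.)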
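To see how these sums relate to the covariance and the standard deviations, here is the algebra written out. It assumes averages of zero, as above, and the convention of dividing by $n$ (the $n-1$ convention works the same way). Strictly speaking, the sums differ from $\text{Cov}$ and $\text{Std}$ by factors of $n$, but those factors cancel:

$$ \frac{\text{Cov}(\mathbf{x},\mathbf{y})}{\text{Std}(\mathbf{x}) ~ \text{Std}(\mathbf{y})} = \frac{\frac{1}{n}(x_1y_1 + ... + x_ny_n)}{\sqrt{\frac{1}{n}(x_1^2 + ... + x_n^2)} ~ \sqrt{\frac{1}{n}(y_1^2 + ... + y_n^2)}} = \frac{x_1y_1 + ... + x_ny_n}{\sqrt{x_1^2 + ... + x_n^2} ~ \sqrt{y_1^2 + ... + y_n^2}} = \cos \theta $$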

Therefore, you see that the correlation between $\mathbf{x}$ and $\mathbf{y}$ is a number between $-1$ and $+1$, because it is the cosine of an angle. Furthermore, it is equal to $+1$ exactly when the two arrows (vectors) point in exactly the same direction, i.e. $\mathbf{y}$ is a positive multiple of $\mathbf{x}$ and the data points lie on a perfectly straight line with positive slope.
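The whole argument can be tried out numerically. The sketch below is only an illustration: the data are made up, and the function names `center`, `cos_angle`, and `correlation` are hypothetical helpers rather than a library API. It subtracts the average from each data set so that both average to zero, then takes the cosine of the angle between the resulting vectors.

```python
import math

def center(values):
    """Subtract the average so the data set averages to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cos_angle(u, v):
    """Cosine of the angle between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def correlation(xs, ys):
    """Correlation computed as the cosine between the centered data sets."""
    return cos_angle(center(xs), center(ys))

xs = [1.0, 2.0, 3.0, 4.0]                 # made-up illustrative data
rising_line = [2 * v + 5 for v in xs]     # exact line, positive slope
falling_line = [-3 * v + 1 for v in xs]   # exact line, negative slope
scattered = [2.0, 1.0, 4.0, 3.0]          # not on a line

print(correlation(xs, rising_line))   # 1, up to floating-point rounding
print(correlation(xs, falling_line))  # -1, up to floating-point rounding
print(correlation(xs, scattered))     # about 0.6
```

An exact line with positive slope gives arrows pointing the same way, an exact line with negative slope gives arrows pointing in opposite directions, and anything else lands strictly in between.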

  • 2
    (+1) for the Pythagorean theorem. – Xi'an Dec 26 '23 at 08:59
  • 1
    One elaboration of this POV which I'm fond of is this 1937 article by de Finetti: http://www.brunodefinetti.it/Opere/AboutCorrelations.pdf. – Semiclassical Dec 26 '23 at 14:51
  • @Semiclassical - Tysm for this link! These are great resources for my future self and are much appreciated! – A L Dec 27 '23 at 06:15
  • @Nicolas Bourbaki - this is incredible man... to think that there's a geometric underpinning .... I guess then my real question becomes 'What is the origin of the creation of Cos0' and for me to figure out more about that. That being said - this answers my question and I REALLY appreciate you factoring in my knowledge level to make this digestible for me. I think that really speaks to your understanding and I can't thank you enough for that man. Makes me wonder what other geometric principles are underpinning other stats calculations! Exciting! Ty again. – A L Dec 27 '23 at 06:19
  • 1
    @AL If you are interested in understanding where $\cos \theta$ comes from then that changes the question into a new question. The quick answer is, "the law of cosines", something that you might have seen in high school geometry. If you are further interested, you can look at "law of cosines and the dot product". It will show you the connection between multiplying vectors and the cosine of the angle. – Nicolas Bourbaki Dec 27 '23 at 18:20
  • @NicolasBourbaki - ty again my friend - I will do this!! I can't stress enough the worth to me in your gift of this knowledge! Take care NB!! -AL – A L Dec 28 '23 at 04:11
0

Yes, it is similar to what a percentage does when two values are compared. The fact that the correlation coefficient is always between -1 and 1 is a result of the normalization involved in its calculation. This normalization is achieved by dividing the covariance of the two variables by the product of their standard deviations. This division is what normalizes and bounds the correlation coefficient.

Standardizing or scaling quantities always helps in explaining numbers.
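A small sketch of what the normalization buys you, using made-up height and weight numbers and hypothetical helper names. It assumes the convention of dividing by n. Changing the units of one variable changes the covariance, but the correlation stays the same because the standard deviation in the denominator changes by the same factor.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Covariance, dividing by n (an assumed convention for this sketch)."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)

def std(xs):
    """Standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs))

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std(xs) * std(ys))

heights_m = [1.50, 1.60, 1.70, 1.80]        # made-up data, in meters
weights_kg = [50.0, 65.0, 60.0, 80.0]       # made-up data, in kilograms
heights_cm = [100 * h for h in heights_m]   # same heights, in centimeters

# The covariance depends on the units: it grows by a factor of 100.
print(covariance(heights_m, weights_kg))
print(covariance(heights_cm, weights_kg))

# The correlation does not: both lines print the same value, up to rounding.
print(correlation(heights_m, weights_kg))
print(correlation(heights_cm, weights_kg))
```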

letdatado
  • 325