
I am learning about the idea of correlation in statistics, and I came across the following

Statement: the best fit line of bivariate normal data passes through the extrema of the level sets. That is, if $(X,Y)\sim \mathcal{N}(0,\Sigma)$, then the best fit line $l$ passes through the extrema of the level sets of $f$, the PDF of $(X,Y)$.

For example, in the following picture, if the red line is a level set of $f$, then $l$ would pass through the two (black) labeled points.

The best fit line $l$ is defined to minimise the sum of the squares of the horizontal distances of the data points to $l$. Suppose $l$ is fitted to a large number of data points that are i.i.d. copies of $(X,Y)$.

The PDF is $f(x,y) = \frac{1}{2\pi\sqrt{|\Sigma|}}\exp\left(-\frac12(x\ y)\,\Sigma^{-1}\binom{x}{y}\right)$ and the marginal is (e.g.) $f_X(x) =\int_{-\infty}^{\infty}f(x,y)\,dy$.
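For concreteness, here is a small numerical sketch of this density (the covariance matrix below is an assumed example, not from the question): the exponent uses $\Sigma^{-1}$, and numerically integrating out $y$ recovers the univariate $\mathcal{N}(0,\Sigma_{11})$ marginal of $X$.

```python
import numpy as np

# Example covariance (assumed for illustration: unit variances, correlation 0.6)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
Sinv = np.linalg.inv(Sigma)
det = np.linalg.det(Sigma)

def pdf(x, y):
    """Density of N(0, Sigma) at (x, y), via the inverse covariance."""
    q = Sinv[0, 0]*x*x + 2*Sinv[0, 1]*x*y + Sinv[1, 1]*y*y
    return np.exp(-0.5*q) / (2*np.pi*np.sqrt(det))

# Integrate out y on a fine grid; compare with the closed-form marginal of X
ys = np.linspace(-10, 10, 4001)
x0 = 0.7
marginal = pdf(x0, ys).sum() * (ys[1] - ys[0])
expected = np.exp(-0.5*x0**2/Sigma[0, 0]) / np.sqrt(2*np.pi*Sigma[0, 0])
print(marginal, expected)  # the two values agree
```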

Problem: How can the statement be proved? I know $l$ can be found with calculus, but how does that relate to the extrema of the level sets? Thanks.

[figure: a level set of $f$ (red) with the two labeled extreme points (black)]

  • What is the equation for a multivariate Gaussian with that correlation coefficient and corresponding margins? – Dave Sep 24 '22 at 17:56
  • @Dave edited to answer your question – Benjamin Wang Sep 24 '22 at 19:05
  • A demonstration is found by examining the illustration in my post at https://stats.stackexchange.com/a/71303/919 following "the key idea." Please note that your picture is not the correct one: the least squares line passes through the reflections of those points around the major axis of the ellipse. – whuber Sep 24 '22 at 19:17
  • Thanks @whuber . Your post looks very beautiful. I’ll read it in a few hours. – Benjamin Wang Sep 25 '22 at 00:25

1 Answer


As explained at https://stats.stackexchange.com/a/71303/919, this level set is the image of an ellipse that has been skewed upwards. Here is part of the original ellipse, with vertical arrows indicating the effects of the skewing operation at its extrema:

[figure: part of the original ellipse, with vertical arrows at its extrema showing the effect of the skew]

This skewing operation lifts (or lowers) each point $(x,y),$ in coordinates originating at the center of the ellipse, by an amount directly proportional to the value of $x,$ arriving at the point $(x, y + \beta x)$ when the constant of proportionality is $\beta$ (the least-squares slope). Thus, every vertical line is mapped to itself.
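This shear can be checked numerically. The sketch below (Python, with an assumed example covariance) uses the decomposition $Y = \beta X + \varepsilon$ with $\varepsilon$ independent of $X$: it traces the axis-aligned ellipse that is a level set for the independent pair $(X,\varepsilon)$, applies the shear $(x,y)\mapsto(x,\,y+\beta x)$, and confirms the image lies on a single level set of the correlated density.

```python
import numpy as np

# Example covariance (assumed for illustration: unit variances, correlation 0.6)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
beta = Sigma[0, 1] / Sigma[0, 0]                    # least-squares slope of y on x
var_eps = Sigma[1, 1] - Sigma[0, 1]**2 / Sigma[0, 0]  # residual variance of eps

# Axis-aligned ellipse: a level set for the independent pair (X, eps)
t = np.linspace(0, 2*np.pi, 1000)
x = np.sqrt(Sigma[0, 0]) * np.cos(t)
y = np.sqrt(var_eps) * np.sin(t)

# Shear each point vertically by beta * x
xs, ys = x, y + beta*x

# Evaluate the quadratic form of Sigma^{-1} on the sheared points
Sinv = np.linalg.inv(Sigma)
q = Sinv[0, 0]*xs**2 + 2*Sinv[0, 1]*xs*ys + Sinv[1, 1]*ys**2
print(q.min(), q.max())  # both equal 1 (up to rounding): a level set maps to a level set
```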

Moreover, the original regression line--the horizontal axis of this ellipse consisting of the points with coordinates $(x,0)$--is skewed (not rotated!) into the regression line for the image of this ellipse consisting of the points $(x,\beta x),$ thus:

[figure: the skewed ellipse with its regression line and the dotted vertical tangent lines]

The dotted vertical lines are the tangents at the original left and right vertices of the ellipse. Clearly they remain tangent to the image of the ellipse and, because they originally touched the ellipse on its horizontal axis, where the original regression line crosses, they still touch the skewed ellipse where the new regression line crosses it, QED.

Although the effect is subtle in this case, notice that the points where the regression line intersects the ellipse are no longer at its vertices.
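The same conclusion can be read off the algebra directly. On a level set $\{q(x,y)=c\}$ of the quadratic form $q$ built from $\Sigma^{-1}$, the tangent is vertical exactly where $\partial q/\partial y = 0$, and the sketch below (Python, assumed example covariance) checks that the slope of the line through those tangency points equals the least-squares slope $\beta = \Sigma_{12}/\Sigma_{11}$.

```python
import numpy as np

# Example covariance (assumed for illustration: unit variances, correlation 0.6)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
Sinv = np.linalg.inv(Sigma)

# Vertical tangents of the level set q(x, y) = c occur where dq/dy = 0:
#   Sinv[1,0]*x + Sinv[1,1]*y = 0   =>   y = -(Sinv[1,0]/Sinv[1,1]) * x
tangency_slope = -Sinv[1, 0] / Sinv[1, 1]

# Least-squares slope of y on x
beta = Sigma[0, 1] / Sigma[0, 0]

print(tangency_slope, beta)  # equal: the regression line passes through the extreme points
```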

whuber