We know that for a sample of $n$ paired observations $(x_i, y_i)$ with sample standard deviations $s_x$ and $s_y$, the sample correlation coefficient is $$R = \frac1{n-1}\sum_{i=1}^n\left(\frac{x_i-\overline{x}}{s_x}\right)\left(\frac{y_i-\overline{y}}{s_y}\right)$$

Say we add a data point $(\overline{x}, \overline{y})$ to the sample, which lies on the linear regression trendline of the sample (?).

We can mathematically see that this actually decreases the $R$ value (the term of the sum for this data point is $0$, but $n$ increases by $1$, so the denominator increases). However, I cannot intuitively understand why.

Is there an intuitive explanation for this?

Thanks!

Max0815
  • My answer here might be helpful, even if this is not quite a duplicate. (My hope is that you will be able to read that answer and then figure out enough to be able to post a self-answer!) – Dave Apr 13 '23 at 16:27
  • I don't "mathematically see" this at all, because including that point shrinks both $s_x$ and $s_y,$ which could more than compensate for the change in $n.$ In fact, this exactly compensates for the change. (I understand you really mean to ask about $|R|$ rather than $R$ itself, for otherwise the claim--if it were true for positive $R$--could easily be disproven by choosing very negative $R.$) – whuber Apr 13 '23 at 16:38

1 Answer

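$R$ is the ordinary least squares (OLS) estimate of the slope of the regression of the standardized $y_i$ on the standardized $x_i;$ equivalently, $R = \hat\beta\, s_x/s_y$ where $\hat\beta$ is the OLS slope of the $y_i$ on the $x_i.$ Because the regression line must go through the point of averages $(\bar x, \bar y),$ adjoining any number of instances of that point to the dataset leaves the means unchanged and adds only zero residuals, so it will not change the sum of squares and therefore will not change the OLS solution $\hat\beta.$ Adjoining $k$ such points does shrink $s_x$ and $s_y,$ but both by the same factor $\sqrt{(n-1)/(n+k-1)},$ so the ratio $s_x/s_y$ is unchanged, whence $R$ will be unchanged.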
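
For anyone who wants to check this numerically, here is a minimal sketch in Python with NumPy. The helper name `sample_r`, the simulated data, and the seed are illustrative assumptions, not part of the question; the function just evaluates the formula for $R$ given in the question.

```python
import numpy as np

def sample_r(x, y):
    """Sample correlation via the formula in the question (n - 1 denominators)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # standardized x
    zy = (y - y.mean()) / y.std(ddof=1)  # standardized y
    return float(np.sum(zx * zy) / (n - 1))

# Hypothetical data: any non-degenerate sample would do.
rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

# Adjoin the point of averages (x-bar, y-bar).
x_aug = np.append(x, x.mean())
y_aug = np.append(y, y.mean())

r_before = sample_r(x, y)
r_after = sample_r(x_aug, y_aug)

# Both standard deviations shrink by the same factor sqrt((n - 1) / n).
sd_ratio_x = x_aug.std(ddof=1) / x.std(ddof=1)
sd_ratio_y = y_aug.std(ddof=1) / y.std(ddof=1)

print(r_before, r_after)       # the two values agree up to rounding error
print(sd_ratio_x, sd_ratio_y)  # both equal sqrt((n - 1) / n) up to rounding
```

The two correlations should agree to floating-point precision, while both standard deviations shrink by the common factor $\sqrt{(n-1)/n},$ which is exactly what offsets the larger denominator.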

whuber