
The definition of covariance is: $cov(X,Y)=E [(X- \overline{X})(Y- \overline{Y})]$

However, it seems that we can also calculate it without subtracting the mean of the second variable: $cov(X,Y)=E [(X- \overline{X})(Y)]$

Is this correct? My argument for discrete variables comes from comparing the two sums:

I. Sum number 1: $$\sum[(X-\overline{X})\times (Y-\overline{Y})]=\sum(XY-\overline{Y}X-\overline{X}Y+\overline{X}\overline{Y})=\sum(XY)-\overline{Y}n\overline{X}-\overline{X}n\overline{Y}+n\overline{X}\overline{Y}=\sum(XY)-2n\overline{X}\overline{Y}+ n\overline{X}\overline{Y}=\sum(XY)- n\overline{X}\overline{Y}$$

II. Sum number 2: $$\sum[(X-\overline{X})\times Y]= \sum(XY)-\overline{X}n\overline{Y}$$

Conclusion: sum number 1 = sum number 2, so the covariance can be calculated either way.
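A quick numerical check of the identity (a sketch in Python with NumPy; the data here is arbitrary simulated noise, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

# Sum number 1: both variables centered
sum1 = np.sum((x - x.mean()) * (y - y.mean()))
# Sum number 2: only X centered
sum2 = np.sum((x - x.mean()) * y)

# The two sums agree up to floating-point error
assert np.isclose(sum1, sum2)
```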

hamsa
  • no, $cov(X,Y)=E[(X-\overline{X})(Y)]$ only if $E(Y)=0$. – utobi Nov 11 '22 at 10:42
  • 1
    If you assume unknown sample mean (which by your notation looks like to be the case), you are right, because the missing term $\sum (X - \bar X) (\bar Y)$ is zero – Firebug Nov 11 '22 at 12:16
  • See https://stats.stackexchange.com/a/18200/919 for an explanation of why you don't have to compute either mean. – whuber Nov 11 '22 at 14:28
  • @Firebug Thanks. I am asking not because the sample mean of Y is unknown in my case, but because I encountered a term equal to the "shorter" version in a proof, and interpreting it as a covariance would help me finish the proof. – hamsa Nov 11 '22 at 14:37

2 Answers


For the theoretical covariance, \begin{aligned} \text{Cov}(X,Y) &= \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \\ &= \mathbb{E}[(X-\mu_X)\cdot Y - (X-\mu_X)\cdot\mu_Y] \\ &= \mathbb{E}[(X-\mu_X)\cdot Y] - \mathbb{E}[(X-\mu_X)\cdot\mu_Y] \\ &= \mathbb{E}[(X-\mu_X)\cdot Y] - \mu_Y\cdot(\mathbb{E}[X]-\mu_X) \\ &= \mathbb{E}[(X-\mu_X)\cdot Y] - \mu_Y\cdot 0 \\ &= \mathbb{E}[(X-\mu_X)\cdot Y]. \\ \end{aligned}

For the sample covariance, \begin{aligned} \widehat{\text{Cov}}(X,Y) &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})(Y_i-\bar{Y})] \\ &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot Y_i - (X_i-\bar{X})\cdot\bar{Y}] \\ &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot Y_i] - \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot\bar{Y}] \\ &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot Y_i] - \bar{Y}\cdot\left(\frac{1}{n-1}\sum_{i=1}^n[X_i-\bar{X}]\right) \\ &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot Y_i] - \bar{Y}\cdot 0 \\ &= \frac{1}{n-1}\sum_{i=1}^n[(X_i-\bar{X})\cdot Y_i]. \\ \end{aligned}

Not subtracting the mean of the second variable works in both cases.
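A small NumPy sketch confirming the sample-covariance case above (the $n-1$ denominator matches NumPy's default for `np.cov`; the data is arbitrary and only illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
n = len(x)

# Full formula: both variables centered
full = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# One-sided formula: only X centered
one_sided = np.sum((x - x.mean()) * y) / (n - 1)

assert np.isclose(full, one_sided)
# Both match NumPy's built-in sample covariance (ddof=1 by default)
assert np.isclose(full, np.cov(x, y)[0, 1])
```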

Richard Hardy

Adding to Richard's answer, it is also possible to compute the covariance without ever computing the expectation (or the sum) of either variable, just based on the differences between pairs.

This is because, for $(i,j,k) \in \{1,\dots,n\}^3$,

$$\sum_{i,j,k}(X_i-X_j)(Y_i-Y_k)=\\ \sum_{i,j,k}(X_iY_i-X_jY_i - X_iY_k+X_jY_k)=\\ \sum_{i,j,k}(X_iY_i)-\sum_{i,j,k}(X_jY_i) - \sum_{i,j,k}(X_iY_k)+\sum_{i,j,k}(X_jY_k) $$

Going term by term: $$ \sum_{i,j,k}(X_iY_i) = n^2\sum_{i}(X_iY_i) = n^3E[XY]\\ \sum_{i,j,k}(X_jY_i) = \sum_{i,j,k}(X_iY_k)=\sum_{i,j,k}(X_jY_k) = n\sum_k\Big(Y_k\sum_jX_j\Big) = n^2E[X]\sum_kY_k=n^3E[X]E[Y] $$

So

$$ \frac{1}{n^3}\sum_{i,j,k}(X_i-X_j)(Y_i-Y_k) = E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y]\\ \frac{1}{n^3}\sum_{i,j,k}(X_i-X_j)(Y_i-Y_k) = E[XY] - E[X]E[Y] $$
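The triple-sum identity can be checked numerically with NumPy broadcasting (a sketch; note $E[\cdot]$ here denotes a sample mean, as in the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = rng.normal(size=n)

# Entry (i, j, k) of this array is (X_i - X_j)(Y_i - Y_k), built via broadcasting
triple = np.sum(
    (x[:, None, None] - x[None, :, None]) * (y[:, None, None] - y[None, None, :])
)
pairwise_cov = triple / n**3

# Population-style covariance (denominator n): E[XY] - E[X]E[Y]
pop_cov = np.mean(x * y) - x.mean() * y.mean()

assert np.isclose(pairwise_cov, pop_cov)
```

Note that no mean is ever subtracted inside the triple sum; the centering emerges from the pairwise differences alone.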

Firebug