
Given two vectors $X$ and $Y$ (length $n$, sampled from random variables), what is the name of the following quantity:

$$ \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n(x_i-y_j)^2 $$

I came up with this formula to quantify the variance between two vectors, and I guess it is either nonsense or -- given its triviality -- a well-known quantity. I know it is not the covariance between $X$ and $Y$, but what is it instead?

Edit: obviously, in the context of predictions with, e.g., data $X$ and predictions $Y$, this would correspond to the mean squared error (except for the normalisation constant, which would be $\frac{1}{n}$). I wonder, though, whether it has a name (and a meaning) in the context of statistics. See the comment by Stephan Kolassa.
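For concreteness, the double sum can be computed directly with broadcasting. A minimal sketch assuming numpy; the vectors here are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # a sample from X
y = rng.normal(size=100)  # a sample from Y

# (1/n^2) * sum_i sum_j (x_i - y_j)^2: broadcasting forms all n*n pairs
q = np.mean((x[:, None] - y[None, :]) ** 2)
print(q)
```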

edited by whuber
asked by monade
  • No, this would not be the mean squared error, because for the MSE, you would have paired data (to each prediction there corresponds one actual), and you would take the MSE only within each pair. Here, you are combining each $x$ with each $y$. – Stephan Kolassa Oct 10 '20 at 11:44
  • Thanks, good point! – monade Oct 10 '20 at 11:46
  • This is an empirical version of $\mathbb E[(X-Y)^2]$, empirical in the sense of the empirical distribution of the pair $(X,Y)$ under an assumption of independence. – Xi'an Oct 10 '20 at 12:54
  • It looks like you really want to be considering some multiple of $$\sum_{i=1}^n\sum_{j=1}^n (x_i-x_j)(y_i-y_j).$$ I illustrate this and describe its interpretation at https://stats.stackexchange.com/a/18200/919. – whuber Oct 10 '20 at 14:33
  • It’s an L2 norm... – Mithridates the Great Oct 10 '20 at 20:34
  • @whuber: Thanks! To clarify: I'm not interested in the degree to which $X$ and $Y$ co-vary, but rather in the variance of the pooled data of $X$ and $Y$, only considering cross-pairs of the data. The original motivation for this came from a different problem - I wondered how I could compute the variance between vectors $Y_1$ and $Y_2$, which correspond to the output variables of a model when run with two different values of a parameter. – monade Oct 11 '20 at 08:33

1 Answer


This is a measure of squared dispersion between two sets of values but not between paired values. I doubt it has a name.

Indeed, you do not need to have the same number of $x$ and $y$ values; using the $\frac1n$ form of the variance, you can say:

$$\frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n(x_i-y_j)^2 = (\bar x - \bar y)^2 + \widehat{\text{Var}}(x) + \widehat{\text{Var}}(y),$$

so it is a combination of the squared distance between the centres of the two sets plus the squared dispersions of the individual sets.
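The identity is easy to check numerically. A sketch assuming numpy (note that `np.var` uses the $\frac1n$ normalisation by default, and $m \neq n$ deliberately):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, size=50)   # m = 50 values
y = rng.normal(loc=-1.0, size=80)  # n = 80 values

# left-hand side: (1/(mn)) * sum_i sum_j (x_i - y_j)^2
lhs = np.mean((x[:, None] - y[None, :]) ** 2)

# right-hand side: squared distance between centres plus the two dispersions
rhs = (x.mean() - y.mean()) ** 2 + x.var() + y.var()

print(np.isclose(lhs, rhs))  # True
```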

edited by Xi'an
answered by Henry
  • Thanks, this makes sense! I wonder, given that the variance of a variable $X$ is the mean squared difference between all $(x_i, x_j)$, would it not make sense to call this quantity something like the between-variable variance? – monade Oct 10 '20 at 13:09
  • @monade You might then want to divide it by $2$, since $\frac{1}{n^2}\sum\limits_{i=1}^n\sum\limits_{j=1}^n(x_i-x_j)^2 = 2\,\text{Var}(x)$. – Henry Oct 10 '20 at 14:24
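Henry's identity from the last comment can be checked the same way (again a numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=60)

# (1/n^2) * sum_i sum_j (x_i - x_j)^2 over all pairs, including i = j
pairwise = np.mean((x[:, None] - x[None, :]) ** 2)

print(np.isclose(pairwise, 2 * x.var()))  # True
```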