
I have a somewhat convoluted question here. Suppose I have paired random variables $X$ and $Y$. That is, when I draw samples, I get one instance of $X$ and an associated instance of $Y$. Then I can calculate $$\operatorname{Cov}(\bar X, \bar Y) = \frac{\operatorname{Cov}(X, Y)}{n} $$ where $\bar{X}$ and $\bar{Y}$ are sample means computed from $n$ observations (see here).
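As a quick sanity check of that formula, here is a small Monte Carlo sketch (the bivariate normal population and the value $\operatorname{Cov}(X, Y) = 0.8$ are illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: bivariate normal with Cov(X, Y) = 0.8
cov_xy = 0.8
pop_cov = np.array([[1.0, cov_xy], [cov_xy, 1.0]])

n = 20           # observations per sample mean
reps = 100_000   # Monte Carlo repetitions

# Draw `reps` independent paired samples of size n and compute the paired means
samples = rng.multivariate_normal([0.0, 0.0], pop_cov, size=(reps, n))
xbar = samples[:, :, 0].mean(axis=1)
ybar = samples[:, :, 1].mean(axis=1)

empirical = np.cov(xbar, ybar)[0, 1]
theoretical = cov_xy / n
print(empirical, theoretical)  # both close to 0.04
```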

I want to do something similar, that is, calculate $$\operatorname{Cov}\left( \frac{X - \mu_X}{\sigma_X} - \frac{X - \bar{\mu}_X}{\bar{\sigma}_X}, \frac{Y - \mu_Y}{\sigma_Y} - \frac{Y - \bar{\mu}_Y}{\bar{\sigma}_Y} \right)$$ where $\mu_X, \mu_Y$ are the population means of $X$ and $Y$, $\sigma_X, \sigma_Y$ are the population standard deviations, $\bar\mu_X, \bar\mu_Y$ are the sample means of $X$ and $Y$ (with $n$ observations), and $\bar\sigma_X, \bar\sigma_Y$ are the sample standard deviations.

The idea is that I want to express the error (additive noise) of normalizing by sample statistics (instead of by the population statistics) in terms of the covariance of the original variables.

We can maybe make the problem simpler by assuming that $X$ and $Y$ are zero-mean and unit variance.

Edit: Since the notation is confusing, I'll try again in a different format that I'm more comfortable with. Ignore all the previous notation.

I have a data set of $N$ points $\{x_i, y_i\}_{i=1}^N$ in $\mathbb{R}^2$. Let's say these are independent and identically distributed samples. I can then calculate the moments $$\mu_x = \frac{1}{N} \sum_{i=1}^N x_i \quad \mu_y = \frac{1}{N} \sum_{i=1}^N y_i\\ \sigma_x^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu_x)^2 \quad \sigma_y^2 = \frac{1}{N} \sum_{i=1}^N (y_i - \mu_y)^2.$$ I can then normalize the data set like this, in a process commonly called $z$-scoring: $$\left\{ \left( \frac{x_i - \mu_x}{\sigma_x}, \frac{y_i - \mu_y}{\sigma_y} \right) \right\}_{i=1}^N.$$

However, suppose I have so many data points that I cannot properly calculate $\sigma_x, \sigma_y$ and $\mu_x, \mu_y$. Then I could instead estimate these parameters from a small sample of $n$ observations for some $n \ll N$. Write my sample indices as $\{i_1, \ldots, i_n\} \subset \{1, \ldots, N\}$, where the samples have been taken with replacement (I am not sure whether this assumption is necessary). I define the sample statistics like this: $$\hat\mu_x = \frac{1}{n} \sum_{j=1}^n x_{i_j} \quad \hat\mu_y = \frac{1}{n} \sum_{j=1}^n y_{i_j}\\ \hat\sigma_x^2 = \frac{1}{n} \sum_{j=1}^n (x_{i_j} - \hat\mu_x)^2 \quad \hat\sigma_y^2 = \frac{1}{n} \sum_{j=1}^n (y_{i_j} - \hat\mu_y)^2.$$ One thing I am not sure about is whether I need to divide by $n - 1$ or $n$ for this proof to work.

After doing this, there is going to be a small difference between the data set normalized using perfect information and the data set normalized using the small sample. The differences I am interested in are: $$ a_i = \frac{x_i - \mu_x}{\sigma_x} - \frac{x_i - \hat\mu_x}{\hat\sigma_x}\\ b_i = \frac{y_i - \mu_y}{\sigma_y} - \frac{y_i - \hat\mu_y}{\hat\sigma_y}. $$ I would like to express the covariance of this difference $(a_i, b_i)$ (the "noise term") in terms of the covariance of the original data set.
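This does not derive the closed form, but the quantity can at least be estimated by simulation. The sketch below assumes a bivariate normal population with $\rho = 0.5$ and reads "the covariance of $(a_i, b_i)$" as the covariance over the index $i$ within one data set, averaged over many calibration samples (one of the ways the question could be interpreted):

```python
import numpy as np

rng = np.random.default_rng(1)

rho, N, n, reps = 0.5, 50_000, 50, 500  # assumed sizes, for illustration
data = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)
x, y = data[:, 0], data[:, 1]

# "Population" statistics: moments of the full data set (1/N convention)
mu_x, mu_y = x.mean(), y.mean()
sd_x, sd_y = x.std(), y.std()
zx, zy = (x - mu_x) / sd_x, (y - mu_y) / sd_y

covs = []
for _ in range(reps):
    idx = rng.integers(0, N, size=n)   # sample indices, with replacement
    mx, my = x[idx].mean(), y[idx].mean()
    sx, sy = x[idx].std(), y[idx].std()
    a = zx - (x - mx) / sx             # noise from normalizing with sample stats
    b = zy - (y - my) / sy
    covs.append(np.cov(a, b)[0, 1])    # covariance over i, for this sample

print(np.mean(covs))  # small, shrinks as n grows
```

Empirically this comes out much smaller than $\rho / n$, which suggests the $z$-scoring noise term does not follow the same $1/n$ formula as the mean-only case below.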

When the normalization procedure is not $z$-scoring but rather just subtraction of the mean, the calculation boils down to the bilinearity of the covariance, as I linked above. This is what I mean (with potentially wrong notation): $$a_i = x_i - \mu_x - (x_i - \hat \mu_x) = \hat \mu_x - \mu_x\\ b_i = y_i - \mu_y - (y_i - \hat \mu_y) = \hat \mu_y - \mu_y \\ \Rightarrow \operatorname{Cov}(a, b) = \operatorname{Cov}(\hat\mu_x, \hat\mu_y) = \frac{1}{n}\operatorname{Cov}(x, y).$$ Note that here $a_i$ and $b_i$ do not depend on $i$, so the covariance is taken over the randomness of the calibration sample.
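The mean-only identity above can be checked numerically. Since $a$ and $b$ do not vary with $i$, the sketch takes the covariance over repeated draws of the calibration sample (the bivariate normal data set and $\rho = 0.5$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

rho, N, n, reps = 0.5, 50_000, 25, 100_000
data = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=N)
x, y = data[:, 0], data[:, 1]
mu_x, mu_y = x.mean(), y.mean()
cov_xy = np.mean((x - mu_x) * (y - mu_y))  # data-set covariance, 1/N convention

# a = mu_hat_x - mu_x and b = mu_hat_y - mu_y, one pair per calibration sample
idx = rng.integers(0, N, size=(reps, n))   # n indices with replacement, per rep
a = x[idx].mean(axis=1) - mu_x
b = y[idx].mean(axis=1) - mu_y

print(np.cov(a, b)[0, 1], cov_xy / n)      # should agree closely
```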

Open Season
  • Your notation is extraordinarily confusing in that it does not adequately distinguish between population parameters and random variables, nor between calculations and estimates. Regardless, you seem to ask how to compute the covariance of linear combinations of random variables: that's given by a basic formula called bilinearity which is applied in a large number of threads here. – whuber Apr 12 '22 at 21:45
  • 1
    @whuber It's not a linear combination due to dividing by the sample standard deviation which is a random variable itself and has some variance,right? – Open Season Apr 12 '22 at 21:57
  • As I wrote, I simply cannot tell what is a parameter, what is an observation, and what is a statistic. (Usually, to avoid confusion, one uses one kind of symbol, such as Greek letters, for parameters: that is, non-random variables; and another kind of symbol, such as capital latin letters, for random variables.) After you edit this post to clarify your notation, we'll be able to determine what your question really is. – whuber Apr 12 '22 at 22:06
  • @whuber Ok I see what you are saying, I have made another attempt at the notation – Open Season Apr 12 '22 at 23:07

0 Answers