
Please let me know whether the statement below is valid.

Suppose that $X$ is an $n\times p$ data matrix with $p$ features and $n$ data samples. Suppose further that each feature (column) is zero-centered, so that the average of each column is zero. Then the covariance matrix for $X$ is $\frac1nX^TX$.

It's such a simple and easy-looking question, but I had difficulty finding a reference for this exact problem. My proof, including the relevant definitions, is as follows.

Let $$ X= \begin{bmatrix} x_{11} &\cdots&x_{1p}\\ \vdots &\ddots&\vdots\\ x_{n1} &\cdots&x_{np}\\ \end{bmatrix} $$

Let the random variable $X_j$ follow the uniform distribution among the entries of the $j$th column for $j=1,2,\cdots,p$. Then, $\mathbb E[X_j]=\frac1n\sum_{i=1}^nx_{ij}=0$ and the covariance $s_{jk}$ of $X_j$ and $X_k$ is \begin{align*} s_{jk} &=\mathbb E[(X_j-0)(X_k-0)]\\ &=\mathbb E[X_jX_k]\\ &=\frac1n\sum_{i=1}^nx_{ij}x_{ik}. \end{align*}

Now, denote the covariance matrix of $X$ by $S=[s_{jk}]_{p\times p}$. Then

\begin{align*} S &=[s_{jk}]_{p\times p}\\ &=\left[\frac1n\sum_{i=1}^nx_{ij}x_{ik}\right]_{p\times p}\\ &=\frac1n \begin{bmatrix} \sum_{i=1}^nx_{i1}x_{i1}&\cdots &\sum_{i=1}^nx_{i1}x_{ip}\\ \vdots &\ddots &\vdots\\ \sum_{i=1}^nx_{ip}x_{i1}&\cdots &\sum_{i=1}^nx_{ip}x_{ip}\\ \end{bmatrix}\\ &=\frac1nX^TX \end{align*}
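The identity above can also be checked numerically. The following is only a minimal sketch, assuming NumPy is available; the helper name `centered_covariance` and the random test data are hypothetical and chosen purely for illustration. It compares $\frac1nX^TX$ with `np.cov(..., bias=True)`, which divides by $n$, and with the default `np.cov`, which divides by $n-1$.

```python
import numpy as np


def centered_covariance(X):
    """Return (1/n) X^T X for a data matrix X whose columns are zero-centered.

    Hypothetical helper for illustration; it does not center X itself.
    """
    n = X.shape[0]
    return X.T @ X / n


# Illustrative data: n = 50 samples, p = 3 features (arbitrary choices).
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 3))
X = raw - raw.mean(axis=0)  # zero-center each column
n = X.shape[0]

S = centered_covariance(X)

# bias=True makes np.cov normalize by n, matching the 1/n convention above.
S_population = np.cov(X, rowvar=False, bias=True)

# The default np.cov normalizes by n - 1 (the "sample" convention).
S_sample = np.cov(X, rowvar=False)

print(np.allclose(S, S_population))          # agreement with the 1/n convention
print(np.allclose(S * n / (n - 1), S_sample))  # rescaling recovers the 1/(n-1) convention
```

The two conventions differ only by the factor $\frac{n}{n-1}$, so which one counts as "the" covariance matrix of a data matrix depends on the definition adopted.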

sj.kim
  • This doesn't need a proof because it can be taken as a definition. If you want to prove something, then, you must have a different definition of covariance in mind. What is it? Your introduction of a random variable is superfluous and the following material appears only to calculate the matrix product. – whuber Jun 14 '23 at 14:00
  • @whuber I know what the covariance of two random variables is, which is $s_{jk}$ in the above computation. But I don't know what the definition of the covariance matrix of a data matrix is. So you are saying that $\frac1nX^TX$ is the definition of "the covariance matrix of a data matrix $X$"? Can you give me a reference that defines it? I didn't find one. Wikipedia discusses the covariance matrix of a family of random variables, not the covariance matrix of a data matrix. – sj.kim Jun 14 '23 at 14:09
  • One definition is given at https://stats.stackexchange.com/a/18200/919. Two more are given at https://stats.stackexchange.com/a/225758/919. If you have a definition of the variance of a set of numbers, such as at https://stats.stackexchange.com/a/222091/919, then it implies a definition of the covariance matrix via the polarization identity. Finally, https://stats.stackexchange.com/questions/17890/ discusses the distinction between this "population variance" and the "sample variance." – whuber Jun 14 '23 at 14:59
  • @whuber Thanks for your comment. I read the first three of them, written by you, without understanding everything thoroughly. I'm afraid they contain quite a lot of material. I know what the polarization identity is in the linear algebra sense, but understanding all the mathematical concepts related to your answers will take some time. I just want to know whether the above computation (or definition, if you like) is valid. (I became nervous because your posts involve one over n squared instead of one over n.) – sj.kim Jun 14 '23 at 23:54
