
Suppose we have the set $\{\mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the size of the data set and $\mathbf{x}_i \in \mathbb{R}^m$ is the $i$th regressor.

The question is simple: how do we compute the sample correlation matrix of these regressors?

AFAIK, the sample correlation matrix (assuming a standardized data set) is given by

$$ \hat{\mathbf{R}} = \dfrac{1}{N} \sum_{j=1}^{N} \mathbf{x}_j \mathbf{x}_j^\top $$
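
For concreteness, here is a minimal numpy sketch of that estimator on synthetic data (the data and variable names below are my own, purely illustrative):

```python
import numpy as np

# Hypothetical data: N regressors of dimension m (synthetic, for illustration only)
rng = np.random.default_rng(0)
N, m = 500, 3
X = rng.standard_normal((N, m))          # row i is the regressor x_i

# Standardize each variable (zero mean, unit variance), as assumed above
X = (X - X.mean(axis=0)) / X.std(axis=0)

# R_hat = (1/N) * sum_j x_j x_j^T, written compactly as a matrix product
R_hat = X.T @ X / N

# Sanity check against numpy's built-in correlation estimate
assert np.allclose(R_hat, np.corrcoef(X, rowvar=False))
```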

However, in Simon Haykin's book:

[Image: Equation (2.30) from the book, giving the time-averaged correlation matrix as a double sum over trials $i$ and $j$ of the outer products $\mathbf{x}_i \mathbf{x}_j^\top$, with a leading negative sign.]

I confess I've never seen such an expression. It makes no sense to me (let alone the negative sign).

I am linking his book, which can be found on the internet here (if you want to follow this discussion, it is on PDF page 106).

  • AFAIR related problems have been addressed here before, but I am not able to trace any right now. – User1865345 Jul 04 '23 at 09:16
  • Your $\hat R$ is not the sample correlation matrix: to make it so, you would need to standardize the variables before applying the formula. I have a suspicion that $(2.30)$ has typographical errors. For instance, possibly the initial "$-$" was intended to be "$\frac{1}{N^2}$" but only the horizontal bar survived the printing process. – whuber Jul 04 '23 at 14:08
  • @whuber "you would need to standardize the variables before applying the formula". Yes, I was assuming a standardized dataset. – Rubem Pacelli Jul 04 '23 at 14:25
  • @RubemPacelli, I just noticed the author maintains a list of errata in his page. You can check that. – User1865345 Jul 04 '23 at 15:29
  • This makes perfect sense in context. The negative sign really doesn't belong there, strictly speaking, and there is a missing constant for the averaging, but the same changes are made to $\mathbf r_{dx}$ in $(2.30)$ and thereby cancel. The basis for calling this "time averaged" is given at the top of the next page (76, not 105!). – whuber Jul 04 '23 at 15:45
  • @User1865345-solidarityMods His page? Which page? – Rubem Pacelli Jul 04 '23 at 18:31
  • @whuber "76, not 105" In my post I was referring to the PDF page, not the book page. Even though, I made a mistake. Now it is corrected. – Rubem Pacelli Jul 04 '23 at 18:39
  • @whuber "there is a missing constant for the averaging, but the same changes are made to $\mathbf{}_{}$ in (2.30) and thereby cancel." Indeed, the minus signs of cancel. But there is two problems: 1- It makes no sense calling (2.30) the estimate of the correlation matrix. – Rubem Pacelli Jul 04 '23 at 18:42
  • @whuber 2. The double summation makes absolutely no sense; it should be only one. If you differentiate (2.28) with respect to $\mathbf{w}$, you will find that $\hat{\mathbf{r}}_{xd} = \sum_{i=1}^{N} \mathbf{x}_i d_i$ and $\hat{\mathbf{R}}_{xx} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^\top$ (as you said, the constant is missing). – Rubem Pacelli Jul 04 '23 at 18:46
  • @RubemPacelli, I am not following this in detail, but, as per whuber's earlier comment, if there is any typo, then you might find it here. – User1865345 Jul 04 '23 at 18:48

1 Answer


This is part of a derivation of the solution for an L2-penalized linear regression (ridge regression), as demonstrated on this page.

With centered predictors, let the model matrix for the regression be the $N$ (rows, observations) by $m$ (columns, predictor variables) matrix $X$. With an $N \times 1$ outcome vector $y$, the parameter estimates from the L2-penalized regression at penalty $\lambda$ are given by:

$$\hat \beta = (X^TX + \lambda I)^{-1}X^Ty.$$
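
As a sanity check on notation, here is a minimal numpy sketch of this closed form on synthetic data (the function name `ridge_closed_form` and the data are illustrative assumptions, not from the cited text):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """(X^T X + lam * I)^{-1} X^T y, computed via a linear solve
    rather than an explicit matrix inverse (better numerically)."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Synthetic example with centered predictors
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

beta_hat = ridge_closed_form(X, y, lam=0.5)   # shrunk toward zero relative to OLS
```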

The text you cite uses $d$ instead of $y$ for the vector of outcomes and $w$ instead of $\beta$ for the vector of parameters, but I'm trying to keep to what seem to be the usual choices of variable names on this site.

Putting aside the (unfortunate?) choice of (offsetting?) negative signs in the cited text (noted in the comments), what's called "the time-averaged $M$-by-$M$ correlation matrix" in the text is supposed to be equivalent to $X^TX$ above. That's evident on the next page of the text, where the "normal equation" for linear regression is discussed.

So the question can be re-phrased: does the formula in Equation 2.30 give the same matrix as $X^TX$ (except perhaps for sign)?

Let's index $X^TX$ by $p$ for rows and $q$ for columns. Then the $p,q$ element of $X^TX$ is $\sum_{i=1}^N X_{i,p} X_{i,q}$. There are no products involving predictor values at different times/trials.
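
That identity is easy to verify numerically; a minimal sketch with synthetic data (indices chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 50, 4
X = rng.standard_normal((N, m))

G = X.T @ X                               # the m-by-m matrix X^T X
p, q = 1, 3                               # arbitrary row/column of G
elementwise = sum(X[i, p] * X[i, q] for i in range(N))
assert np.isclose(G[p, q], elementwise)   # only same-trial products appear
```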

Now consider "the outer product[s] of the regressors $x_i$ and $x_j$, applied to the environment on the $i$th and $j$th experimental trials" the are summed to get the matrix in Equation 2.30. In that context, $x_i$ and $x_j$ are the $m \times 1$ vectors of predictors at times/trials $i$ resp. $j$, equivalent to rows $i$ and $j$ of the model matrix $X$.

For a single choice of $i,j$, the $p,q$ element of that outer product, in terms of the model matrix $X$ defined above, would be $X_{i,p} X_{j,q}$. Even for a single choice of $i,j$ with $i \neq j$, that doesn't make sense; it involves products of values of different predictors at different times/trials.
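
A short numerical check makes the problem concrete. Because $\sum_i \sum_j \mathbf{x}_i \mathbf{x}_j^\top = \left(\sum_i \mathbf{x}_i\right)\left(\sum_j \mathbf{x}_j\right)^\top$, the double sum collapses to the outer product of the column sums of $X$, which for centered predictors is (numerically) the zero matrix rather than $X^TX$. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 50, 4
X = rng.standard_normal((N, m))
X -= X.mean(axis=0)                 # centered predictors, as in the setup above

# The double sum of outer products that Equation 2.30 (sign aside) would give
double_sum = sum(np.outer(X[i], X[j]) for i in range(N) for j in range(N))

# It factors as the outer product of the column sums ...
s = X.sum(axis=0)
assert np.allclose(double_sum, np.outer(s, s))

# ... which is (numerically) zero for centered data, unlike X^T X
assert np.allclose(double_sum, np.zeros((m, m)))
assert not np.allclose(X.T @ X, np.zeros((m, m)))
```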

Unless I'm missing something, that does seem to be an error in the text. The text does get the form correct in subsequent references to the input-vector correlation matrix, as in Equation 3.30.

EdM
  • I've made a revision, but please, reject it. I misunderstood. – Rubem Pacelli Jul 05 '23 at 18:26
  • @RubemPacelli yes, there is a problem that I need to fix here, thanks for catching it. You caught part of the problem: I had confused the rows/columns of $X^TX$ with those of the model matrix $X$ and the way that I rejected the proposed edit was perhaps misleading. I'll fix that soon. – EdM Jul 05 '23 at 19:43
  • @RubemPacelli I think I fixed this in a way consistent with my indexing of the rows/columns of $X^TX$ as $p/q$. Then the $i$ and $j$ values do represent rows in the model matrix $X$. I've now fixed (I think) the indexing in the sums, etc. – EdM Jul 05 '23 at 19:53
  • Yes. Now it is consistent with what I am used to, that is, $X \in \mathbb{R}^{N\times m}$ (the predictors go in the columns). I have no time at the moment, but I will contribute to your answer with the mathematical derivation of the optimal solution ($\hat{\beta}$) for the ridge regressor. This should help clarify the question. – Rubem Pacelli Jul 06 '23 at 11:02