
When we derive the estimates of $\vec{\beta}$ that minimize the sum of squared errors ($SSE$), we begin with $\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik}))^2$. This is equivalent to the matrix-vector expression $(X \vec{\beta} - \vec{y})^T(X \vec{\beta} - \vec{y})$ (correct me if this is wrong), which equals $\vec{y}^T\vec{y} - \vec{y}^T X\vec{\beta} - \vec{\beta}^TX^T \vec{y} + \vec{\beta}^TX^T X\vec{\beta}$; I was able to make sense of that by writing it all out.

It makes sense to me that we want to take the partial derivative with respect to $\vec{\beta}$ (the gradient) and set it equal to zero, like so: $\frac{\partial}{\partial{\vec{\beta}}}\left(\vec{y}^T\vec{y} - \vec{y}^T X\vec{\beta} - \vec{\beta}^TX^T \vec{y} + \vec{\beta}^TX^T X\vec{\beta}\right) = 0$

What I do not understand is why the result equals $-2X^T \vec{y} + 2X^T X \vec{\beta}$, particularly the second term, $2X^T X \vec{\beta}$.

Could someone possibly explain how that result is obtained? I just don't get it. Thanks


1 Answer


It involves the following basic rules of vector differentiation:

\begin{align}
\frac{\partial}{\partial \mathbf a}\mathbf a^\top\mathbf b &= \frac{\partial}{\partial \mathbf a}\mathbf b^\top\mathbf a = \mathbf b, \\
\frac{\partial}{\partial \mathbf a}\mathbf a^\top\mathbf A\mathbf a &= \left(\mathbf A+\mathbf A^\top\right)\mathbf a \\
&= 2\mathbf A\mathbf a \quad \text{when } \mathbf A \text{ is symmetric.}
\end{align}
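Applied to your expression, with $\mathbf b = X^\top\vec{y}$ and $\mathbf A = X^\top X$ (which is symmetric, since $(X^\top X)^\top = X^\top X$), a term-by-term sketch looks like this:

\begin{align}
\frac{\partial}{\partial \vec\beta}\,\vec{y}^\top\vec{y} &= \vec 0 && \text{(no dependence on } \vec\beta\text{)}, \\
\frac{\partial}{\partial \vec\beta}\left(-\vec{y}^\top X\vec\beta\right) &= \frac{\partial}{\partial \vec\beta}\left(-(X^\top\vec{y})^\top\vec\beta\right) = -X^\top\vec{y}, \\
\frac{\partial}{\partial \vec\beta}\left(-\vec\beta^\top X^\top\vec{y}\right) &= -X^\top\vec{y}, \\
\frac{\partial}{\partial \vec\beta}\,\vec\beta^\top X^\top X\vec\beta &= 2X^\top X\vec\beta.
\end{align}

Summing the four pieces gives the gradient $-2X^\top\vec{y} + 2X^\top X\vec\beta$; setting it to zero yields the normal equations $X^\top X\vec\beta = X^\top\vec{y}$.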

The title of the question seems to suggest that the two expressions are equal. In fact, it is after differentiating the former that we obtain the latter.
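If it helps to see it numerically, here is a small NumPy sketch (the data, dimensions, and $\vec\beta$ below are arbitrary random draws, purely for illustration) comparing the analytic gradient $-2X^\top\vec{y} + 2X^\top X\vec\beta$ with a central finite-difference approximation of the SSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept column
y = rng.normal(size=n)
beta = rng.normal(size=k + 1)          # an arbitrary point at which to evaluate the gradient

def sse(b):
    r = X @ b - y
    return r @ r

# analytic gradient: -2 X^T y + 2 X^T X beta
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# central finite-difference gradient of the SSE
eps = 1e-6
grad_numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(k + 1)
])

print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```

The same check works for any $X$, $\vec{y}$, and $\vec\beta$ you plug in.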
