
Logistic Regression has two possible formulations depending on how we encode the target variable: $y \in \{0,1\}$ or $y \in \{-1,1\}$.

This question discusses the derivation of the Hessian of the loss function when $y \in \{0,1\}$. The following is about deriving the Hessian when $y \in \{-1,1\}$.

The loss function can be written as

$$\mathcal{L}(\beta) = \frac{-1}{n} \sum_{i=1}^{n} \log \sigma(y_i\beta^{T}x_i),$$

where $y_i \in \{-1, 1\},$ $x_i \in \mathbb{R}^p,$ $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, and $n$ is the number of examples in $X$.

I'm looking to calculate the Hessian of this loss.

$\nabla \mathcal{L}(\beta)$ can be calculated as follows:

Let $l_i(\beta) = - \log \sigma(y_iz_i),$ where $z_i = \beta^{T}x_i$,

$$\frac{\partial l_i(\beta)}{\partial \beta} = \frac{-1}{\sigma(y_i z_i)}\,\sigma(y_i z_i)\bigl(1 - \sigma(y_i z_i)\bigr)\cdot\frac{\partial (y_i z_i)}{\partial \beta}$$

$$\frac{\partial l_i(\beta)}{\partial \beta} = -\bigl(1 - \sigma(y_i z_i)\bigr)\, y_i x_i$$

$$\frac{\partial l_i(\beta)}{\partial \beta} = \bigl(\sigma(y_i \beta^{T} x_i) - 1\bigr)\, y_i x_i$$

Averaging over all the $n$ examples:

$$ \nabla \mathcal{L}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\sigma(y_i \beta^{T} x_i) - 1\bigr)\, y_i x_i $$
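
For what it's worth, a quick finite-difference check of this gradient seems to agree; the sketch below uses made-up random data and throwaway function names, just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    # L(beta) = -(1/n) * sum_i log sigma(y_i * beta^T x_i), with y_i in {-1, +1}
    return -np.mean(np.log(sigmoid(y * (X @ beta))))

def grad(beta, X, y):
    # (1/n) * sum_i (sigma(y_i * beta^T x_i) - 1) * y_i * x_i
    return X.T @ ((sigmoid(y * (X @ beta)) - 1.0) * y) / len(y)

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.choice([-1.0, 1.0], size=n)
beta = rng.normal(size=p)

# central finite differences, one coordinate of beta at a time
eps = 1e-6
fd = np.array([(loss(beta + eps * e, X, y) - loss(beta - eps * e, X, y)) / (2 * eps)
               for e in np.eye(p)])
print(np.allclose(fd, grad(beta, X, y)))  # expect True
```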

I'm not sure how to proceed with calculating $\nabla^{2} \mathcal{L}(\beta)$. Any pointer is appreciated.


1 Answer


One way to compute the Hessian in matrix form is to use differentials (shameless self-promotion).

Indeed, let's introduce $dx$, an infinitesimally small vector of the same length as $x$. Then (assuming the Hessian exists and is well-defined, yada-yada-yada) $$ f(x+dx) = f(x) + \nabla f(x)^T dx + \tfrac{1}{2} dx^T \nabla^2 f(x)\, dx + O(\|dx\|^3) $$ So we just need to expand $\mathcal{L}(\beta + d\beta)$ to second order in $d\beta$; the matrix appearing inside the quadratic term $\tfrac{1}{2}\, d\beta^T (\cdot)\, d\beta$ is the Hessian.

First, define $g(x) = \log \sigma(x)$, then $$ \begin{align*} \mathcal{L}(\beta + d\beta) &= -\frac{1}{N} \sum_{n=1}^N g( y_n (\beta+d\beta)^T x_n ) = -\frac{1}{N} \sum_{n=1}^N g( y_n \beta^T x_n + y_n d\beta^T x_n ) \\ &= -\frac{1}{N} \sum_{n=1}^N \left[ g(y_n \beta^T x_n) + g'(y_n \beta^T x_n) y_n d\beta^T x_n + \tfrac{1}{2} g''(y_n \beta^T x_n) (y_n d\beta^T x_n)^2 \right] + O(\|d\beta\|^3) \\ &= \mathcal{L}(\beta) - \frac{1}{N} \sum_{n=1}^N \left[g'(y_n \beta^T x_n) y_n x_n^T d\beta + \tfrac{1}{2} d\beta^T g''(y_n \beta^T x_n) y_n^2 x_n x_n^T d\beta \right] + O(\|d\beta\|^3) \\ &= \mathcal{L}(\beta) + \left(\underbrace{- \frac{1}{N} \sum_{n=1}^N g'(y_n \beta^T x_n) y_n x_n }_{\nabla \mathcal{L}(\beta)}\right)^T d\beta \\ & \quad\quad\quad + \tfrac{1}{2} d\beta^T \left(\underbrace{- \frac{1}{N} \sum_{n=1}^N g''(y_n \beta^T x_n) y_n^2 x_n x_n^T}_{\nabla^2 \mathcal{L}(\beta)} \right) d\beta + O(\|d\beta\|^3) \end{align*} $$

Now, $g(x)$ is a simple scalar function, so its derivatives are easy to compute (using $\sigma'(x) = \sigma(x)\sigma(-x)$ and $1 - \sigma(x) = \sigma(-x)$): $$ \begin{align*} g'(x) &= \sigma(-x) \\ g''(x) &= -\sigma(-x) \sigma(x) \end{align*} $$
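
These identities are also easy to confirm numerically; here is a throwaway finite-difference check (the names `g` and `sigmoid` are ad hoc, nothing standard):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    return np.log(sigmoid(x))

x, eps = 0.7, 1e-4
g1_fd = (g(x + eps) - g(x - eps)) / (2 * eps)           # finite-difference g'
g2_fd = (g(x + eps) - 2 * g(x) + g(x - eps)) / eps**2   # finite-difference g''
print(np.isclose(g1_fd, sigmoid(-x)))                   # g'(x)  =  sigma(-x)
print(np.isclose(g2_fd, -sigmoid(-x) * sigmoid(x)))     # g''(x) = -sigma(-x) * sigma(x)
```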

First, let's do a sanity check with $\nabla \mathcal{L}(\beta)$: $$ \begin{align*} \nabla \mathcal{L}(\beta) &= - \frac{1}{N} \sum_{n=1}^N g'(y_n \beta^T x_n) y_n x_n \\ &= \frac{1}{N} \sum_{n=1}^N -\sigma(-y_n \beta^T x_n) y_n x_n \\ &= \frac{1}{N} \sum_{n=1}^N (\sigma(y_n \beta^T x_n)-1) y_n x_n \end{align*} $$ where I used the fact that $\sigma(-x) = 1-\sigma(x)$. The result matches the gradient computed by the question's author.

Finally, the Hessian is $$ \begin{align*} \nabla^2 \mathcal{L}(\beta) &= \frac{1}{N} \sum_{n=1}^N \sigma(y_n \beta^T x_n) \sigma(-y_n \beta^T x_n) y_n^2 x_n x_n^T \\ &= \frac{1}{N} \sum_{n=1}^N \sigma(\beta^T x_n) \sigma(-\beta^T x_n) x_n x_n^T \end{align*} $$ Here I dropped $y_n$: since $y_n \in \{-1, +1\}$ we have $y_n^2 = 1$, and the product $\sigma(a)\sigma(-a)$ is unchanged when $a$ flips sign.
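
In code this Hessian can be assembled as $X^T W X / N$ with a diagonal weight matrix $W$. The sketch below (random data, ad hoc names) compares that vectorized formula against a finite-difference Hessian of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, y):
    return -np.mean(np.log(sigmoid(y * (X @ beta))))

def hessian(beta, X):
    # (1/N) * sum_n sigma(beta^T x_n) sigma(-beta^T x_n) x_n x_n^T  ==  X^T W X / N
    z = X @ beta
    w = sigmoid(z) * sigmoid(-z)           # diagonal of W
    return (X * w[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(1)
N, p = 40, 3
X = rng.normal(size=(N, p))
y = rng.choice([-1.0, 1.0], size=N)
beta = rng.normal(size=p)

# finite-difference Hessian of the loss, entry by entry
eps = 1e-4
I = np.eye(p)
fd = np.array([[(loss(beta + eps * (I[i] + I[j]), X, y)
                 - loss(beta + eps * (I[i] - I[j]), X, y)
                 - loss(beta - eps * (I[i] - I[j]), X, y)
                 + loss(beta - eps * (I[i] + I[j]), X, y)) / (4 * eps**2)
                for j in range(p)] for i in range(p)])
print(np.allclose(fd, hessian(beta, X), atol=1e-6))  # expect True
```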

Curiously enough, in this formulation the Hessian is exactly the same as in the 0-1 formulation! But perhaps this shouldn't be all that surprising given that it's a minor reformulation of the same optimization problem.
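
One way to see this directly is that the two losses are the same function of $\beta$ once the labels are relabeled via $y_{01} = (y_{\pm 1} + 1)/2$. A quick numerical check of that claim (again with ad hoc function names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_pm1(beta, X, y_pm1):
    # -(1/n) sum_i log sigma(y_i beta^T x_i),  y_i in {-1, +1}
    return -np.mean(np.log(sigmoid(y_pm1 * (X @ beta))))

def loss_01(beta, X, y01):
    # the usual cross-entropy with y_i in {0, 1}
    p = sigmoid(X @ beta)
    return -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y_pm1 = rng.choice([-1.0, 1.0], size=30)
y01 = (y_pm1 + 1) / 2               # relabel -1/+1 as 0/1
beta = rng.normal(size=4)

print(np.isclose(loss_pm1(beta, X, y_pm1), loss_01(beta, X, y01)))  # expect True
```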

  • The final paragraph is the most important (maybe the rest of the post is actually superfluous and just repeating the answer to the linked question). The loss function didn't change at all, so the Hessian, the second derivative of that loss function, won't change either. – Sextus Empiricus Aug 19 '22 at 07:26