
I have to minimize a function $f(\mathbf{x})$, where the vector $\mathbf{x}\in\mathbb{R}^n$ satisfies $|\mathbf{x}|=1$. So I tweaked the code of $f$ so that it renormalizes $\mathbf{x}$ as the first step, and this allows me to avoid adding the constraint to the minimization algorithm. At the end of the optimization I can simply renormalize the result. This is working well.

Now I'd like to help the algorithm further by supplying the gradients of $f$. I calculated them by hand and I'm happy that they turned out to be fairly simple expressions, but I'm having trouble because my "code for $f$" effectively computes $g(\mathbf{x}) = f(\mathbf{x}/|\mathbf{x}|)$, while the gradient that I calculated is actually the gradient of $f(\mathbf{x})$. Can I get around this problem without having to calculate the gradient of $g$?
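For concreteness, the renormalizing trick described above can be sketched as follows. The quadratic objective and the use of `scipy.optimize.minimize` are illustrative assumptions, not the actual setup:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical objective: f is only meaningful on the unit sphere.
# Here f(x) = x^T A x, whose minimum on |x| = 1 is the smallest
# eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])

def f(x):
    return x @ A @ x

def g(x):
    x = x / np.linalg.norm(x)  # renormalize as the first step
    return f(x)

# Unconstrained minimization of g; no explicit constraint needed.
res = minimize(g, x0=np.array([1.0, 0.5]))
x_opt = res.x / np.linalg.norm(res.x)  # renormalize the result
```

Since $g$ is constant along rays from the origin, the unconstrained minimizer can land anywhere on the minimizing ray; the final renormalization recovers the point on the sphere.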

Ziofil
    You can use your tweaked code to minimize $f$ using the gradient of $f$; in effect you are renormalizing after taking a gradient step. This is known as projected gradient descent. – Christian Clason Feb 09 '17 at 08:15
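The projected gradient descent mentioned in the comment can be sketched like this; the quadratic objective, its gradient, and the step size are hypothetical stand-ins for illustration:

```python
import numpy as np

# Projected gradient descent on the unit sphere: take a plain gradient
# step of f, then renormalize (project back onto |x| = 1).
# Hypothetical example: f(x) = x^T A x, so grad f(x) = 2 A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: 2 * A @ x

x = np.array([1.0, 0.5])
x /= np.linalg.norm(x)
for _ in range(500):
    x = x - 0.1 * grad_f(x)   # unconstrained gradient step
    x /= np.linalg.norm(x)    # project back onto the sphere
```

For this objective the iteration converges to the eigenvector of the smallest eigenvalue of $A$, the constrained minimizer.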

2 Answers


With your current optimization strategy, you cannot simply use the gradient of $f(\cdot)$ evaluated at $\frac{\boldsymbol{x}}{\left| \boldsymbol{x}\right|}$, since normalization is not linear in $\boldsymbol{x}$. However, you can apply the chain rule: write the objective as $f(\boldsymbol{u}(\boldsymbol{x}))$ where $\boldsymbol{u}(\boldsymbol{x}) = \frac{\boldsymbol{x}}{\left| \boldsymbol{x}\right|}$. The gradient you derived by hand then represents $\vec{\nabla}_{u} f$, whereas what you want is $\vec{\nabla}_{x} f$.

Using chain rule, we can show the following:

\begin{align} \frac{\partial f}{\partial x_{p}} &= \frac{\partial f}{\partial u_{k}} \frac{\partial u_{k}}{\partial x_{p}} \end{align}

We can find $\frac{\partial u_{k}}{\partial x_{p}} \;\forall k,p$ from $u_k = x_{k} \left(x_l x_l\right)^{-1/2}$ (summation over repeated indices implied) as follows:

\begin{align} \frac{\partial u_{k}}{\partial x_{p}} &= \frac{\partial}{\partial x_{p}} \left( x_{k} \left(x_l x_l\right)^{-1/2} \right) \\ &= \delta_{kp} \left(x_l x_l\right)^{-1/2} - x_{k} x_{p} \left(x_l x_l\right)^{-3/2} \\ &= \left(\delta_{kp} - \frac{x_{k} x_{p}}{\left(x_l x_l\right)} \right) \left(x_l x_l\right)^{-1/2} \end{align}

This expression can be simplified into the following in matrix form: \begin{align} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x} } &= \frac{1}{\left| \boldsymbol{x}\right|} \left( I - \frac{\boldsymbol{x} \boldsymbol{x}^{T}}{\left| \boldsymbol{x}\right|^2}\right) \end{align}

Thus, assuming you are defining the gradients as column vectors, you get the following relationship to compute what you want:

\begin{align} \vec{\nabla}_{x} f &= \frac{1}{\left| \boldsymbol{x}\right|} \left( I - \frac{\boldsymbol{x} \boldsymbol{x}^{T}}{\left| \boldsymbol{x}\right|^2}\right) \vec{\nabla}_{u} f \end{align}
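As a sanity check, this relationship can be verified numerically. The quadratic $f$ below is a hypothetical stand-in for the actual objective; the hand-derived $\vec{\nabla}_{u} f$ is mapped through the Jacobian and compared against a finite-difference gradient of $g(\mathbf{x}) = f(\mathbf{x}/|\mathbf{x}|)$:

```python
import numpy as np

# Hypothetical objective for illustration: f(u) = u^T A u.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda u: u @ A @ u
grad_f = lambda u: 2 * A @ u          # the "naive" gradient wrt u

def grad_g(x):
    """Gradient of g(x) = f(x/|x|) via the projection Jacobian."""
    n = np.linalg.norm(x)
    u = x / n
    J = (np.eye(len(x)) - np.outer(u, u)) / n   # (I - xx^T/|x|^2)/|x|
    return J @ grad_f(u)

# Finite-difference gradient of g for comparison.
x = np.array([0.7, -1.3])
g = lambda x: f(x / np.linalg.norm(x))
eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
               for e in np.eye(len(x))])
```

Note that the resulting gradient is orthogonal to $\boldsymbol{x}$, as it must be: $g$ is constant along rays from the origin.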

spektr
  1. Does your optimization guarantee that $|\mathbf{x}|=1$ at all times? If not, you have to renormalize at each step, and your function effectively changes.
  2. If you always stay on the unit sphere, then simply treat $\mathbf{x}_{new}=\mathbf{x}/|\mathbf{x}|$ and optimize $f(\mathbf{x}_{new})$. This will not change your optimization. You can scale back at the end of the optimization by the norm to return to the scale of the data.
Tolga Birdal
  • I am indeed optimizing $f(\mathbf{x}_{new})$, but the algorithm would need $\vec\nabla f$ with respect to $\mathbf{x}$, not with respect to $\mathbf{x}_{new}$, because $f(\mathbf{x}_{new}) = f(\mathbf{x}/|\mathbf{x}|)$ – Ziofil Feb 08 '17 at 23:52
  • But why is this the case if you could just fix $x_{new}$ once? Just compute $\nabla f_{new}$? – Tolga Birdal Feb 09 '17 at 00:21