Why are my steps getting smaller when using fixed step size in gradient descent?

Question

Suppose we are working through a toy example of gradient descent, minimizing the quadratic function $x^TAx$ with a fixed step size $\alpha=0.03$, where $A=[10, 2; 2, 3]$.

If we plot the trace of $x$ over the iterations, we get the following figure. Why do the points get so much denser when the step size is fixed? Intuitively, it looks like a decreasing step size rather than a fixed one.

PS: R code, including the plot:

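A=rbind(c(10,2),c(2,3))
# objective: f(x) = x^T A x
f <-function(x){
  v=t(x) %*% A %*% x
  as.numeric(v)
}
# gradient of f: 2 A x
gr <-function(x){
  v = 2* A %*% x
  as.numeric(v)
}

# contour plot of f on a grid over [-2, 2] x [-2, 2]
x1=seq(-2,2,0.02)
x2=seq(-2,2,0.02)
df=expand.grid(x1=x1,x2=x2)
contour(x1,x2,matrix(apply(df, 1, f),ncol=sqrt(nrow(df))), labcex = 1.5,
        levels=c(1,3,5,10,20,40))
grid()

# gradient descent with fixed alpha, starting from (-2, -2);
# stop once f(x) is within 1e-6 of the known optimum 0
opt_v=0
alpha=3e-2
x_trace=c(-2,-2)
x=c(-2,-2)
while(abs(f(x)-opt_v)>1e-6){
  x=x-alpha*gr(x)
  x_trace=rbind(x_trace,x)
}
# draw the iterates and label each one with its iteration number
points(x_trace, type='b', pch= ".", lwd=3, col="red")
text(x_trace, as.character(1:nrow(x_trace)), col="red")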

Your code doesn't match your description: it uses alpha=3e-2 rather than $0.01$. — whuber, Jan 04 '18 at 20:46

score 14 · Accepted Answer · edited Jan 05 '18 at 03:08

Let $f(x) = \frac 12 x^T A x$ where $A$ is symmetric and positive definite (I think this assumption is safe based on your example; the factor $\frac 12$ only rescales the gradient, and hence the effective step size, by a constant). Then $\nabla f(x) = Ax$ and we can diagonalize $A$ as $A = Q\Lambda Q^T$. Use the change of basis $y =Q^T x$. Then we have $$ f(y) = \frac 12 y^T \Lambda y \implies \nabla f(y) = \Lambda y. $$

$\Lambda$ is diagonal so we get our updates as $$ y^{(n+1)} = y^{(n)} - \alpha \Lambda y^{(n)} = (I - \alpha \Lambda)y^{(n)} = (I - \alpha \Lambda)^{n+1}y^{(0)}. $$

This means that the factors $1 - \alpha \lambda_i$ govern the convergence, and we only get convergence if $|1 - \alpha \lambda_i| < 1$ for every $i$. In your case we have $$ \Lambda \approx \left(\begin{array}{cc} 10.5 & 0 \\ 0 & 2.5\end{array}\right) $$ so, taking $\alpha = 0.01$, $$ I - \alpha \Lambda \approx \left(\begin{array}{cc} 0.89 & 0 \\ 0 & 0.98\end{array}\right). $$
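These factors can be checked numerically. The following is only a sketch in R: the helper `contraction` and its `scale` argument are names introduced here for illustration. `scale = 1` corresponds to the $\frac 12 x^T A x$ convention of this answer, while `scale = 2` corresponds to the gradient `2 * A %*% x` in the question's code.

```r
A <- rbind(c(10, 2), c(2, 3))
lambda <- eigen(A, symmetric = TRUE)$values   # eigenvalues, largest first

# Per-iteration contraction factor 1 - alpha * scale * lambda_i
# along each eigenvector (hypothetical helper, not from the question).
contraction <- function(alpha, scale = 1) 1 - alpha * scale * lambda

round(lambda, 2)                        # 10.53  2.47
round(contraction(0.01), 2)             #  0.89  0.98
round(contraction(0.03, scale = 2), 2)  #  0.37  0.85
```

With `alpha = 0.01` this reproduces the $0.89$ and $0.98$ above. For the code as posted (`alpha = 3e-2` and gradient $2Ax$) the same calculation gives roughly $0.37$ and $0.85$, so the qualitative picture is unchanged: one direction contracts quickly, the other slowly.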

Convergence is relatively quick in the direction of the eigenvector with eigenvalue $\lambda \approx 10.5$, as seen by how fast the iterates descend the steeper part of the paraboloid. It is slow in the direction of the eigenvector with the smaller eigenvalue, because $0.98$ is so close to $1$. So even though the learning rate $\alpha$ is fixed, the actual magnitudes of the steps in this direction decay approximately like $(0.98)^n$, so each step is a little shorter than the one before. That is the cause of the exponential-looking slowdown in progress along this direction (it happens in both directions, but the other direction gets close enough soon enough that we don't notice or care). In this case convergence would be much faster if $\alpha$ were increased.
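To see the geometric decay of the step lengths directly, here is a small sketch in R that reuses the question's gradient, step size, and starting point. The variable names `step_len` and `ratios` and the choice of 30 iterations are assumptions made for illustration.

```r
A <- rbind(c(10, 2), c(2, 3))
gr <- function(x) as.numeric(2 * A %*% x)   # gradient of x^T A x, as in the question

alpha <- 0.03
x <- c(-2, -2)
step_len <- numeric(30)
for (i in seq_along(step_len)) {
  step <- alpha * gr(x)              # alpha is fixed, but the gradient shrinks
  step_len[i] <- sqrt(sum(step^2))   # Euclidean length of this step
  x <- x - step
}

# Ratio of successive step lengths; it settles at the slow contraction factor.
ratios <- step_len[-1] / step_len[-length(step_len)]
round(tail(ratios, 3), 3)
```

Every step is shorter than the previous one, and after the fast direction has died out the ratio between consecutive step lengths is essentially constant (about $0.85$ for the question's settings), which is the geometric decay described above.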

For a much better and more thorough discussion of this, I strongly recommend https://distill.pub/2017/momentum/.

thanks for detailed answer and great reference!. change the basis of $y$ really helped me. — Haitao Du, Jan 05 '18 at 14:47

score 12 · Answer 2 · answered Jan 04 '18 at 20:15

For a smooth function, $\nabla f=0$ at the local minima.

Because your update is $\alpha \nabla f$, the magnitude $|\nabla f|$ controls the step size. In the case of your quadratic, $|\nabla f|\rightarrow 0$ as you approach the minimum (just compute the Hessian of the quadratic in your case). Note that this doesn't always have to be true. For example, try the same scheme on $f(x)=x$. Then your step size is always $\alpha$, so it never decreases. Or, more interestingly, try $f(x,y)=x+y^2$, where the gradient goes to $0$ in the $y$ coordinate but not in the $x$ coordinate. See Chaconne's answer for the methodology for quadratics.
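The $f(x,y)=x+y^2$ case is easy to check numerically. A minimal sketch in R, where `gr2`, the starting point `c(0, 2)`, the step size, and the iteration count are all choices made here for illustration:

```r
gr2 <- function(p) c(1, 2 * p[2])   # gradient of f(x, y) = x + y^2

alpha <- 0.03
p <- c(0, 2)
step_len <- numeric(200)
for (i in seq_along(step_len)) {
  step <- alpha * gr2(p)
  step_len[i] <- sqrt(sum(step^2))   # Euclidean length of this step
  p <- p - step
}

# The y-part of the gradient dies out, but the x-part stays at 1,
# so the step length levels off at alpha instead of shrinking to 0.
c(first = step_len[1], last = step_len[200])
```

Here the steps shrink at first and then stay at length $\alpha$ forever, since the function has no minimum along $x$.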
