Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.

Gradient descent is based on the observation that if the multi-variable function $F(\mathbf{x})$ is defined and differentiable in a neighborhood of a point $\mathbf{a}$, then $F(\mathbf{x})$ decreases fastest if one goes from $\mathbf{a}$ in the direction of the negative gradient of $F$ at $\mathbf{a}$, $-\nabla F(\mathbf{a})$. It follows that, if

$$\mathbf{b} = \mathbf{a} - \gamma \nabla F(\mathbf{a})$$

for $\gamma$ small enough, then $F(\mathbf{a}) \geq F(\mathbf{b})$. In other words, the term $\gamma \nabla F(\mathbf{a})$ is subtracted from $\mathbf{a}$ because we want to move against the gradient, namely down toward the minimum. With this observation in mind, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$, and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma_n \nabla F(\mathbf{x}_n), \quad n \geq 0.$$

We have

$$F(\mathbf{x}_0) \geq F(\mathbf{x}_1) \geq F(\mathbf{x}_2) \geq \cdots$$

so hopefully the sequence $(\mathbf{x}_n)$ converges to the desired local minimum.

-- Wikipedia
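The iteration above translates almost directly into code. Here is a minimal sketch (the quadratic objective, fixed step size, and iteration count are illustrative choices, not part of the definition):

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - gamma * grad_F(x_n) for a fixed step size gamma."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_F(x)
    return x

# Example: F(x) = ||x||^2 has gradient 2x and a unique minimum at the origin.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0], gamma=0.1, n_steps=100)
```

With this convex objective each step scales the iterate by $(1 - 2\gamma)$, so the sequence contracts toward the origin; on non-convex functions the same loop may instead settle at any local minimum (or a saddle point), which is exactly what many of the questions below ask about.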

988 questions
12
votes
2 answers

Why use matrix transpose in gradient descent?

I just don't understand why we use the matrix transpose, instead of the matrix inverse, to calculate the weight delta in gradient descent, as described in http://cs231n.github.io/optimization-2/#mat. # forward pass W = np.random.randn(5, 10) X =…
8
votes
1 answer

Gradient descent: Shouldn't step size be proportional to inverse of gradient of residual?

It has been decades since I coded up any type of gradient descent algorithm to drive a function to zero (or to a minimum). I am following this tutorial, which minimizes $J(\overrightarrow{\theta})$. It all seems straightforward except for one…
8
votes
3 answers

Why doesn't gradient descent terminate at a saddle point?

I always read that gradient descent terminates at a local minimum. Isn't it possible that it will terminate at a saddle point? I know it can't terminate at a maximum unless the iteration begins there. However, I fail to see why it would fail to terminate…
Minaj
  • 1,421
4
votes
1 answer

Gradient Descent and the No-Free-Lunch Theorem

I'm doing a presentation on the No-Free-Lunch theorem, since I've found that the best way to learn about a topic is to try and teach it...In order to get an idea what I'm talking about, I started reading the Wolpert & Macready Paper. In this paper,…
JacKeown
  • 718
  • 1
  • 7
  • 20
4
votes
1 answer

Gradient descent with random kicks on non-convex functions

I've recently learned gradient descent, and it clearly gets stuck at local minima when applied to non-convex functions. Can't we just randomly kick the values in between steps when iterating? Kind of like quantum tunneling. That would drastically…
A__A
  • 43
3
votes
3 answers

Gradient descent for logistic regression partial derivative doubt

I'm a software engineer, and I have just started a Udacity's nanodegree of deep learning. I have also worked my way through Stanford professor Andrew Ng's online course on machine learning and now I'm comparing. I have a big doubt about Gradient…
3
votes
2 answers

For gradient descent, does there always exist a step size such that the training error you are trying to minimize never increases?

Consider running gradient descent to minimize ERM/training error (empirical risk) and denote it with $ J(\theta, X, y)$ for data $X$,$y$ and parameter/function $ \theta $. Recall gradient descent: $$ \theta^{(t)} := \theta^{(t-1)} - \gamma_{t-1}…
2
votes
2 answers

Is there a "generic" gradient descent?

This week we learned that the general form of the update step for gradient descent is: $x := x - \alpha \frac{\partial f}{\partial x}$ So, in order to find the x where f is minimal, we have to calculate the derivative df/dx. Along this course, we…
2
votes
1 answer

Deciding Initial Weights for Gradient Descent Algorithm

Having done an interesting Coursera module on Linear Regression, and being provided with the starting values for Gradient Descent, I am wondering about some things that were not touched upon and for which I could not find the answer: What is…
thebluephantom
  • 245
  • 2
  • 9
2
votes
1 answer

Gradient descent for an objective function

I am reading the Wikipedia article on gradient descent for a non-linear system, just to get a better understanding of how it works under the hood. I can follow everything up until this line: I understand the definition of a Jacobian, but what I don't understand…
YellowPillow
  • 1,251
1
vote
0 answers

Gradient Descent Algorithm: For multiple local minima, which one to pick?

This might be a newbie question, but it is from a newbie. If there are multiple local minima, and the function converges at various local minima, which local minimum do we pick for optimization? Do we keep calculating multiple local minima to find the…
Rahul
  • 111
1
vote
0 answers

What is the range of the learning rate for which gradient descent converges (and does not converge) for this loss?

Range in terms of x and y. As you might be aware, gradient descent uses alpha = learning rate: W* = W - alpha*(dL/dW), where x, y are constant
1
vote
1 answer

Does the cost function of a one-hidden-layer perceptron have only one global minimum?

Is it true that: after training a multilayer perceptron with one hidden layer using gradient descent, we always get the same decision boundary regardless of the initialization point?
user42493
  • 125
1
vote
2 answers

Some confusion about gradient descent

In logistic regression we use gradient descent to minimize the error function. For example: we have the error function e(x) = hθ(x) - y. To get an appropriate θ that minimizes error(x), we use gradient descent. To simplify my question, let's assume θ, x,…
Kramer Li
  • 113
1
vote
1 answer

What determines the cost function / how to find the cost function to use

I'm starting to learn a bit about gradient descent, and that method attempts to minimize the cost function. I've seen examples using linear regression, and the corresponding cost function, but now I'm wondering, given an arbitrary function $g(x)$…
atomsmasher
  • 330
  • 2
  • 14