
I am presented with a data set on which I am supposed to perform linear regression using SGD. My first instinct would be to train on each data point, one at a time, until I reach the last one; only then would I get my estimate $\hat{y}$.

I understand there are some drawbacks to this:

  1. It will take a long time to finish, since you have to make an update for every single data point.
  2. Convergence of the parameters is still not guaranteed.

Thus, the idea of batching comes to mind. For example, I have a set of 100 data points, and I have decided to group them into 25 batches (of 4 data points each).

My questions are:

  1. How does this batching work? Do I randomly pick one data point to train on from each batch? Meaning, at the end of the first run, I would have one estimate.
  2. Is it possible that I will have 4 different estimates after having 4 different runs? Should I choose whichever gives the smallest error?
cgo

1 Answer


First, let’s clarify the terminology: stochastic gradient descent means making an update one sample at a time; if you use small batches, it is usually called mini-batch gradient descent; and if you compute each update on all the data at once, it is plain (batch) gradient descent.
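To make the distinction concrete, here is a sketch of the usual update rules; the only difference is how many points enter each gradient ($\theta$ denotes the parameters, $\eta$ the learning rate, and $\ell_i$ the loss on point $i$; this notation is an assumption on my part, not from the question):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \ell_i(\theta) \qquad \text{(SGD: one point at a time)}$$

$$\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta) \qquad \text{(mini-batch: one batch } B \text{ at a time)}$$

$$\theta \leftarrow \theta - \eta \, \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell_i(\theta) \qquad \text{(full gradient descent: all } n \text{ points)}$$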

  1. How does this batching work? Do I randomly pick one data point to train from each batch? Meaning, after the end of first run, I will have one estimate.

You make an update using all the data points in a batch, in the same way that in full gradient descent you would use all of your data for each update.
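As a rough sketch of what this could look like for linear regression with squared-error loss (NumPy; the function name, learning rate, and batch size are illustrative assumptions, not from the question):

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=4, lr=0.01, n_epochs=100, seed=0):
    """Fit y ~ X @ w + b with mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)              # random initial parameters
    b = 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)      # reshuffle, then split into batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            err = Xb @ w + b - yb                 # residuals on this batch
            grad_w = 2 * Xb.T @ err / len(idx)    # gradient of the batch MSE
            grad_b = 2 * err.mean()
            w -= lr * grad_w                      # one update per batch
            b -= lr * grad_b
    return w, b
```

So with 100 points in batches of 4, every epoch makes 25 updates, each based on the 4 points in the current batch.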

  2. Is it possible that I will have 4 different estimates after having 4 different runs? Should I choose whichever gives the smallest error?

You are splitting the data into batches at random, so yes, the results may differ between training runs. If you train long enough, they should converge to similar values. Moreover, with gradient descent you usually initialize the parameters randomly, so for this reason alone you could get different results, even without batches.
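As an illustration, using the `minibatch_sgd` sketch above on hypothetical data, two runs that differ only in the random seed (and hence in the shuffling and the initial parameters) should end up with nearly the same coefficients:

```python
# Hypothetical data: 100 points with true relationship y = 3*x + 1 plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=100)

w1, b1 = minibatch_sgd(X, y, seed=1)  # two runs differing only in shuffling
w2, b2 = minibatch_sgd(X, y, seed=2)  # and in the random initialization
print(w1, b1, w2, b2)                 # both should land near w = 3, b = 1
```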

Tim