
Suppose our hypothesis space is $$\mathcal{H}=\{f:f(x)=f_\theta (x), \theta\in \Theta\},$$ where $\theta$ denotes the trainable parameters.

Suppose we have a dataset $\{(x_i,y_i)\}_{i=1}^N.$

In the notes from my professor, he defines the empirical risk as $\,\Phi(\theta)=\frac{1}{N} \sum_{i=1}^N L(f_\theta(x_i),y_i)$, and empirical risk minimization as the problem of minimizing it over $\theta$.
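For concreteness, here is a minimal sketch of $\Phi(\theta)$, assuming a squared loss $L(\hat y, y)=(\hat y - y)^2$ and a scalar linear model $f_\theta(x)=\theta x$ (both choices are illustrative, not from the notes):

```python
import numpy as np

def empirical_risk(theta, xs, ys):
    # Phi(theta) = (1/N) * sum_i L(f_theta(x_i), y_i), with squared loss
    preds = theta * xs                 # assumed model: f_theta(x) = theta * x
    return np.mean((preds - ys) ** 2)

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])         # generated by y = 2x
print(empirical_risk(2.0, xs, ys))     # theta = 2 fits exactly, so prints 0.0
```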

Now we can evaluate $\nabla \Phi(\theta)$, which is a function of $x_i,y_i,\theta$.

Suppose we have an initialization $\theta_0$ and we perform gradient descent with some fixed learning rate $\eta$.

We repeatedly update the iterate via $\theta \leftarrow \theta - \eta \nabla\Phi(\theta)$, starting from $\theta_0$, until it converges (suppose it does).
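The update rule above can be sketched as follows, again assuming the squared loss and linear model $f_\theta(x)=\theta x$ (so `grad_phi` is the hand-derived gradient for that particular choice):

```python
import numpy as np

def grad_phi(theta, xs, ys):
    # d/dtheta of (1/N) sum (theta*x_i - y_i)^2 = (2/N) sum (theta*x_i - y_i) * x_i
    return 2.0 * np.mean((theta * xs - ys) * xs)

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.0])   # generated by y = 2x
theta, eta = 0.0, 0.05           # initialization theta_0 and learning rate
for _ in range(200):             # fixed number of updates for simplicity
    theta = theta - eta * grad_phi(theta, xs, ys)
print(theta)                     # converges toward 2.0
```

Note that every one of the 200 updates touches the full dataset `xs, ys`, which is exactly the situation the question asks about.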


My question then arises.

For each update, we need to use the entire dataset $\{x_i,y_i\}_{i=1}^N$ and we need to update many times until $\theta_0$ converges.

So, do we just keep reusing the dataset for all these updates, and call the dataset used for each update an epoch?

What I need is confirmation that the dataset used for each update is called an epoch. Thanks.

Sam Wong

1 Answer


Yes, on each epoch you are using the same dataset. Gradient descent basically runs in a for-loop. In Julia-like pseudocode, it would be something like the following

for epoch in 1:n_epochs
   theta = update(theta, data)
end

There is also mini-batch gradient descent, where each epoch has an inner loop that iterates over batches of the dataset

for epoch in 1:n_epochs
   for batch in split_to_batches(data)
      theta = update(theta, batch)
   end
end

When the batch size is 1, we call it stochastic gradient descent.
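A runnable Python version of the batched loop above might look like this; the `update` and `split_to_batches` bodies are illustrative, assuming the same hypothetical squared-loss, linear-model setup as in the question (set `batch_size=1` to get stochastic gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

def update(theta, xs, ys, eta=0.05):
    # one gradient step on the given (mini-)batch, squared loss, f_theta(x) = theta*x
    grad = 2.0 * np.mean((theta * xs - ys) * xs)
    return theta - eta * grad

def split_to_batches(xs, ys, batch_size):
    # shuffle each epoch (common practice), then yield consecutive batches
    idx = rng.permutation(len(xs))
    for start in range(0, len(xs), batch_size):
        sel = idx[start:start + batch_size]
        yield xs[sel], ys[sel]

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs                        # data generated by y = 2x
theta = 0.0
for epoch in range(100):
    for bx, by in split_to_batches(xs, ys, batch_size=1):  # batch_size=1 -> SGD
        theta = update(theta, bx, by)
print(theta)                         # approaches 2.0
```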

Tim