I am working on a DQN and have confused myself about the difference between an epoch and an episode. I have gone through this answer, but it only increased my confusion.
I will explain my scenario below and would appreciate it if someone could explain the difference in that context.
Here is my scenario:
I have written a DQN Agent with the following properties:
import random
from collections import deque

import numpy as np
from torch import nn
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQNAgent(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)  # replay memory
        self.gamma = 0.95                 # discount rate
        self.epsilon = 1.0                # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural net for the Deep Q-learning model
        model = Sequential()
        model.add(Dense(24, input_dim=4, activation='relu'))  # input_dim = state size
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        print(model.summary())
        return model

    def memorize(self, state, action, reward, next_state, done):
        # store one transition in the replay memory
        self.memory.append((state, action, reward, next_state, done))
        print(f" action {action} reward {reward}")

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns the greedy action

    def replay(self, batch_size):
        # sample a minibatch of stored transitions and fit the model on each one
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
Here is how I am calling the environment (the step function of the gym.Env class):

for episode in range(100):
    for time in range(500):
        # ... call the step function of the gym.Env class here ...
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

with memory_size = 2000 and batch_size = 1024.
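For context, a fleshed-out version of this loop would look roughly like the sketch below. This is only a sketch of my setup, not the exact code: it assumes the classic gym API (reset() returns just the state, step() returns a 4-tuple), a CartPole-style environment, and the usual np.reshape to (1, state_size) that Keras' predict/fit expects.

import gym
import numpy as np

env = gym.make('CartPole-v1')                  # assumed environment, for illustration
state_size = env.observation_space.shape[0]    # 4 for CartPole
action_size = env.action_space.n               # 2 for CartPole
agent = DQNAgent(state_size, action_size)
batch_size = 1024

for episode in range(100):                     # 100 episodes
    state = np.reshape(env.reset(), [1, state_size])
    for time in range(500):                    # at most 500 time steps per episode
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)   # one environment step
        next_state = np.reshape(next_state, [1, state_size])
        agent.memorize(state, action, reward, next_state, done)
        state = next_state
        if done:                               # the episode can end before 500 steps
            break
        if len(agent.memory) > batch_size:     # replay starts once enough transitions are stored
            agent.replay(batch_size)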
I am confused about the terms epoch, time step, and episode in my scenario. What I currently understand is this:

1 episode = 500 time steps. So 1 epoch will train on the data stored in the replay memory after episode 1 completes. When episode 2 completes, it will train on all the data stored in the memory up to that point, and it will keep iterating until the forward and backward propagation over that data has finished?

Am I getting this right? I am just confused about the episode, time step, and epoch concepts here.
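For reference, unrolling my own code, the three terms show up in these places (the comments just describe what the code above actually does):

for episode in range(100):              # "episode": one outer iteration, i.e. up to 500 env steps
    for time in range(500):             # "time": one step of the environment inside an episode
        # ... act, step the environment, memorize ...
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)    # inside replay(): model.fit(state, target_f, epochs=1),
                                        # so "epochs" is the Keras argument: one training pass
                                        # over a single (state, target_f) pair per fit call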