
I am working on DQN and have confused myself about epochs and episodes. I have gone through this answer, but it has only increased my confusion.

I will explain my scenario, and I would like to understand the difference through it if anyone could guide me.

Here is my scenario:
I have written a DQN Agent with the following properties:

import random
from collections import deque

import numpy as np
from torch import nn                      # only used for the nn.Module base class
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQNAgent(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)  # replay memory
        self.gamma = 0.95                 # discount rate
        self.epsilon = 1.0                # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural net for the Deep Q-learning model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse',
                      optimizer=Adam(lr=self.learning_rate))
        print(model.summary())
        return model

    def memorize(self, state, action, reward, next_state, done):
        # store one transition in the replay memory
        self.memory.append((state, action, reward, next_state, done))
        print(f"action {action} reward {reward}")

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)

        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

Here is how I am calling the environment:

for episode in range(100):
    for time in range(500):
        # call the step function of the gym.Env class here
        ...
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)

memory_size=2000 batch_size=1024

I am confused by epoch, time, and episode in my scenario. My current understanding is as follows:

1 episode = 500 timesteps. So, after episode 1 completes, 1 epoch will train on the data stored in the replay memory.

When episode 2 completes, it will train on all the data stored in the memory at that point. And it will keep iterating until the forward and backward propagation have finished on that data?

Am I getting this right? I am just confused by the concepts of episode, time, and epoch here.

JAMSHAID

1 Answer


Time (timestep) is a concept from Markov Decision Processes. It is how the evolution of an MDP is indexed; for example, the individual steps in a game.

Episode is also a concept from Markov Decision Processes. It is one run of a given MDP with a given policy; for example, one complete game, ending when the game is over. In your code, you restrict the length of each episode to 500 timesteps.
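To make these two terms concrete, here is a minimal sketch of a single episode, assuming a CartPole-style environment and the classic gym step/reset API; the random policy is only a placeholder:

import gym

env = gym.make("CartPole-v1")

state = env.reset()                      # start of ONE episode
for t in range(500):                     # t indexes the timesteps of this episode
    action = env.action_space.sample()   # placeholder policy
    next_state, reward, done, info = env.step(action)   # one timestep of the MDP
    state = next_state
    if done:                             # the episode ends here (or at the 500-step cap)
        break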

Epoch is a concept from Deep Learning.

The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset. (https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/)
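As a toy illustration of that definition in plain supervised Keras training (the data below is made up purely for the example), epochs controls how many full passes fit makes over the training set:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# toy dataset: 1000 samples, 4 features, 2 targets (made up for illustration)
X = np.random.rand(1000, 4)
y = np.random.rand(1000, 2)

model = Sequential([Dense(24, input_dim=4, activation='relu'),
                    Dense(2, activation='linear')])
model.compile(loss='mse', optimizer='adam')

# epochs=5: the optimizer sweeps through all 1000 samples five times
model.fit(X, y, epochs=5, batch_size=32, verbose=0)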

The fact that epochs=1 in your code means that the weights are updated with the given data only once per call to fit. In the RL context this makes sense because of the online nature of your learning.
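Putting the three terms side by side in your own training loop (a rough sketch only; env, agent, and the constants are taken from your snippets, and the reshapes assume the classic gym API with a 4-dimensional state):

import numpy as np

state_size, batch_size = 4, 1024

for episode in range(100):                           # one iteration = one EPISODE
    state = np.reshape(env.reset(), [1, state_size])
    for time in range(500):                          # one iteration = one TIMESTEP
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        agent.memorize(state, action, reward, next_state, done)
        state = next_state
        if done:                                     # the episode may end before 500 steps
            break
    if len(agent.memory) > batch_size:
        # every model.fit(..., epochs=1) inside replay() is one EPOCH:
        # a single forward/backward pass over that one sampled transition
        agent.replay(batch_size)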

Karel Macek