
I'm relatively new to the area. I have run into the term "rollout" several times in the context of training neural networks. I have been searching for a while but am still not sure what it means.

For example, the number of rollouts for running the Hopper environment. I'm not sure what that means.

Ali

2 Answers


The standard use of “rollout” (also called a “playout”) is in regard to an execution of a policy from the current state when there is some uncertainty about the next state or outcome - it is one simulation from your current state. The purpose is for an agent to evaluate many possible next actions in order to find an action that will maximize value (long-term expected reward).

Uncertainty in the next state can arise from different sources depending on your domain. In games the uncertainty is typically from your opponent (you are not certain what move they will make next) or a chance element (e.g. a dice roll). In robotics, you may be modeling uncertainty in your environment (e.g. your perception system gives inaccurate pose estimates, so you are not sure an object is where you think it is) or your robot (e.g. noisy sensors result in unreliable transition dynamics).
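For concreteness, here is a minimal Python sketch of that idea. It is not from any particular library: the `env.simulate_step(state, action)` interface, the `policy` callable, and the helper names are all hypothetical, and rewards are left undiscounted for brevity.

```python
import numpy as np

def rollout(env, policy, state, max_steps=500):
    """Simulate one episode (a rollout / playout) starting from `state`,
    following `policy` at every step; return the total reward collected."""
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        # Hypothetical simulator interface: step forward from an explicit state.
        state, reward, done = env.simulate_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

def choose_action(env, policy, state, candidate_actions, n_rollouts=50):
    """Score each candidate action by averaging the returns of many rollouts
    that start with that action, then pick the highest-scoring one."""
    values = []
    for action in candidate_actions:
        returns = []
        for _ in range(n_rollouts):
            next_state, reward, done = env.simulate_step(state, action)
            ret = reward if done else reward + rollout(env, policy, next_state)
            returns.append(ret)
        values.append(np.mean(returns))
    return candidate_actions[int(np.argmax(values))]
```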

I think the term comes from Tesauro and Galperin (NIPS 1997), in which they consider Monte Carlo simulations of backgammon where a playout samples a sequence of dice rolls:

In backgammon parlance, the expected value of a position is known as the "equity" of the position, and estimating the equity by Monte-Carlo sampling is known as performing a "rollout." This involves playing the position out to completion many times with different random dice sequences, using a fixed policy P to make move decisions for both sides.
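A rollout-based equity estimate along those lines might look like the following sketch, where `simulate_game` is a hypothetical helper that plays the position to completion with fresh random dice and the same fixed policy for both sides:

```python
def estimate_equity(position, fixed_policy, simulate_game, n_rollouts=1000):
    """Monte Carlo estimate of a position's value ("equity"): play the position
    out to completion many times with different random dice sequences, always
    using the same fixed policy for both sides, and average the outcomes."""
    total = 0.0
    for _ in range(n_rollouts):
        total += simulate_game(position, fixed_policy)  # e.g. +1 for a win, -1 for a loss
    return total / n_rollouts
```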

As you can imagine, this can be quite computationally expensive, and a lot of tricks have been developed to make it faster and more efficient. For example, AlphaGo uses a much simpler classifier for rollouts than its supervised-learning policy network. The resulting rollout policies are considerably less accurate than the supervised-learning policies, but they are also considerably faster, so you can very quickly generate a large number of game simulations to evaluate a move.

The Second Edition of Sutton and Barto’s famous textbook on reinforcement learning has a full section just about rollout algorithms (8.10), and also more information on Monte Carlo sampling and Monte Carlo Tree Search (which has a strong simulation component). You can find a draft version here.

adamconkey
  • Thank you so much for your help. Deeply appreciate it. – Ali Oct 26 '18 at 15:05
  • Thanks for this answer. Could one compare a rollout during training to a step in the environment after training? Is it correct to assume a rollout is a bunch of different possible steps, from which the one with the highest reward is being selected and taken? – Philipp May 06 '20 at 15:59
  • What does it mean in the context of off-policy algorithms? For example for TRPO, this article says: "when rollout workers and optimizers are running in parallel asynchronously, the behavior policy can get stale". Can you please explain this sentence a bit? – Mahesha999 Mar 25 '21 at 20:51
  • @Mahesha999 it essentially just means executing your policy to collect some data (usually state, action, reward, next-state tuples). A rollout worker is one simulation executing the policy, and you could have many rollout workers running in parallel in different threads to speed up data collection (see the sketch after these comments). The rollout workers will use a behavior policy, which in TRPO is an "old" version of the parametric policy you're optimizing, i.e. it hasn't yet been updated based on the most recently collected data. – adamconkey Mar 25 '21 at 23:16
Thanks for clarifying. I got confused by Sutton and Barto's use of the term "rollout policy" in the context of MCTS, where it's explained that one first follows the MCTS policy from the root down to a certain depth (the frontier) and from that depth uses the rollout policy (until the leaves or some resource limit, usually time, is reached). So I guess this use of "rollout" is a bit different than for off-policy algorithms, mainly because in MCTS rollouts are not part of any kind of behavioral policy. Am I right about this? (continued...) – Mahesha999 Mar 26 '21 at 07:57
  • (...continued from last comment) If yes, then I guess your first-paragraph definition is more general and hence applies to both of these scenarios (MCTS and off-policy policy gradients): "execution of policy from current state ... to evaluate many possible next actions ...". One more question: is the rollout policy always different (like in MCTS or TRPO) from the actual learned policy? I believe it might depend on the algorithm, but what is the general/frequent case? – Mahesha999 Mar 26 '21 at 08:02
  • @Mahesha999 MCTS still has a behavior policy in a sense, only it's a random policy. Random policies are one of the most simple behavior policies you can have. Regarding your last question, if the behavior policy and policy being learned are the same, then you're into the world of on-policy algorithms. That means every time you do rollouts with your policy to collect more data, you throw out the old data and optimize your policy based only on the newly collected data. – adamconkey Mar 26 '21 at 15:27
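To make the rollout-worker idea from the comments above concrete, here is a minimal sketch of one worker collecting transitions by executing a (possibly stale) behavior policy. It assumes the classic Gym-style `reset()`/`step()` interface; everything else is hypothetical.

```python
def rollout_worker(env, behavior_policy, n_steps=2048):
    """Run the behavior policy in the environment and return a batch of
    (state, action, reward, next_state, done) transitions for the optimizer.
    Several such workers can run in parallel to speed up data collection."""
    transitions = []
    state = env.reset()
    for _ in range(n_steps):
        action = behavior_policy(state)
        next_state, reward, done, _info = env.step(action)  # classic 4-tuple Gym API
        transitions.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
    return transitions
```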

The definition of "rollouts" given by Planning chemical syntheses with deep neural networks and symbolic AI (Segler, Preuss & Waller; doi: 10.1038/nature25978; credit to jsotola):

Rollouts are Monte Carlo simulations, in which random search steps are performed without branching until a solution has been found or a maximum depth is reached. These random steps can be sampled from machine-learned policies p(a|s), which predict the probability of taking the move (applying the transformation) a in position s, and are trained to predict the winning move by using human games or self-play.
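As a rough sketch of that procedure (the `learned_policy`, `apply_move`, and `is_solution` helpers are hypothetical placeholders), a single unbranched rollout that samples moves from p(a|s) could look like this:

```python
import random

def policy_rollout(state, learned_policy, apply_move, is_solution, max_depth=50):
    """One unbranched Monte Carlo rollout: repeatedly sample a move from the
    learned policy p(a|s) until a solution is found or the depth limit is hit."""
    for depth in range(max_depth):
        if is_solution(state):
            return state, depth                      # solved at this depth
        moves, probs = learned_policy(state)         # p(a|s) over available moves
        move = random.choices(moves, weights=probs, k=1)[0]
        state = apply_move(state, move)              # apply the move/transformation
    return None, max_depth                           # no solution within the limit
```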

Sun Haozhe