19

I am unable to understand Thompson Sampling and how it works. I was reading about the Multi-Armed Bandit problem, and after reading about the Upper Confidence Bound algorithm, many texts suggested that Thompson Sampling performs better than UCB. What is Thompson Sampling, in layman's or simple terms?

Feel free to provide reference articles for further understanding.

dejavu

2 Answers

16

I am going to try to give an explanation without any mathematics. Part of this answer is repeated from some points I made in an answer to another question on MAB problems.


The strategic trade-off in multi-arm bandit problems: In multi-arm bandit problems the gambler plays one "bandit" each round and attempts to maximise his total expected return over a given number of rounds. The expected return of each of the bandits is described by some unknown parameters in the problem, and so as we observe more outcomes in each round, we get more information about these unknown parameters, and hence, about the expected return of each of the bandits. In each round of play (except the last), the MAB problem involves a strategic trade-off by the gambler between two objectives:

  • Immediate rewards: In each round he would like to choose a distribution that gives him a high expected reward on this round, which entails a preference for distributions he (presently) infers to have a high mean reward;

  • Future rewards (affected by information gain): On the other hand, he wants to refine his knowledge of the true expected rewards by gaining more information on the distributions (especially those that he has not played as much as others), so that he can improve his choices in future rounds.

The relative importance of these two things will determine the trade-off, and this relative importance is affected by a number of factors. For example, if there is only a small number of remaining rounds in the problem then inference for future trials is relatively less valuable, whereas if there is a large number of remaining rounds then inference for future rewards is relatively more valuable. So the gambler needs to consider how much he wants to focus on maximising the immediate rewards in the current round, and how much he wants to deviate from this, to learn more about the unknown parameters that determine the expected reward of each of the bandits.


Thompson sampling: The basic idea of Thompson sampling is that in each round, we take our existing knowledge of the machines, which is in the form of a posterior belief about the unknown parameters, and we "sample" the parameters from this posterior distribution. This sampled parameter yields an expected reward for each machine, and we now bet on the machine with the highest expected return under that sampled parameter.
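To make this concrete, here is a minimal sketch of the scheme under one common set of assumptions: Bernoulli rewards, a Beta posterior for each machine's success probability, and made-up "true" probabilities that the gambler never sees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: each machine pays 1 with an unknown probability.
true_probs = [0.3, 0.5, 0.6]          # unknown to the gambler
n_arms = len(true_probs)

# Posterior over each machine's success probability: Beta(alpha, beta),
# starting from a uniform Beta(1, 1) prior.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)

for t in range(1000):
    # "Sample" a parameter for each machine from its posterior ...
    sampled_theta = rng.beta(alpha, beta)
    # ... and bet on the machine with the highest expected return
    # under that sampled parameter.
    arm = int(np.argmax(sampled_theta))
    reward = rng.binomial(1, true_probs[arm])
    # Update the posterior of the machine we played.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```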

Prima facie, the Thompson sampling scheme seems to involve an attempt to maximise the immediate expected return in each round (since it involves this maximisation step after sampling the parameter). However, because it involves random sampling of the parameter from the posterior, the scheme involves an implicit trade-off between maximising the present reward and searching for more information. Most of the time we will get a parameter "sample" that is somewhere in the main part of the posterior, and the choice of machine will roughly approximate maximisation of the immediate reward. However, sometimes we will randomly sample a parameter value that is far in the tails of the posterior distribution, and in that case we will end up choosing a machine that does not maximise the immediate reward - i.e., this will constitute more of a "search" to assist with future rewards.

The Thompson scheme also has the nice property that we tend to decrease our "search" as we get more information, and this mimics the desirable strategic trade-off in the problem, where we want to focus less on searching as we obtain more information. As we play more and more rounds and get more and more data, the posterior converges closer to the true parameter values, and so the random "sampling" in the Thompson scheme becomes more tightly packed around the parameter values that maximise the immediate reward. Hence, there is an implicit tendency for this scheme to be more "search-oriented" early on, when there is little information, and less "search-oriented" later on, when there is a lot of data.
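As a small numerical illustration of this tightening (assuming the same Beta posteriors as in the sketch above, with made-up counts): samples spread out widely after a few observations but cluster near the posterior mean after many.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same posterior mean (0.5), but based on little vs. a lot of data.
early = rng.beta(2, 2, size=10_000)      # after roughly 1 win, 1 loss
late = rng.beta(200, 200, size=10_000)   # after roughly 200 wins, 200 losses

print("early spread (std):", early.std())   # roughly 0.22
print("late spread (std): ", late.std())    # roughly 0.025
```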

Now, having said this, one clear drawback of the Thompson sampling scheme is that it does not take into account the number of rounds remaining in the MAB problem. This scheme is sometimes formulated on the basis of a game with infinite rounds, and in this case that is not an issue. However, in MAB problems with finite rounds, it is preferable to take account of the number of remaining rounds in order to decrease the "search" as the number of future rounds decreases. (And in particular, the optimal play in the last round is to ignore searches completely and just bet on the bandit with the highest posterior expected return.) The Thompson scheme does not do this, so it will play finite-round games in a way that is clearly sub-optimal in certain cases.
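For instance, continuing the Beta-Bernoulli sketch above with made-up posterior counts, the optimal last-round play would skip the sampling step entirely and just compare posterior means:

```python
import numpy as np

# Hypothetical posterior counts after many rounds of play.
alpha = np.array([10.0, 40.0, 55.0])
beta = np.array([30.0, 45.0, 50.0])

# Last round: further information has no value, so ignore "search"
# and bet on the bandit with the highest posterior expected return.
posterior_means = alpha / (alpha + beta)
last_round_arm = int(np.argmax(posterior_means))
print("bet on arm", last_round_arm,
      "with posterior mean", posterior_means[last_round_arm])
```

Thompson sampling, by contrast, would still draw a random sample here and could bet on a worse arm even though there are no future rounds left to benefit from the extra information.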

Ben
    I wish I could give this response multiple thumbs up. I would probably add how I would update the posteriors - for example if the posteriors were represented as normal distributions - how are the updates for the mean and standard deviation of the posteriors calculated. I say this because I don't know myself – piccolo Dec 01 '18 at 23:00
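On the update the comment asks about, a minimal sketch under one standard set of assumptions: each machine's reward is Normal with a known observation standard deviation, and the unknown mean gets a conjugate Normal prior, so the posterior mean and standard deviation update in closed form. The helper name and the numbers below are made up for illustration.

```python
import numpy as np

def update_normal_posterior(prior_mean, prior_std, rewards, obs_std):
    """Conjugate update for the mean of a Normal reward model,
    assuming a known observation standard deviation obs_std."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    prior_prec = 1.0 / prior_std**2          # precision = 1 / variance
    obs_prec = n / obs_std**2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * rewards.mean()) / post_prec
    post_std = np.sqrt(1.0 / post_prec)
    return post_mean, post_std

# Vague prior, five observed rewards, observation noise std of 1.
print(update_normal_posterior(0.0, 10.0, [1.2, 0.8, 1.1, 0.9, 1.0], 1.0))
```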
6

I will give it a shot and I hope you like it! There are some formulas below which might scare you off. I hope not, because I will do my best to explain them in the simplest way I can.

These are the two formulas:

  • The likelihood: $P(r|\theta,a,x)$
  • And the posterior: $P(\theta|D)$

TL;DR

Thompson Sampling lets you

  1. Choose a random model parameter from all the model parameters that you think are possible.
  2. Act once according to that particular model parameter.
  3. Observe the reward you get with that particular model parameter.
  4. Learn from this new experience and update your belief about the possible model parameters.
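Here is a minimal sketch of those four steps in code, under simple illustrative assumptions: two contexts, two actions, Bernoulli rewards, and a Beta belief over each (context, action) success probability. All names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

contexts = ["raining", "dry"]
actions = ["umbrella", "no umbrella"]

# Unknown truth: probability of a good outcome for each (context, action).
true_theta = {("raining", "umbrella"): 0.8, ("raining", "no umbrella"): 0.1,
              ("dry", "umbrella"): 0.4, ("dry", "no umbrella"): 0.9}

# Belief about each theta, stored as Beta(alpha, beta) counts.
belief = {key: [1.0, 1.0] for key in true_theta}

for t in range(500):
    x = contexts[rng.integers(len(contexts))]               # observe the context
    # 1. Choose a random model parameter from your current belief.
    sampled = {a: rng.beta(*belief[(x, a)]) for a in actions}
    # 2. Act according to that particular sampled parameter.
    a = max(sampled, key=sampled.get)
    # 3. Observe the reward.
    r = rng.binomial(1, true_theta[(x, a)])
    # 4. Learn: update your belief about the possible model parameters.
    belief[(x, a)][0] += r
    belief[(x, a)][1] += 1 - r

for key, (al, be) in belief.items():
    print(key, "posterior mean:", round(al / (al + be), 2))
```

After enough rounds the belief concentrates on the good action for each context, which is the whole point of steps 1-4.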

Likelihood??

The likelihood is something that defines how likely things are. In this case the likelihood says how likely it is that we get reward $r$ if we play action $a$ in context $x$. For example, if it is raining (context!) and you take an umbrella (action!) you stay dry (reward! :) ). On the other hand, if it is not raining (context!) and you take an umbrella (action!) you have to carry extra weight (negative reward! :( ). So the likelihood is the central thing that you want to understand. If you know everything about the likelihood, it is easy to act optimally.

What about that strange circle??

As you might have noticed, I did not write anything about that strange circle $\theta$, which is called theta. (Mathematicians have a habit of indicating which parts are the hardest by giving them Greek letters, making it even harder to understand.) This $\theta$ represents the model parameters. These parameters are used when the relationship between the context+actions and the reward is more complicated. For example, a model parameter might be how much your reward goes down if 1mm of rain falls on top of your head. Another model parameter might state how much your reward goes down if you take an umbrella. I just said that the likelihood is the central thing you want to understand; and central to the likelihood are the model parameters. If you know the model parameters $\theta$, you know how context+actions relate to reward and it is easy to act optimally.
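To make the likelihood $P(r|\theta,a,x)$ and the role of $\theta$ concrete, here is a hypothetical reward model in code. It assumes the reward is Normally distributed around an expected value determined by two made-up parameters: how much reward drops per mm of rain on your head, and how much carrying an umbrella costs.

```python
from scipy.stats import norm

def expected_reward(theta, action, context):
    """How context + action map to expected reward, given the parameters theta."""
    if action == "umbrella":
        return -theta["umbrella_carry_cost"]                       # extra weight
    return -theta["unhappiness_per_mm_rain"] * context["rain_mm"]  # getting wet

def likelihood(r, theta, action, context, noise_std=1.0):
    """P(r | theta, a, x): here assumed Normal around the expected reward."""
    return norm.pdf(r, loc=expected_reward(theta, action, context), scale=noise_std)

# One particular setting of the model parameters theta.
theta = {"unhappiness_per_mm_rain": 0.5, "umbrella_carry_cost": 1.0}
context = {"rain_mm": 4.0}

print(likelihood(-1.0, theta, "umbrella", context))      # plausible: you carried it
print(likelihood(-1.0, theta, "no umbrella", context))   # less plausible: you would be wetter
```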

So how do we get to know these model parameters so that we can get maximum reward??

That is the essential question for the multi-armed bandit problem. Actually, it has two parts. You want to get to know the model parameters precisely by exploring all different kinds of actions in different contexts. But if you already know which action is good for a specific context, you want to exploit that action and get as much reward as possible. So if you are uncertain about your model parameters $\theta$, you might want to do some extra exploration. If you are pretty sure about your model parameters $\theta$, you are also pretty sure which action to take. This is known as the exploration versus exploitation trade-off.

You haven't said anything about this posterior

Key to this optimal behaviour is your (un)certainty about the model parameters $\theta$. And the posterior says exactly that: given all the previous rewards we got from previous actions in previous contexts, how much do you know about $\theta$? For example, if you have never been outside, you do not know how unhappy you become when rain falls on your head. In other words, you are very uncertain about the unhappiness-when-rain-on-head model parameter. If you have been out in the rain a few times, with and without an umbrella, you can start to learn something about this obscure model parameter.

Now what does Thompson Sampling suggest to do with all these uncertainties??

Thompson Sampling suggests something very simple: just pick a random model parameter from your posterior, take an action and observe what happens. For example, when you have never been outside before, the unhappiness-when-rain-on-head parameter can be anything. So we just pick one; we assume that we get really unhappy when rain falls on our head. We see that it is raining (context), so we take an umbrella (action) because our model parameter tells us that this is how we can get the maximum reward. And indeed, you observe that you get slightly grumpy from walking in the rain with an umbrella, but not really unhappy. We learn from this that rain+umbrella is grumpy. The next time it rains, you again pick a random belief about what happens when rain falls on your head. This time it might be that it doesn't bother you at all. However, once you are halfway to your destination you are wringing wet, and you learn that rain without an umbrella is really, really bad. This reduces your uncertainty about unhappiness-when-rain-on-head, because now you know it is probably high.

This sounds so simple!!

Yep, it is not that complex. The difficult part is sampling from a model parameter posterior. Getting and maintaining a distribution over all your model parameters that is also appropriate for your specific problem is hard. But... it is definitely doable :).

Pieter