I will give it a shot and I hope you like it! There are some formulas below that might scare you off. I hope they don't, because I will do my best to explain them in the simplest way I can.
These are the two formulas:
- The likelihood: $P(r|\theta,a,x)$
- And the posterior: $P(\theta|D)$
TL;DR
Thompson Sampling lets you (there is a short code sketch after this list):
- Choose a random model parameter from all the model parameters that you think are possible.
- Act once according to that particular model parameter.
- Observe the reward you get with that particular model parameter.
- Learn from this new experience and update your belief about the possible model parameters.
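To make those four steps concrete, here is a minimal sketch in Python of Thompson Sampling on a two-action bandit where the only model parameter per action is its unknown chance of giving a reward. The true reward probabilities and the Beta prior are assumptions I made up for illustration; they are not fixed by the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem: two actions with unknown reward probabilities (the "thetas").
true_reward_prob = [0.3, 0.6]

# Beta(1, 1) prior over each action's unknown reward probability.
alpha = np.ones(2)
beta = np.ones(2)

for t in range(1000):
    # 1. Choose a random model parameter from your current belief (the posterior).
    theta_sample = rng.beta(alpha, beta)
    # 2. Act once according to that particular model parameter.
    action = int(np.argmax(theta_sample))
    # 3. Observe the reward.
    reward = int(rng.random() < true_reward_prob[action])
    # 4. Learn: update your belief about that action's parameter.
    alpha[action] += reward
    beta[action] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```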
Likelihood??
The likelihood is something that defines how likely things are. In this case the likelihood says how likely it is that we get reward $r$ if we play action $a$ in context $x$. For example, if it is raining (context!) and you take an umbrella (action!) you stay dry (reward! :) ). On the other hand, if it is not raining (context!) and you take an umbrella (action!) you have to carry extra weight (negative reward! :( ). So the likelihood is the central thing that you want to understand. If you know everything about the likelihood, it is easy to act optimally.
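If it helps, here is the umbrella story written down as a toy likelihood. All the probabilities are numbers I made up for illustration, and $\theta$ is left out until the next section.

```python
# A toy likelihood for the umbrella story: how likely each reward is,
# given an action and a context. The numbers are invented.
P_HAPPY = {
    ("raining", "umbrella"):    0.8,  # dry, but carrying an umbrella
    ("raining", "no umbrella"): 0.1,  # soaked
    ("dry",     "umbrella"):    0.4,  # dry, but extra weight for nothing
    ("dry",     "no umbrella"): 0.9,  # dry and hands free
}

def likelihood(r, action, context):
    """P(r | a, x) for a binary 'happy' reward r in {0, 1}."""
    p = P_HAPPY[(context, action)]
    return p if r == 1 else 1.0 - p
```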
What about that strange circle??
As you might have noticed, I did not write anything about that strange circle $\theta$, which is called theta. (Mathematicians have a habit of indicating which parts are the hardest by giving them Greek letters, making it even harder to understand.) This $\theta$ represents the model parameters. These parameters are used when the relationship between the context+actions and the reward is more complex. For example, a model parameter might be how much your reward goes down if 1mm of rain falls on top of your head. Another model parameter might state how much your reward goes down if you take an umbrella. I just said that the likelihood is the central thing you want to understand; and central to the likelihood are the model parameters. If you know the model parameters $\theta$, you know how context+actions relate to reward and it is easy to act optimally.
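As a sketch of what such model parameters could look like, here is a hypothetical reward model where $\theta$ contains exactly those two parameters (unhappiness per mm of rain, and the cost of carrying an umbrella). The values are invented; the point is only that knowing $\theta$ makes acting optimally a simple lookup.

```python
# Hypothetical reward model: theta holds the two parameters from the text.
def expected_reward(theta, action, rain_mm):
    unhappiness_per_mm_rain, umbrella_carry_cost = theta
    rain_on_head = 0.0 if action == "umbrella" else rain_mm
    carry_cost = umbrella_carry_cost if action == "umbrella" else 0.0
    return -unhappiness_per_mm_rain * rain_on_head - carry_cost

# If you knew theta, acting optimally would just be picking the best action:
theta = (0.5, 1.0)  # invented values
actions = ["umbrella", "no umbrella"]
best = max(actions, key=lambda a: expected_reward(theta, a, rain_mm=3.0))
print(best)  # "umbrella": 3mm of rain (-1.5) hurts more than carrying it (-1.0)
```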
So how do we get to know these model parameters such that I can get maximum reward??
That is the essential question of the multi-armed bandit problem. Actually, it has two parts. You want to get to know the model parameters precisely by exploring all kinds of different actions in different contexts. But if you already know which action is good for a specific context, you want to exploit that action and get as much reward as possible. So if you are uncertain about your model parameters $\theta$, you might want to do some extra exploration. If you are pretty sure about your model parameters $\theta$, you are also pretty sure which action to take. This is known as the exploration versus exploitation trade-off.
You haven't said anything about this posterior
Key to this optimal behaviour is your (un)certainty about the model parameters $\theta$. And the posterior says exactly that: given all the previous rewards we got from previous actions in previous contexts, how much do you know about $\theta$? For example, if you have never been outside, you do not know how unhappy you become when rain falls on your head. In other words, you are very uncertain about the unhappiness-when-rain-on-head model parameter. If you have been out in the rain a few times, with and without an umbrella, you can start to learn something about this obscure model parameter.
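For example, you could track your belief about the unhappiness-when-rain-on-head parameter with a Beta distribution over the probability of ending up unhappy after a rainy, umbrella-less walk. That choice of distribution and the observations below are assumptions I picked for illustration, not part of Thompson Sampling itself.

```python
import numpy as np

# Assumed model: theta = probability of getting unhappy after a rainy walk
# without an umbrella, with a Beta distribution as our belief about it.
alpha, beta = 1.0, 1.0            # Beta(1, 1): never been outside, anything goes

# D = outcomes of a few rainy, umbrella-less walks (1 = unhappy, 0 = fine).
for got_unhappy in [1, 1, 0, 1]:
    alpha += got_unhappy
    beta += 1 - got_unhappy

# P(theta | D): the posterior now concentrates around higher values.
rng = np.random.default_rng(0)
print(rng.beta(alpha, beta, size=5))              # a few samples from the posterior
print("posterior mean:", alpha / (alpha + beta))  # 4 / 6, roughly 0.67
```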
Now what does Thompson Sampling suggest to do with all these uncertainties??
Thompson Sampling suggests something very simple: just pick a random model parameter from your posterior, take an action and observe what happens. For example, when you have never been outside before, the unhappiness-when-rain-on-head parameter can be anything. So we just pick one: we assume that we get really unhappy when rain falls on our head. We see that it is raining (context), so we take an umbrella (action) because our model parameter tells us that this is how we can get the maximum reward. And indeed, you observe that you get slightly grumpy from walking in the rain with an umbrella, but not really unhappy. We learn from this that rain+umbrella is grumpy. Next time it rains you again pick a random belief about what happens when rain falls on your head. This time it might be one that says rain doesn't bother you at all. However, once you are halfway to your destination you are wringing wet, and you learn that rain without an umbrella is really, really bad. This reduces your uncertainty about unhappiness-when-rain-on-head, because now you know it is probably high.
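Putting the whole story together, here is a hedged sketch of the umbrella example as a Thompson Sampling loop. The true unhappiness value, the known umbrella cost and the Beta posterior are all assumptions I added to make it runnable; a real contextual bandit would model more than one parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: the unknown model parameter is how unhappy rain on your
# head makes you (0..1); the umbrella carry cost is assumed known for simplicity.
TRUE_RAIN_UNHAPPINESS = 0.8
UMBRELLA_COST = 0.2

# Beta posterior over the rain-on-head unhappiness parameter.
alpha, beta = 1.0, 1.0

for day in range(50):
    raining = rng.random() < 0.5                    # context
    # Thompson step: pick one random belief from the posterior...
    sampled_unhappiness = rng.beta(alpha, beta)
    # ...and act as if that belief were true.
    if raining and sampled_unhappiness > UMBRELLA_COST:
        action = "umbrella"
    else:
        action = "no umbrella"
    # Observe what happens; only rainy, umbrella-less walks teach us
    # about the rain-on-head parameter.
    if raining and action == "no umbrella":
        got_unhappy = int(rng.random() < TRUE_RAIN_UNHAPPINESS)
        alpha += got_unhappy
        beta += 1 - got_unhappy

# Once the posterior is confident the parameter is high, you mostly take the
# umbrella and stop exploring: the exploration/exploitation trade-off at work.
print("posterior mean of rain unhappiness:", alpha / (alpha + beta))
```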
This sounds so simple!!
Yep, it is not that complex. The difficult part is sampling from a model parameter posterior. Getting and maintaining a distribution over all your model parameters that is also appropriate for your specific problem is hard. But... it is definitely doable :).