Here is an abstraction of an online learning / bandit problem that I've been working on over the summer. I haven't seen a problem like this before, and it looks quite interesting. If you know of any related work, I would appreciate references.
The Problem

The setting is that of multi-armed bandits. You have N arms. Each arm i has an unknown but fixed probability distribution over the rewards earned by playing it. For concreteness, let's assume that each arm i pays reward $10 with probability p[i] and reward $0 with probability 1-p[i].
In every round t you select a set S[t] of arms to play, paying a $1 fee up front for each selected arm. For each selected arm, you then collect a reward drawn from that arm's (unknown) reward distribution. All rewards are credited to your bank account, and all fees are deducted from it. In addition, you receive a credit of $1 at the beginning of every round.
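To make the accounting concrete, here is a minimal simulator sketch of this setting. The class name, API, and parameters are my own assumptions for illustration and are not part of the problem statement.

```python
import numpy as np

# Illustrative simulator (my sketch, not part of the original post).
# Arm i pays $10 with probability p[i] and $0 otherwise; each selected arm
# costs a $1 fee, and the account is credited $1 at the start of every round.
# The balance is required to stay non-negative at all times.

class BanditAccount:
    def __init__(self, p, seed=0):
        self.p = np.asarray(p, dtype=float)   # hidden from the player in the real problem
        self.rng = np.random.default_rng(seed)
        self.balance = 0.0

    def play_round(self, arms):
        """Play the set of arm indices `arms`; return the rewards observed."""
        self.balance += 1.0                    # $1 credit at the start of the round
        arms = list(arms)
        fees = float(len(arms))
        assert self.balance - fees >= 0, "playing this set would overdraw the account"
        self.balance -= fees                   # $1 fee per selected arm, paid up front
        hits = self.rng.random(len(arms)) < self.p[arms]
        rewards = 10.0 * hits
        self.balance += rewards.sum()
        return dict(zip(arms, rewards))
```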
The problem is to develop a policy that selects a subset of arms to play in each round so as to maximize profit (i.e., rewards minus fees for playing) over a long enough horizon, subject to the constraint that the account balance must remain non-negative at all times.
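Just to make the budget constraint concrete, here is a naive baseline sketch (mine, not a proposed solution): keep empirical mean estimates with an optimism bonus scaled to the $0-$10 reward range, consider arms whose optimistic value exceeds the $1 fee, and never select more arms than the current balance can pay for.

```python
import numpy as np

# Naive budget-aware baseline (illustrative sketch only).
def choose_arms(counts, reward_sums, balance, t):
    counts = np.asarray(counts, dtype=float)
    safe_counts = np.maximum(counts, 1.0)
    means = np.asarray(reward_sums, dtype=float) / safe_counts
    bonus = 10.0 * np.sqrt(2.0 * np.log(t + 1.0) / safe_counts)
    ucb = np.where(counts > 0, means + bonus, np.inf)   # unplayed arms get priority
    candidates = [i for i in np.argsort(-ucb) if ucb[i] > 1.0]
    affordable = int(balance)            # each selected arm costs a $1 fee
    return candidates[:affordable]
```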
I did not specify whether the per-arm reward distributions are drawn from a prior distribution or chosen by an adversary. Both choices make sense. The adversarial formulation is more appealing to me, but probably harder to make progress on. Here, the adversary chooses a vector (D1, D2, ..., DN) of distributions. Given the distributions, the optimal budget-balanced policy is to play all arms whose expected reward is greater than $1, since each play costs a $1 fee. Let P be the per-step profit of this optimal omniscient policy. I want my online policy to minimize regret (i.e., the loss of profit over a time horizon T) with respect to this omniscient policy.
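For concreteness, here is one way to write down the benchmark and the regret just described; the notation (mu_i, S_t for the set S[t], X_{i,t} for the reward drawn from D_i in round t) is mine, not from the original statement.

```latex
\[
  \mu_i \;=\; \mathbb{E}_{X \sim D_i}[X],
  \qquad
  P \;=\; \sum_{i=1}^{N} \max\bigl(\mu_i - 1,\; 0\bigr),
\]
\[
  \mathrm{Regret}(T) \;=\; T \cdot P \;-\;
  \mathbb{E}\!\left[\, \sum_{t=1}^{T} \Bigl( \sum_{i \in S_t} X_{i,t} \;-\; \lvert S_t \rvert \Bigr) \right].
\]
```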