Here is an abstraction of an online learning / bandit problem that I've been working on over the summer. I haven't seen a problem like this before, and it looks quite interesting. If you know of any related work, I would appreciate references.
The Problem

The setting is that of multi-armed bandits. You have N arms. Each arm i has an unknown but fixed probability distribution over the rewards earned by playing it. For concreteness, let's assume that each arm i pays reward $10 with probability p[i] and reward $0 with probability 1-p[i].
In every round t you select a set S[t] of arms to play, paying a $1 fee up front for each selected arm. For each selected arm, you then collect a reward drawn from that arm's (unknown) reward distribution. All rewards are credited to your bank account, and all fees are deducted from it. In addition, you receive a credit of $1 at the beginning of every round.
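To make the accounting concrete, here is a minimal simulator sketch of this setting. The class name, API, and parameters are my own assumptions for illustration and are not part of the problem statement.

```python
import numpy as np

# Illustrative simulator (my sketch, not part of the original post).
# Arm i pays $10 with probability p[i] and $0 otherwise; each selected arm
# costs a $1 fee, and the account is credited $1 at the start of every round.
# The balance is required to stay non-negative at all times.

class BanditAccount:
    def __init__(self, p, seed=0):
        self.p = np.asarray(p, dtype=float)   # hidden from the player in the real problem
        self.rng = np.random.default_rng(seed)
        self.balance = 0.0

    def play_round(self, arms):
        """Play the set of arm indices `arms`; return the rewards observed."""
        self.balance += 1.0                    # $1 credit at the start of the round
        arms = list(arms)
        fees = float(len(arms))
        assert self.balance - fees >= 0, "playing this set would overdraw the account"
        self.balance -= fees                   # $1 fee per selected arm, paid up front
        hits = self.rng.random(len(arms)) < self.p[arms]
        rewards = 10.0 * hits
        self.balance += rewards.sum()
        return dict(zip(arms, rewards))
```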
The problem is to develop a policy that selects a subset of arms to play in each round so as to maximize profit (i.e., rewards minus fees for playing) over a long enough horizon, subject to the constraint that the account balance must remain non-negative at all times.
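Just to make the budget constraint concrete, here is a naive baseline sketch (mine, not a proposed solution): keep empirical mean estimates with an optimism bonus scaled to the $0-$10 reward range, consider arms whose optimistic value exceeds the $1 fee, and never select more arms than the current balance can pay for.

```python
import numpy as np

# Naive budget-aware baseline (illustrative sketch only).
def choose_arms(counts, reward_sums, balance, t):
    counts = np.asarray(counts, dtype=float)
    safe_counts = np.maximum(counts, 1.0)
    means = np.asarray(reward_sums, dtype=float) / safe_counts
    bonus = 10.0 * np.sqrt(2.0 * np.log(t + 1.0) / safe_counts)
    ucb = np.where(counts > 0, means + bonus, np.inf)   # unplayed arms get priority
    candidates = [i for i in np.argsort(-ucb) if ucb[i] > 1.0]
    affordable = int(balance)            # each selected arm costs a $1 fee
    return candidates[:affordable]
```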
I did not specify whether the per-arm reward distributions are drawn from a prior distribution or chosen by an adversary. Both choices make sense. The adversarial formulation is more appealing to me, but probably harder to make progress on. Here, the adversary chooses a vector (D1, D2, ..., DN) of distributions. Given the distributions, the optimal budget-balanced policy is to play all arms whose expected reward is greater than $1, since each play costs a $1 fee. Let P be the per-step profit of this optimal omniscient policy. I want my online policy to minimize regret (i.e., the loss of profit over a time horizon T) with respect to this omniscient policy.
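For concreteness, here is one way to write down the benchmark and the regret just described; the notation (mu_i, S_t for the set S[t], X_{i,t} for the reward drawn from D_i in round t) is mine, not from the original statement.

```latex
\[
  \mu_i \;=\; \mathbb{E}_{X \sim D_i}[X],
  \qquad
  P \;=\; \sum_{i=1}^{N} \max\bigl(\mu_i - 1,\; 0\bigr),
\]
\[
  \mathrm{Regret}(T) \;=\; T \cdot P \;-\;
  \mathbb{E}\!\left[\, \sum_{t=1}^{T} \Bigl( \sum_{i \in S_t} X_{i,t} \;-\; \lvert S_t \rvert \Bigr) \right].
\]
```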