Circular reference in states of the Bellman equation

Question

I want to formulate a game of darts as a dynamic program. The goal is to minimize the number of darts thrown while reaching checkout. A dart player has a score s. If his score is s = 2, there is only a single legal / possible move.

He can checkout by throwing at D1
Every other action is not legal and results in a so-called bust and the score remains the same, s = 2

The above can be formulated as:

where z represents the outcome on the dartboard

z ∈ {S1, S2, S3, ..., D1, D2, D3, ..., T1, T2, T3, SB, DB}

The corresponding numerical scores are h. So if z=T20, h=60

Now the Bellman equation for this DP satisfies:

Where V(s) is the value function when the score is equal to s. In addition, mu is the aiming location on the dartboard. The reward of each action is equal to 1 (because one dart is thrown). In addition, we have p, the probabilities of reaching s_i+1 when aiming at mu, and the corresponding value function of s_i+1. The terminal condition V(0) is equal to 0. All of the above can be found op page 13 here

So, inserting all for s = 2, excluding the illegal moves, leads to the following:

V(2) = min{1 + (V(0) * p(D1;mu))

Lets assume there are two possible values for mu, namely the center of D1 and the center of D19, which is on the opposite of the dartboard. Logically

p(z=D1, mu=center D1) > p(z=D1, mucenter D19)

However, V(0) = 0, so

p(z=D1, mu=center D1) * V(0) =  p(z=D1, mu=center D19) * V(0) =  0

So, if we want to minimize the Bellman curve, there is no optimal solution. Both options lead to the same result, V(2) = 1. However, logical reasoning would suggest aiming for mu = center D1 is the optimal / desired solution. What am I doing wrong here?

So, I thought maybe I should add the illegal moves into the Bellman equation. This would look as follows:

V(2) = min{1 + sum((V(0) * p(D1;mu)) + (V(2) * p(S1;mu)) + (V(2) * p(S2;mu)) + ... + (V(2) * p(T19;mu)) + (V(2) * p(T20;mu)))

However, now there is a circular reference for all the illegal moves. How should this be addressed? Should the illegal moves be added? Please help me explain what I am doing wrong.

Can you define all the symbols you're using? $f$, $h$, $z$, $p$, $V$... — fontanf, Sep 15 '21 at 08:30
z is {S1, S2, S3, ..., D1, D2, D3, ..., T18, T19, T20}. h is the numerical score of z. For example h(T20) = 60. p is the probability of throwing z when aiming for mu. V is the value function. f is the formula that subtracts h from s. s is the score {0, 1, ..., 501}. — HJA24, Sep 15 '21 at 08:37
Can you edit the question and incorporate this to it? It would be easier to read. The meaning of $V$ is still not clear for me, what does it represent physically? — fontanf, Sep 15 '21 at 09:52
done. V is the value function and represents the value of each score / state — HJA24, Sep 15 '21 at 10:39
From what I understand, V is the expected number of darts thrown. In this case, the recursive formula looks wrong to me. Can you justify it? Maybe it would be easier with $S(v)$ the expected score using $v$ darts — fontanf, Sep 15 '21 at 14:11
The formula can be found on page 13 of https://arxiv.org/pdf/2011.11031.pdf — HJA24, Sep 15 '21 at 17:51
ok, the question is clear now. I think you can add the link in the question. I'm not sure of the answer though, I'll let someone else answer — fontanf, Sep 16 '21 at 05:59
You don’t need to add illegal moves, but you do need to include all $z$ that can result from each legal move. Your first approach omits terms that correspond to missing the target. — RobPratt, Sep 16 '21 at 13:16
If the score is 2 and a player aims at D1 and misses, the outcome leads to a bust. Hence the resulting z is illegal right? So these terms are omitted rightfully I believe. In this particular state (s=2), we can only consider D1 right? — HJA24, Sep 16 '21 at 14:14
Isn't that just the "otherwise" case of $f$, where $s_{i+1}=s_i$? You must consider all possible outcomes $z$ for each action $\mu$ so that $\sum_z p(z;\mu) = 1$. — RobPratt, Sep 16 '21 at 22:26
Yes it is. But in the "otherwise" case we have a circular reference right? The value function of score s depends on the probability of the "otherwise" cases multiplied by the value function of the bust cases (which is the value function itself again) — HJA24, Sep 17 '21 at 09:13

RobPratt · Accepted Answer · 2021-09-17T21:53:32.633

For each state $s$, you want to compute the value function $V(s)$, which satisfies $V(0)=0$ and the Bellman equation for $s \not= 0$: $$V(s) = \min_\mu\left\{1+\sum_z V(f(s,z)) p(z;\mu)\right\}. \tag{13}$$ The $\min_\mu$ is over all legal actions $\mu$ in state $s$. The $\sum_z$ is over all outcomes $z$ that can occur when action $\mu$ is taken in state $s$. As a consequence, $\sum_z p(z;\mu)=1$ for each $\mu$, so you can use that property to help check that you have constructed $(13)$ correctly when you write everything out explicitly with actual numbers. In particular, when $s=2$ and the only legal action is $\mu=D_1$, the Bellman equation is $$V(2) = 1+V(0) p(D_1;D_1) + V(2) (1-p(D_1;D_1)) = 1+ V(2) (1-p(D_1;D_1)),$$ which implies that $V(2)=1/p(D_1;D_1)$.

If $f(s,z) < s$ for all $z$ such that $p(z;\mu)>0$, you can solve $(13)$ recursively by considering states in increasing order. But as you have observed, sometimes $f(s,z)=s$ when $p(z;\mu)>0$, which leads to an equation with a circular reference in that $V(s)$ appears both on the left-hand side and on the right-hand side as a minimand. In this situation, there are several approaches to solve $(13)$, including value iteration, policy iteration, and linear programming (LP). The LP is to maximize $\sum_s V(s)$ subject to \begin{align} V(s) &\le 1+\sum_z V(f(s,z)) p(z;\mu) &&\text{for $s \not= 0$ and all $\mu$} \\ V(0) &= 0 \\ V(s) &\ge 0 &&\text{for all $s$} \\ \end{align} The positive LP dual variables indicate which action or actions to take in each state.

For an application to the casino game blackjack (where the problem is to maximize expected profit rather than minimize the expected number of darts thrown), see this blog post.

Thank you! It finally makes sense now. Because we want to minimize the value function, we need the denominator to be as large as possible. Hence, we aim for D1 and not D19 when targeting D1.. — HJA24, Sep 18 '21 at 04:50
Thanks for help. I asked a new question regarding the same subject. Can you take a look? (https://or.stackexchange.com/questions/7021/bellman-equation-for-darts-that-minimizes-the-number-of-turns) — HJA24, Sep 30 '21 at 12:03

Circular reference in states of the Bellman equation

1 Answers1

Linked