Baum-Welch algorithm variation for Hidden Markov model with reward

Question

Following my previous question on the subject I would like to get your feedback on the following alternative solution. (The original solution to this question is the usage of the POMDP model proposed in the previous thread by @ziggystar.)

The Baum-Welch algorithm uses two stages to iteratively estimate and update the transition probabilities $a_{ij}$ and the emission probabilities $b_{ik}$:

The forward-backward procedure in which the parameters: $\alpha_i(t), \beta_i(t), \gamma_i(t)$ are calculated.
The update procedure in which the variables $\xi_{ij}(t), a_{ij}, b_{ik}$, are updated.

See wikipedia for more details about the algorithm and the variables definitions.

The following suggested solution is based on the original (Baum-Welch) algorithm with some changes to support rewards:

Forward and backward procedure: Keep the calculation of $\alpha_i(t), \beta_i(t), \gamma_i(t)$ as it is the original algorithm (intact).
Update procedure: Instead of using one transition matrix $A=\{a_{ij}\}$ we will have two transition matrices: $A^{+}=\{a_{ij}^+\}$ and $A^{-}=\{a_{ij}^-\}$ which represent the transition probabilities with the presence of reward and when a reward is missing, respectively. To estimate the transition matrices, instead of calculating $\xi_{ij}(t)$, we will calculate $\xi_{ij}^{+}(t)$ and $\xi_{ij}^{-}(t)$ as follows: \begin{equation} \xi^-_{ij}(t)=P(X_t=i, X_{t+1}=j| Y, \theta, R_t=-) = \frac{\alpha_i(t) a^-_{ij} \beta_i(t+1) b_j(y_{t+1})}{\sum_{k=1}^N{\alpha_k(t)}} \end{equation}

\begin{equation} \xi^+_{ij}(t)=P(X_t=i, X_{t+1}=j| Y, \theta, R_t=+) = \frac{\alpha_i(t) a^+_{ij} \beta_i(t+1) b_j(y_{t+1})}{\sum_{k=1}^N{\alpha_k(t)}} \end{equation} where $R_t=+/-$ represent the fact the the subject got a reward or didn't get a reward, respectively, at time $t$.

$a^+_{ij}$ and $a^-_{ij}$ will be updated as following:

\begin{equation} a^-_{ij} = \frac{\sum_{t=1}^{T-1}{\xi^-_{ij}(t)}}{\sum_{t=1}^{T-1}{\gamma_i(t)}} \end{equation}

\begin{equation} a^+_{ij} = \frac{\sum_{t=1}^{T-1}{\xi^+_{ij}(t)}}{\sum_{t=1}^{T-1}{\gamma_i(t)}} \end{equation}

The update procedure for $b_{ij}$ remains unchanged.

Before formulating the solution, you should formalize the problem. What is your algorithm calculating? — ziggystar, Jul 19 '16 at 20:42
The same parameters the Baum-Welch algorithm estimates, i.e., the transition probability matrix $a_{ij}$ and the emissions probabilities $b_{ik}$, but here, taking into account the reward. — Goek, Jul 20 '16 at 10:21
If you have a sequential decision problem, you need to find a value function or a policy. In case of hidden state this object is far more complex than the model for an HMM, as it maps beliefs to actions or values. This means your model/solution needs to be able to map a distribution over the states to an action/value. The value function is thus of type $\mathbb{R}^n \to \mathbb{R}$ with $n$ the number of hidden states. — ziggystar, Jul 20 '16 at 10:32

Baum-Welch algorithm variation for Hidden Markov model with reward

0 Answers0

Linked