Following my previous question on the subject I would like to get your feedback on the following alternative solution. (The original solution to this question is the usage of the POMDP model proposed in the previous thread by @ziggystar.)
The Baum-Welch algorithm uses two stages to iteratively estimate and update the transition probabilities $a_{ij}$ and the emission probabilities $b_{ik}$:
- The forward-backward procedure in which the parameters: $\alpha_i(t), \beta_i(t), \gamma_i(t)$ are calculated.
- The update procedure in which the variables $\xi_{ij}(t), a_{ij}, b_{ik}$, are updated.
See wikipedia for more details about the algorithm and the variables definitions.
The following suggested solution is based on the original (Baum-Welch) algorithm with some changes to support rewards:
- Forward and backward procedure: Keep the calculation of $\alpha_i(t), \beta_i(t), \gamma_i(t)$ as it is the original algorithm (intact).
- Update procedure: Instead of using one transition matrix $A=\{a_{ij}\}$ we will have two transition matrices: $A^{+}=\{a_{ij}^+\}$ and $A^{-}=\{a_{ij}^-\}$ which represent the transition probabilities with the presence of reward and when a reward is missing, respectively. To estimate the transition matrices, instead of calculating $\xi_{ij}(t)$, we will calculate $\xi_{ij}^{+}(t)$ and $\xi_{ij}^{-}(t)$ as follows: \begin{equation} \xi^-_{ij}(t)=P(X_t=i, X_{t+1}=j| Y, \theta, R_t=-) = \frac{\alpha_i(t) a^-_{ij} \beta_i(t+1) b_j(y_{t+1})}{\sum_{k=1}^N{\alpha_k(t)}} \end{equation}
\begin{equation} \xi^+_{ij}(t)=P(X_t=i, X_{t+1}=j| Y, \theta, R_t=+) = \frac{\alpha_i(t) a^+_{ij} \beta_i(t+1) b_j(y_{t+1})}{\sum_{k=1}^N{\alpha_k(t)}} \end{equation} where $R_t=+/-$ represent the fact the the subject got a reward or didn't get a reward, respectively, at time $t$.
- $a^+_{ij}$ and $a^-_{ij}$ will be updated as following:
\begin{equation} a^-_{ij} = \frac{\sum_{t=1}^{T-1}{\xi^-_{ij}(t)}}{\sum_{t=1}^{T-1}{\gamma_i(t)}} \end{equation}
\begin{equation} a^+_{ij} = \frac{\sum_{t=1}^{T-1}{\xi^+_{ij}(t)}}{\sum_{t=1}^{T-1}{\gamma_i(t)}} \end{equation}
- The update procedure for $b_{ij}$ remains unchanged.