Hidden state models vs. stateless models for time series regression

Question

This is a quite generic question: assume I want to build a model to predict the next observation based on the previous $N$ observations ($N$ can be a parameter to optimize experimentally). So we basically have a sliding window of input features to predict the next observation.

I can use a Hidden Markov Model approach, i.e. Baum-Welch to estimate a model, then Viterbi to predict a current state based on the last $N$ observations, then predict the most likely next state based on the current state, and then predict the next observation using the most likely next state and the HMM parameters (or variants such as find the predictive distribution of the next observation).

Or I can use a much simpler approach, using a stateless model (which can get as input the previous $N$ observations), e.g. SVM, linear regression, splines, regression trees, nearest neighbors, etc. Such models are based on minimizing some prediction error over the training set and are therefore, conceptually, much simpler than a hidden state based model.

Can someone share her/his experience in dealing with such a modelling choice? What would speak in favour of the HMM and what in favour of a regression approach? Intuitively one should take the simpler model possible to avoid over-fitting; this speaks in favour of a stateless approach...We also have to consider that both approaches get the same input data for training (I think this implies that if we do not incorporate additional domain knowledge in the modelling of a hidden state model, e.g. fix certain states and transition probabilities, there is no reason of why a hidden state model should perform better). At the end one can of course play with both approaches and see what performs better on a validation set, but some heuristics based on practical experience might also be helpful...

Note: for me it is important to predict only certain events; I prefer a model which predicts few "interesting/rare" events well, rather than a model which predicts "average/frequent" events but the interesting ones not so well . Perhaps this has an implication for the modelling choice. Thanks.

Can you clarify why you believe regression models are necessarily stateless? Dynamic linear regression models (in which previous values of the predictand are included on the right-hand side of the model equation) would seem very much to be state-conditioned. But perhaps I am missing something. — Alexis, May 01 '14 at 18:12
thanks for reading the question. I would say it is a bit a question of semantics, I also give an example of regression models which include the n-past observation values on the right hand side of the model, such a model is of course dynamic. However, I was more referring to the concept of a hidden/latent variable for which usually EM techniques are used to find the model vs models for which we do not have such hidden states (i.e. the states are observable, they are the observations). From a practical and pragmatic perspective, is it possible to tell what works better and when? — Mannaggia, May 01 '14 at 20:19
I missed the fact that you refer to past values of the prediction as inputs.Are such models the equivalent of a hidden state model (in principle they would just include more than N observation, replacing the equation for the past predictions)?For me the question is more if we observe the state and model it or if we infer the state given an assumption of the model. I am more interested in the practical aspect however, not the mathematical one. I.e. is it possible to tell under which conditions the one or the other approach works better?(I think no theorem can provide an answer to this question) — Mannaggia, May 01 '14 at 20:28
Perhaps this earlier question is one half of the question presented here. — Meadowlark Bradsher, May 02 '14 at 03:37

score 1 · Answer 1 · answered May 02 '14 at 09:40

In short, I think they are working in different learning paradigm.

State-space model (hidden state model) and other stateless model you mentioned are going to discover the underlying relationship of your time series in different learning paradigm: (1) maximum-likelihood estimation, (2) Bayes' inference, (3) empirical risk minimization.

In state-space model,

Let $x_t$ as the hidden state, $y_t$ as the observables, $t>0$ (assume there is no control)

You assume the following relationship for the model:

$P(x_0)$ as a prior

$P(x_t | x_{t-1})$ for $t \geq 1$ as how your state change (in HMM, it is a transition matrix)

$P(y_t | x_t)$ for $t \geq 1$ as how you observe (in HMM, it could be normal distributions that conditioned on $x_t$)

and $y_t$ only depends on $x_t$.

When you use Baum-Welch to estimate the parameters, you are in fact looking for a maximum-likelihood estimate of the HMM. If you use Kalman filter, you are solving a special case of Bayesian filter problem (which is in fact an application of Bayes' theorem on update step):

Prediction step:

$\displaystyle P(x_t|y_{1:t-1}) = \int P(x_t|x_{t-1})P(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$

Update step:

$\displaystyle P(x_t|y_{1:t}) = \frac{P(y_t|x_t)P(x_t|y_{1:t-1})}{\int P(y_t|x_t)P(x_t|y_{1:t-1}) \, dx_t}$

In Kalman filter, since we assume the noise statistic is Gaussian and the relationship of $P(x_t|x_{t-1})$ and $P(y_t|x_t)$ are linear. Therefore you can write $P(x_t|y_{1:t-1})$ and $P(x_t|y_{1:t})$ simply as the $x_t$ (mean + variance is sufficient for normal distribution) and the algorithm works as matrix formulas.

On the other hand, for other stateless model you mentioned, like SVM, splines, regression trees, nearest neighbors. They are trying to discover the underlying relationship of $(\{y_0,y_1,...,y_{t-1}\}, y_t)$ by empirical risk minimization.

For maximum-likelihood estimation, you need to parametrize the underlying probability distribution first (like HMM, you have the transition matrix, the observable are $(\mu_j,\sigma_j)$ for some $j$)

For application of Bayes' theorem, you need to have "correct" a priori $P(A)$ first in the sense that $P(A) \neq 0$. If $P(A)=0$, then any inference results in $0$ since $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.

For empirical risk minimization, universal consistency is guaranteed for any underlying probability distribution if the VC dimension of the learning rule is not growing too fast as the number of available data $n \to \infty$

Hidden state models vs. stateless models for time series regression

1 Answers1