
I'm going through the proof of the Kalman filter equations in Shumway & Stoffer, Time Series Analysis and Its Applications. Could someone please tell me how equation (6.26) is justified? How can we say that the joint is normal? Could you please provide a reference for the result? For your convenience, here's the relevant chapter from the book. Thank you for your time.

https://www.stat.pitt.edu/stoffer/tsa4/Chapter6.pdf

Edit: On @jbowman's request, adding the math -

The state-equation in the basic Gaussian linear state-space model is given by

$ \mathbf{x}_t = \mathbf{\Phi}\mathbf{x}_{t-1}+\mathbf{w}_t, $ where $\mathbf{x}_t$ is a p-dimensional vector of reals and $\mathbf{w}_t$ is multivariate normal (MVN) with mean $\mathbf{0}$ and variance $\mathbf{Q}$.

The observation equation is given by $\mathbf{y}_t = \mathbf{A}_t\mathbf{x}_t + \mathbf{v}_t$, where $\mathbf{y}_t$ is a q-dimensional vector of reals, $\mathbf{A}_t$ is a $q\times p$ matrix, and $\mathbf{v}_t$ is MVN with mean zero and variance $\mathbf{R}$.

Suppose $\mathbf{x}_0$ is the initial state vector with mean 0 and variance $\mathbf{\Sigma}_0$. Further suppose $\mathbf{x}_0$, $\mathbf{w}_t$, and $\mathbf{v}_t$ are uncorrelated. Let $\mathbf{x}^s_t = E(\mathbf{x}_t\mid \mathbf{y}_{1:s})$ and $P^s_t$ be the variance of $\mathbf{x}_t\mid \mathbf{y}_{1:s}$.

The Kalman filter is given by

$\mathbf{x}^{t-1}_t = \mathbf{\Phi}\mathbf{x}^{t-1}_{t-1}$.

$P^{t-1}_t = \mathbf{\Phi}P^{t-1}_{t-1}\mathbf{\Phi}^{T} + \mathbf{Q}$.

$\mathbf{x}^t_t = \mathbf{x}^{t-1}_t + K_t(\mathbf{y}_t-\mathbf{A}_t\mathbf{x}^{t-1}_t)$.

$P^t_t = (I-K_t\mathbf{A}_t)P^{t-1}_t$,

where $K_t = P^{t-1}_t\mathbf{A}_t^{T}(\mathbf{A}_tP^{t-1}_t\mathbf{A}_t^{T}+\mathbf{R})^{-1}$.
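For concreteness, here is a minimal NumPy sketch of one filter step implementing exactly these recursions; the function and variable names (`kalman_step`, `Phi`, `A`, `Q`, `R`) are my own shorthand, not from the book.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, Phi, A, Q, R):
    """One filter step: predict x_t^{t-1}, P_t^{t-1}, then update to x_t^t, P_t^t."""
    # Prediction: x_t^{t-1} = Phi x_{t-1}^{t-1},  P_t^{t-1} = Phi P_{t-1}^{t-1} Phi' + Q
    x_pred = Phi @ x_prev
    P_pred = Phi @ P_prev @ Phi.T + Q
    # Innovation eps_t = y_t - A_t x_t^{t-1} and its covariance A_t P_t^{t-1} A_t' + R
    eps = y - A @ x_pred
    S = A @ P_pred @ A.T + R
    # Gain: K_t = P_t^{t-1} A_t' (A_t P_t^{t-1} A_t' + R)^{-1}
    K = P_pred @ A.T @ np.linalg.inv(S)
    # Update: x_t^t = x_t^{t-1} + K_t eps_t,  P_t^t = (I - K_t A_t) P_t^{t-1}
    x_filt = x_pred + K @ eps
    P_filt = (np.eye(len(x_pred)) - K @ A) @ P_pred
    return x_filt, P_filt
```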

The first two equations above are easily obtained by expanding the definitions of $\mathbf{x}^{t-1}_t$ and $P^{t-1}_t$, respectively.

Consider the regression of $\mathbf{y}_t$ on $\mathbf{y}_{1:(t-1)}$ and define the residual $\mathbf{\epsilon}_t = \mathbf{y}_t - E(\mathbf{y}_t\mid\mathbf{y}_{1:(t-1)}) = \mathbf{y}_t - \mathbf{A}_t\mathbf{x}^{t-1}_t$. It can be shown by a straightforward expansion of the definition that $Cov(\mathbf{x}_t, \mathbf{\epsilon}_t\mid \mathbf{y}_{1:(t-1)}) = P^{t-1}_t\mathbf{A}_t^{T}$.
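For completeness, the expansion is short; this is a sketch using the definitions above and the fact that $\mathbf{x}^{t-1}_t$ is a function of $\mathbf{y}_{1:(t-1)}$ and hence conditionally constant. Since

$$\mathbf{\epsilon}_t = \mathbf{A}_t\mathbf{x}_t + \mathbf{v}_t - \mathbf{A}_t\mathbf{x}^{t-1}_t = \mathbf{A}_t(\mathbf{x}_t - \mathbf{x}^{t-1}_t) + \mathbf{v}_t,$$

and $\mathbf{v}_t$ is uncorrelated with both $\mathbf{x}_t$ and $\mathbf{y}_{1:(t-1)}$,

$$Cov(\mathbf{x}_t, \mathbf{\epsilon}_t\mid \mathbf{y}_{1:(t-1)}) = Cov(\mathbf{x}_t, \mathbf{A}_t\mathbf{x}_t\mid \mathbf{y}_{1:(t-1)}) = P^{t-1}_t\mathbf{A}_t^{T},$$

and, in the same way, $Var(\mathbf{\epsilon}_t\mid \mathbf{y}_{1:(t-1)}) = \mathbf{A}_tP^{t-1}_t\mathbf{A}_t^{T}+\mathbf{R}$.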

The proof in the book claims that the conditional distribution of $(\mathbf{x}_t, \mathbf{\epsilon}_t)$, given $\mathbf{y}_{1:(t-1)}$, is MVN, and the final Kalman update equation for $\mathbf{x}^t_t$ is obtained by conditioning $\mathbf{x}_t$ on $(\mathbf{\epsilon}_t, \mathbf{y}_{1:(t-1)})$, using standard results for the MVN (see: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions).
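For reference, here is a sketch of how that standard conditioning result would apply in the notation above, using $E(\mathbf{\epsilon}_t \mid \mathbf{y}_{1:(t-1)}) = \mathbf{0}$ together with the covariance and variance of $\mathbf{\epsilon}_t$ derived above:

$$\mathbf{x}^t_t = E(\mathbf{x}_t \mid \mathbf{\epsilon}_t, \mathbf{y}_{1:(t-1)}) = \mathbf{x}^{t-1}_t + P^{t-1}_t\mathbf{A}_t^{T}(\mathbf{A}_tP^{t-1}_t\mathbf{A}_t^{T}+\mathbf{R})^{-1}\mathbf{\epsilon}_t = \mathbf{x}^{t-1}_t + K_t(\mathbf{y}_t-\mathbf{A}_t\mathbf{x}^{t-1}_t),$$

$$P^t_t = Var(\mathbf{x}_t \mid \mathbf{\epsilon}_t, \mathbf{y}_{1:(t-1)}) = P^{t-1}_t - P^{t-1}_t\mathbf{A}_t^{T}(\mathbf{A}_tP^{t-1}_t\mathbf{A}_t^{T}+\mathbf{R})^{-1}\mathbf{A}_tP^{t-1}_t = (I-K_t\mathbf{A}_t)P^{t-1}_t.$$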

The question here is: why is the conditional distribution of $(\mathbf{x}_t, \mathbf{\epsilon}_t)$, given $\mathbf{y}_{1:(t-1)}$, MVN? Can we show it?

saipk
  • 105
  • 2
    This is basically a link-only question; please put the math in the question itself. It makes it easier for the answerers if the question is all in one place! – jbowman Oct 17 '18 at 15:58

2 Answers


State-space models assume you have the two distributions $g(y_t \mid x_t)$ and $f(x_t \mid x_{t-1})$. The first can be written in the form of an "observation equation," and the second as a "state transition equation." In the case of the Kalman filter, your model is linear and Gaussian, so both of these are normal distributions. Note that I am suppressing any dependence on parameters from the notation.

When you do filtering, you also work with the state predictive distribution, which is built from the state transition density and the filtering distribution at the previous time point: $$ p(x_t \mid y_{1:t-1}) = \int f(x_t \mid x_{t-1}) p(x_{t-1} \mid y_{1:t-1}) dx_{t-1}. $$
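In the linear Gaussian case this integral has a closed form. As a sketch: if $p(x_{t-1} \mid y_{1:t-1})$ is $N(x^{t-1}_{t-1}, P^{t-1}_{t-1})$ and $f(x_t \mid x_{t-1})$ is $N(\Phi x_{t-1}, Q)$, then standard Gaussian marginalization gives $$ p(x_t \mid y_{1:t-1}) = N\left(\Phi x^{t-1}_{t-1},\; \Phi P^{t-1}_{t-1}\Phi' + Q\right), $$ which matches the first two Kalman recursions in the question.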

You can also string some of these together to get $$ p(y_t, x_t \mid y_{1:t-1}) = g(y_t \mid x_t) p(x_t \mid y_{1:t-1}). $$ This follows from the conditional independence assumptions of the model.

One more step: they use the transformation theorem. They start with $$ \left[\begin{array}{c} x_t \\ y_t \end{array}\right] \bigg\rvert y_{1:t-1} $$ and transform it into $$ \left[\begin{array}{c} x_t \\ \epsilon_t \end{array}\right] \bigg\rvert y_{1:t-1}, $$ where $\epsilon_t = y_t - A x_t$. Using the transformation theorem they get $$ p(\epsilon_t, x_t \mid y_{1:t-1}) = p(\epsilon_t + A x_t, x_t \mid y_{1:t-1}) |\text{Jacobian}|, $$ and here the Jacobian determinant is $1$.

Another way to think about this transformation is as follows: $$ \left[\begin{array}{c} x_t \\ \epsilon_t \end{array}\right] = \left[\begin{array}{cc} I & 0 \\ -A & I \end{array}\right] \left[\begin{array}{c} x_t \\ y_t \end{array}\right]. $$ Any linear transformation of a Gaussian vector preserves normality, so just compute the new mean and new covariance and they should line up.
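As a quick simulation check of that last claim (a sketch; the dimensions, `A`, `mu`, and `Sigma` below are made up purely for illustration), draws transformed by $M = \left[\begin{array}{cc} I & 0 \\ -A & I \end{array}\right]$ should have mean $M\mu$ and covariance $M\Sigma M'$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2                                  # illustrative state/observation dimensions
A = rng.normal(size=(q, p))                  # illustrative observation matrix
mu = rng.normal(size=p + q)                  # some joint mean for (x_t, y_t) | y_{1:t-1}
L = rng.normal(size=(p + q, p + q))
Sigma = L @ L.T + np.eye(p + q)              # some positive-definite joint covariance

# The linear map taking (x_t, y_t) to (x_t, eps_t) with eps_t = y_t - A x_t
M = np.block([[np.eye(p), np.zeros((p, q))],
              [-A,        np.eye(q)]])

# Transformed draws should be Gaussian with mean M mu and covariance M Sigma M'
Z = rng.multivariate_normal(mu, Sigma, size=200_000)
W = Z @ M.T
print(np.abs(W.mean(axis=0) - M @ mu).max())                    # ~0 up to Monte Carlo error
print(np.abs(np.cov(W, rowvar=False) - M @ Sigma @ M.T).max())  # ~0 up to Monte Carlo error
```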

Edit: you asked why $(x_t, y_t) \mid y_{1:t-1}$ is multivariate normal if we only know that $p(y_t, x_t \mid y_{1:t-1})$ is proportional to $$ \exp\left[ -\frac{1}{2} \left\{ (y_t - Ax_t)'R^{-1}(y_t - A x_t) + (x_t - \Phi x^{t-1}_{t-1})'P_{t-1}^{-1}(x_t - \Phi x^{t-1}_{t-1}) \right\}\right], $$ where $P_{t-1} = \text{Cov}(x_t \mid y_{1:t-1})$.

First, we can find the mean vector of this joint distribution: $$ E \left[\begin{array}{c} y_t \\ x_t \end{array} \bigg\rvert y_{1:t-1} \right] = \left[\begin{array}{c} Ax_{t}^{t-1} \\ x_{t}^{t-1} \end{array}\right], $$ where $x_{t}^{t-1} = E(x_{t} \mid y_{1:t-1})$. Next, we can find the covariance matrix of this joint vector: $$ \text{Cov} \left[\begin{array}{c} y_t \\ x_t \end{array} \bigg\rvert y_{1:t-1} \right] = \left[\begin{array}{cc} AP_{t-1}A' + R & AP_{t-1} \\ P_{t-1}A' & P_{t-1} \end{array}\right]. $$ The inverse of this matrix is $$ \left[\begin{array}{cc} R^{-1} & -R^{-1}A \\ -A'R^{-1} & P^{-1}_{t-1} + A'R^{-1}A \end{array}\right]. $$

If you multiply everything out, you should get the same thing:

$$ \left( \left[\begin{array}{c} y_t \\ x_{t} \end{array}\right] - \left[\begin{array}{c} Ax_{t}^{t-1} \\ x_{t}^{t-1} \end{array}\right] \right)' \left[\begin{array}{cc} R^{-1} & -R^{-1}A \\ -A'R^{-1} & P^{-1}_{t-1} + A'R^{-1}A \end{array}\right] \left( \left[\begin{array}{c} y_t \\ x_{t} \end{array}\right] - \left[\begin{array}{c} Ax_{t}^{t-1} \\ x_{t}^{t-1} \end{array}\right] \right). $$
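As a numerical sanity check (a sketch with made-up matrices; `A`, `R`, `P`, and `m` below are illustrative, not from the book), one can verify both that the stated precision matrix is the inverse of the joint covariance and that its quadratic form equals the two-term exponent above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2                                   # illustrative state/observation dimensions
A = rng.normal(size=(q, p))                   # illustrative observation matrix
R = 0.5 * np.eye(q)                           # observation noise covariance (made up)
P = np.eye(p) + 0.3 * np.ones((p, p))         # Cov(x_t | y_{1:t-1}) (made up, positive definite)
Rinv, Pinv = np.linalg.inv(R), np.linalg.inv(P)

# Joint covariance of (y_t, x_t) given y_{1:t-1} and the claimed precision matrix
Sigma = np.block([[A @ P @ A.T + R, A @ P],
                  [P @ A.T,         P]])
Lam = np.block([[Rinv,        -Rinv @ A],
                [-A.T @ Rinv,  Pinv + A.T @ Rinv @ A]])
print(np.allclose(Sigma @ Lam, np.eye(p + q)))   # True: Lam is the inverse of Sigma

# The Gaussian quadratic form with this precision equals the two-term exponent
m = rng.normal(size=p)                        # plays the role of x_t^{t-1} (made up)
x, y = rng.normal(size=p), rng.normal(size=q)
d = np.concatenate([y - A @ m, x - m])
lhs = d @ Lam @ d
rhs = (y - A @ x) @ Rinv @ (y - A @ x) + (x - m) @ Pinv @ (x - m)
print(np.allclose(lhs, rhs))                  # True
```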

Taylor
  • 20,630

Regarding equation (6.26), I don't think you need normality to write down the expression for $E(\mathbf{X}_1 \mid \mathbf{X}_2)$ as it's done there: if $(\mathbf{X}_1, \mathbf{X}_2)^{T}$ has a joint distribution with mean $(\mathbf{\mu}_1, \mathbf{\mu}_2)^{T}$ and variance $\mathbf{\Sigma}$, partitioned appropriately into $\mathbf{\Sigma}_{11}$, $\mathbf{\Sigma}_{12}$, $\mathbf{\Sigma}_{21}$, and $\mathbf{\Sigma}_{22}$, the same expression holds as long as the conditional expectation is linear and the conditional covariance is constant. Therefore, the expression in (6.26) seems justified even without normality, provided those conditions can be shown to hold.
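For reference, the expression in question (stated here as a sketch, under the linearity and constant-covariance conditions just mentioned, and assuming $\mathbf{\Sigma}_{22}$ is invertible) is

$$E(\mathbf{X}_1 \mid \mathbf{X}_2) = \mathbf{\mu}_1 + \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \mathbf{\mu}_2), \qquad Var(\mathbf{X}_1 \mid \mathbf{X}_2) = \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}\mathbf{\Sigma}_{21}.$$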

Also, to comment on another reply to the original question: I'm not sure a linear combination of normal variables is necessarily normal; you need independence (or, more generally, joint normality) for that.

saipk
  • 105
  • A linear transformation of a normal random vector is still normal, even when you don’t start with independence. Also, the question is asking about normality, but you are correct that one can justify Kalman filtering after that assumption has been relaxed. – Taylor Oct 17 '18 at 18:33
  • That's correct. I was talking about a linear combination. Here we have $\mathbf{x}_t$ and $\mathbf{\epsilon}_t$; both are normal but not independent, as their covariance is not $0$, so a linear combination of them is not necessarily normal, and hence their joint is not necessarily normal. Yes, the question is on normality, but I was justifying that conditional expectation expression even without normality. – saipk Oct 17 '18 at 18:44
  • https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Affine_transformation – Taylor Oct 17 '18 at 19:01
  • Thanks. I agree with the linear transformation preserving normality. How is $\begin{pmatrix}\mathbf{x}_t \\ \mathbf{y}_{t} \end{pmatrix} \mid y_{1:t-1}$ normal? It's not obvious to me. – saipk Oct 18 '18 at 05:33
  • See my second equation – Taylor Oct 18 '18 at 11:08
  • So clearly $x_t\mid y_{1:t-1}$ is multivariate normal (MVN). Even though it's not clear that $y_t\mid y_{1:t-1}$ is MVN, it can be shown. How is their joint MVN? – saipk Oct 18 '18 at 23:08
  • I see a typo in my comment, and it wouldn't let me edit! So clearly $y_t\mid x_t$ is multivariate normal (MVN). Even though it's not clear that $x_t\mid y_{1:t-1}$ is MVN, it can be shown. How is their joint MVN? Can you please use a simpler example? Appreciate your time. – saipk Oct 18 '18 at 23:15
  • $g(y_t \mid x_t) p(x_t \mid y_{1:t-1}) = p(y_t \mid x_t, y_{1:t-1}) p(x_t \mid y_{1:t-1})$ by conditional independence. Then you can expand and simplify: $\frac{p(y_{1:t}, x_t)}{p(x_t, y_{1:t-1})}\frac{p(x_t, y_{1:t-1})}{p(y_{1:t-1})} = p(x_t, y_t \mid y_{1:t-1})$. Does that help? – Taylor Oct 19 '18 at 00:23
  • Thanks. Your second equation is correct. My question was about normality. How is $(x_t, y_t \mid y_{1:t-1})$ multivariate normal? I agree with your second equation; that is clear from the definition of conditioning. – saipk Oct 19 '18 at 03:19
  • my mistake. It gets kind of messy to show, but if you write the large vector $(y_t',x_t')'$, and then write out the multivariate normal density, where both means and both variances are conditional on $y_{1:t-1}$, then it should look the same as something proportional to $\exp\left[-\frac{1}{2}(y_t - Ax_t)'R^{-1}(y_t - Ax_t) \right]\exp\left[-\frac{1}{2} (x_t - \Phi x_{t-1})'Q^{-1}(x_t - \Phi x_{t-1})\right]$. If you asked a separate question I'd be happy to give a more detailed answer – Taylor Oct 19 '18 at 16:11
  • It's actually me who asked the original question (I ended up logging in with Google from my home laptop -- so ended up creating a new account!). I was working out the algebra yesterday and haven't finished yet. Do you want to move your comment to an answer? I'll also type up the algebra I have so far. That way I can upvote. Thanks. – saipk Oct 19 '18 at 17:44
  • I just edited my answer to show some more detail – Taylor Oct 19 '18 at 19:43