For a long time I have wondered about the seemingly common belief
that there is some fundamental difference between fixed and random effects
in (generally nonlinear) mixed effects models. This belief is
stated, for example, by Bates in the following response:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q1/003447.html
Bates clearly states that he believes there is a fundamental difference between
fixed and random effects, so that they cannot be combined.
I think he is wrong,
and I hope to convince a few readers of an alternative point of view.
I take a frequentist approach, so what I want to do is define
a notion of profile likelihood for a function of both the fixed and
random effects.
To motivate the discussion, suppose we have a two-parameter model with
parameters $x$ and $u$ (nothing about random effects so far). Let
$L(x,u)$ be the likelihood function, where we suppress any reference to the data.
Let $g(x,u)$ be any (nice) function of x and u. The profile likelihood $P_g(t)$
for the function $g$ is given by
$$P_g(t)=\max_{x,u} \{L(x,u)\ |\ g(x,u)=t \} \eqno(1) $$
I believe that no one would argue with this. Now suppose we have a prior
probability distribution $p(u)$ for u. Then I would claim that the
profile likelihood for $g$ still makes sense, but we should modify (1)
by including the prior:
$$P_g(t)=\max_{x,u} \{L(x,u)p(u)\ |\ g(x,u)=t \} \eqno(2) $$
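To make definition (2) concrete, here is a minimal numerical sketch:
maximize $L(x,u)p(u)$ subject to the constraint $g(x,u)=t$. The toy model
(a single observation $y\sim N(x+u,1)$, prior $u\sim N(0,1)$, and
$g(x,u)=x+u$) is assumed purely for illustration; dropping the factor
$p(u)$ recovers definition (1).

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.stats import norm

y = 1.3  # a single illustrative observation

def neg_log_post(theta):
    # -log( L(x, u) p(u) ) for y ~ N(x + u, 1) and prior u ~ N(0, 1)
    x, u = theta
    return -(norm.logpdf(y, loc=x + u) + norm.logpdf(u))

def g(theta):
    # hypothetical function of interest, g(x, u) = x + u
    return theta[0] + theta[1]

def profile(t):
    # P_g(t): maximize L(x, u) p(u) subject to g(x, u) = t
    con = NonlinearConstraint(g, t, t)
    res = minimize(neg_log_post, x0=[0.0, 0.0], method="SLSQP",
                   constraints=[con])
    return np.exp(-res.fun)

for t in (0.5, 1.0, 1.5):
    print(t, profile(t))
```

For this toy model the constrained maximum over $u$ (with $x+u=t$ fixed)
is attained at $u=0$, so the printed values should match
$\phi(y-t)\,\phi(0)$, with $\phi$ the standard normal density.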
Note that since $u$ is a parameter with a prior, it is exactly what is
referred to as a random effect. So why do many people think that
random effect parameters are somehow different? The difference, I think,
comes from the usual practice of parameter estimation for them.
What makes random effects ``different'' is that there are a lot of them
in many models. As a result, to get useful estimates for the fixed
effects (or other parameters)
it is necessary to treat the random effects in a different way.
What we do is to integrate them out of the model. In the above
model we would form the marginal likelihood $F(x)$, where
$$F(x) = \int L(x,u)p(u)\,du$$
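For a single Gaussian random effect this integral can be done by
Gauss-Hermite quadrature. A sketch under the same toy assumptions as
above ($y\sim N(x+u,1)$, $u\sim N(0,1)$); for this model $F(x)$ has the
closed form of a $N(x,2)$ density evaluated at $y$, which gives a check.

```python
import numpy as np
from scipy.stats import norm

y = 1.3
# probabilists' Gauss-Hermite rule: integrates f(u) exp(-u^2/2) du
nodes, weights = np.polynomial.hermite_e.hermegauss(31)

def F(x):
    # F(x) = integral of L(x, u) p(u) du with p the N(0, 1) density
    return np.sum(weights * norm.pdf(y, loc=x + nodes)) / np.sqrt(2.0 * np.pi)

print(F(0.7))                                  # quadrature
print(norm.pdf(y, loc=0.7, scale=np.sqrt(2)))  # closed form: y ~ N(x, 2)
```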
Now the $u$ are gone. So if all we have is $F(x)$, it seems to make no sense
to talk about the profile likelihood for some function $g(x,u)$.
So to get information about the function $g(x,u)$ we should not
integrate over the parameter $u$. But what happens in the case where there
are many random effect parameters? Then I claim that we should
integrate over ``most'' but not all of them, in a sense I will make
precise. To motivate the construction, let there be $n$ random effects
$u=(u_1,u_2,\ldots,u_{n-1},u_n)$.
Consider the special case where the function $g(x,u)$ depends only on $u_n$ and is, in fact, the simplest function imaginable: $g(x,u)=u_n$.
Integrate over the random effects $u_1,u_2,\ldots,u_{n-1}$ to get
$$F(x,u_n) = \int L(x,u_1,\ldots,u_n)p(u_1,\ldots,u_n)\,du_1\,du_2\ldots du_{n-1}\eqno(3)$$
so that, as before, we can form the profile likelihood
$$P_g(t)=\max_{x,u_n} \{F(x,u_n)\ |\ u_n=t \} \eqno(4)$$
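In this special case the recipe is directly computable. The sketch below
extends the toy model to two random effects ($y_i\sim N(x+u_i,1)$ with
independent $N(0,1)$ priors, all assumed for illustration), integrates
$u_1$ out by quadrature as in (3), and profiles over the fixed effect
$x$ at $u_2=t$ as in (4).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y1, y2 = 0.8, 1.9
nodes, weights = np.polynomial.hermite_e.hermegauss(31)

def F(x, u2):
    # integrate u1 out of L(x, u1, u2) p(u1) p(u2), as in (3)
    int_u1 = np.sum(weights * norm.pdf(y1, loc=x + nodes)) / np.sqrt(2.0 * np.pi)
    return int_u1 * norm.pdf(y2, loc=x + u2) * norm.pdf(u2)

def profile_u2(t):
    # (4): maximize F(x, u2) over x with u2 fixed at t
    res = minimize_scalar(lambda x: -F(x, t))
    return -res.fun

for t in (-0.5, 0.0, 0.5):
    print(t, profile_u2(t))
```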
How do we generalize $(4)$ so that it makes sense for an arbitrary
function $g(x,u)$? Well, notice that the definition of $F(x,u_n)$ in $(3)$
is the same as
$$F(x,s) = \lim_{\epsilon\rightarrow 0}{1\over\epsilon}
\int_{\{u\ |\ s-\epsilon/2<g(x,u)<s+\epsilon/2\}} L(x,u_1,\ldots,u_n)p(u_1,\ldots,u_n)\,du_1\,du_2\ldots du_n\eqno(5)$$
To see this, note that for the simple case $g(x,u)=u_n$,
$(5)$ is the same as
$$F(x,s)=\lim_{\epsilon\rightarrow 0}{1\over\epsilon}
\int_{\{u_n\ |\ s-\epsilon/2<u_n<s+\epsilon/2\}} F(x,u_n)\,du_n\eqno(6)$$
For a general function $g(x,u)$ we form the function $F(x,s)$
defined by $(5)$ and calculate the profile likelihood
$$P_g(s)=\max_{x} F(x,s) \eqno(7)$$
(the constraint $g(x,u)=s$ is already built into the definition of
$F(x,s)$, so no maximization over $u$ remains).
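The slab definition (5) can be checked numerically for small $\epsilon$.
The sketch below uses the two-random-effect toy model from above with
$x$ held fixed and $g(x,u)=u_1+u_2$ (an assumption for illustration);
for this $g$ the limit equals the line integral
$\int L(x,u_1,s-u_1)p(u_1)p(s-u_1)\,du_1$, which gives a check.

```python
import numpy as np
from scipy.stats import norm

y1, y2, x, s, eps = 0.8, 1.9, 0.5, 1.0, 0.05
u = np.linspace(-6.0, 6.0, 1201)
du = u[1] - u[0]
U1, U2 = np.meshgrid(u, u)

# L(x, u1, u2) p(u1) p(u2) on a grid
f = (norm.pdf(y1, loc=x + U1) * norm.pdf(y2, loc=x + U2)
     * norm.pdf(U1) * norm.pdf(U2))

slab = np.abs(U1 + U2 - s) < eps / 2.0  # the set in definition (5)
print(np.sum(f[slab]) * du * du / eps)  # crude (1/eps) * slab integral

# exact limit for g = u1 + u2: integrate along the line u2 = s - u1
line = (norm.pdf(y1, loc=x + u) * norm.pdf(y2, loc=x + s - u)
        * norm.pdf(u) * norm.pdf(s - u))
print(np.sum(line) * du)
```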
This profile likelihood is a well-defined concept and stands on its own.
However, to be useful in practice, one needs to be able to calculate
its value, at least approximately.
I believe that for many models the function $F(x,s)$ can be
approximated well enough using a variant of the Laplace approximation.
Define $\hat x(s),\hat u(s)$ by
$$ (\hat x(s),\hat u(s))= \mathop{\rm arg\,max}_{x,u} \{L(x,u)p(u)\ |\ g(x,u)=s\}$$
and let $H$ be the Hessian of $\log(L(x,u)p(u))$ with respect to the
parameters $x$ and $u$, evaluated at this point.
The level sets of $g$ are $(m+n-1)$-dimensional submanifolds of the $(m+n)$-dimensional
parameter space, where there are $m$ fixed effects and $n$ random effects.
We need to integrate the $n$-form $du_1\wedge du_2\wedge\ldots\wedge du_n$
over this manifold, where everything is linearized at
$(\hat x(s),\hat u(s))$. This involves a bit of elementary differential geometry.
Assume that $g_{x_m}(\hat x(s),\hat u(s))\ne 0$.
By reparameterizing we can assume that $\hat x(s)=0$ and $\hat u(s)=0$. Then consider the map
$$(x_1,x_2,\ldots,x_{m-1},u_1,u_2,\ldots,u_n) \rightarrow
(x_1,x_2,\ldots,x_{m-1},
{-\sum_{i=1}^{m-1}g_{x_i}x_i-\sum_{i=1}^ng_{u_i}u_i\over g_{x_m}},
u_1,u_2,\ldots,u_n)
$$
where $g_{x_i}$ denotes the
partial derivative of $g$ with respect to $x_i$,
evaluated at the maximum point.
This is a linear map of the $(m+n-1)$-dimensional space onto the tangent space of the
level set of $g$.
We can use it to compute the desired integral. First, the pullbacks of the
$1$-forms $du_i$ are simply themselves.
Restricted to the $u$ directions, the pullback of the Hessian is the quadratic form
$$T_{i,j} =H_{i+m,j+m}-{g_{u_i}\over g_{x_m}}H_{m,j+m}-{g_{u_j}\over g_{x_m}}H_{i+m,m}+{g_{u_i}g_{u_j}\over {g_{x_m}}^2}H_{m,m}\quad \hbox{\rm for}\ 1\le i,j\le n$$
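As a sketch, $T$ can be assembled in a few lines: the $u_i$ tangent
direction maps to the vector $e_{m+i}-(g_{u_i}/g_{x_m})e_m$, and $T$ is
just $H$ evaluated on those vectors. Dimensions and inputs below are
placeholders.

```python
import numpy as np

def pullback_T(H, grad_g, m, n):
    # H: (m+n) x (m+n) Hessian of log(L p) at the constrained maximum
    # grad_g: gradient (g_{x_1},...,g_{x_m},g_{u_1},...,g_{u_n})
    g_xm = grad_g[m - 1]         # partial of g wrt x_m, assumed nonzero
    g_u = grad_g[m:]
    V = np.zeros((m + n, n))     # columns: tangent vectors for u_1..u_n
    V[m:, :] = np.eye(n)
    V[m - 1, :] = -g_u / g_xm    # e_{m+i} - (g_{u_i}/g_{x_m}) e_m
    return V.T @ H @ V           # T_{ij} = v_i^T H v_j
```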
So the integral can be calculated (or approximated) via the Laplace approximation,
which is the usual formula involving the logarithm of the determinant of $-T$
(positive definite at the maximum), computed via the Cholesky decomposition.
The value of the Laplace approximation of the integral is
$$L(\hat x(s),\hat u(s))\,p(\hat u(s))\,(2\pi)^{n/2}\,|-T|^{-{1\over 2}}$$
where $|\cdot|$ denotes the determinant.
We still need to deal with the width of the level set of $g$ as
$\epsilon\rightarrow 0$.
To first order this has the value $\epsilon/\|\nabla g(\hat x(s),\hat u(s))\|$,
where $\nabla g(\hat x(s),\hat u(s))$ is the vector of partial derivatives of
$g$, namely $( g_{x_1}, g_{x_2}, \ldots, g_{x_m}, g_{u_1}, g_{u_2}, \ldots, g_{u_n})$,
so that the approximate value of $F$ at the maximum on the level set of $g$ is given by
$${L(\hat x(s),\hat u(s))\,p(\hat u(s))\,(2\pi)^{n/2}\,|-T|^{-{1\over 2}}\over \|\nabla g(\hat x(s),\hat u(s))\|}$$
This is the approximation to use when calculating the profile likelihood $(7)$.
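Putting the pieces together, here is a sketch of the whole computation
for the toy model used throughout (one fixed effect, two random effects,
now with $g(x,u)=x+u_1+u_2$ so that $g_{x_m}\ne 0$): the constrained
maximizer, a finite-difference Hessian of $\log(L\,p)$, the pullback $T$,
a Cholesky log-determinant, and the gradient correction. Everything here
is illustrative rather than a general implementation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.stats import norm

y1, y2 = 0.8, 1.9
m, n = 1, 2  # one fixed effect x, two random effects u1, u2

def log_post(theta):
    # log( L(x, u) p(u) ) for y_i ~ N(x + u_i, 1), u_i ~ N(0, 1)
    x, u1, u2 = theta
    return (norm.logpdf(y1, loc=x + u1) + norm.logpdf(y2, loc=x + u2)
            + norm.logpdf(u1) + norm.logpdf(u2))

def g(theta):
    return theta[0] + theta[1] + theta[2]  # g(x, u) = x + u1 + u2

grad_g = np.array([1.0, 1.0, 1.0])         # constant for this linear g

def num_hessian(f, theta, h=1e-4):
    # central second differences; adequate for this smooth toy model
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * h * h)
    return H

def laplace_profile(s):
    # step 1: (x_hat(s), u_hat(s)) = argmax of L p subject to g = s
    con = NonlinearConstraint(g, s, s)
    res = minimize(lambda th: -log_post(th), x0=np.zeros(m + n),
                   method="SLSQP", constraints=[con])
    theta_hat = res.x
    # step 2: Hessian of log(L p) and its pullback T to the u-directions
    H = num_hessian(log_post, theta_hat)
    V = np.zeros((m + n, n))
    V[m:, :] = np.eye(n)
    V[m - 1, :] = -grad_g[m:] / grad_g[m - 1]
    T = V.T @ H @ V
    # step 3: log det(-T) via Cholesky, then assemble the final formula
    half_logdet = np.sum(np.log(np.diag(np.linalg.cholesky(-T))))
    log_val = (log_post(theta_hat) + 0.5 * n * np.log(2 * np.pi)
               - half_logdet - np.log(np.linalg.norm(grad_g)))
    return np.exp(log_val)

for s in (0.5, 1.0, 1.5):
    print(s, laplace_profile(s))
```

Since $g$ is linear and the toy model Gaussian, the level set is affine
and the integrand on it exactly Gaussian, so the Laplace step introduces
no error here beyond the finite differences; for genuinely nonlinear
mixed effects models it is, as stated above, an approximation.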