For a long time I have wondered about the seemingly common belief
that there is some fundamental difference between fixed and random effects
in (generally nonlinear) mixed effects models. This belief is
stated, for example, by Bates in the following response:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q1/003447.html
Bates clearly states that he believes there is a fundamental difference between
fixed and random effects, so that they cannot be combined.
I think he is wrong,
and I hope to convince a few readers of an alternative point of view.
I take a frequentist approach, so what I want to do is define
a notion of profile likelihood for a function of both the fixed and
random effects.
To motivate the discussion, suppose we have a two-parameter model with
parameters $x$ and $u$ (nothing about random effects so far). Let
$L(x,u)$ be the likelihood function, where we suppress any reference to the data.
Let $g(x,u)$ be any (nice) function of x and u. The profile likelihood $P_g(t)$
for the function $g$ is given by
$$P_g(t)=\max_{x,u} \{L(x,u)\ |\ g(x,u)=t \} \eqno(1) $$
I believe that no one would argue with this. Now suppose we have a prior
probability distribution $p(u)$ for u. Then I would claim that the
profile likelihood for $g$ still makes sense, but we should modify (1)
by including the prior:
$$P_g(t)=\max_{x,u} \{L(x,u)p(u)\ |\ g(x,u)=t \} \eqno(2) $$
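To make definition (2) concrete, here is a minimal numerical sketch:
maximize $L(x,u)p(u)$ subject to the constraint $g(x,u)=t$. The toy model
(a single observation $y\sim N(x+u,1)$, prior $u\sim N(0,1)$, and
$g(x,u)=x+u$) is assumed purely for illustration; dropping the factor
$p(u)$ recovers definition (1).

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.stats import norm

y = 1.3  # a single illustrative observation

def neg_log_post(theta):
    # -log( L(x, u) p(u) ) for y ~ N(x + u, 1) and prior u ~ N(0, 1)
    x, u = theta
    return -(norm.logpdf(y, loc=x + u) + norm.logpdf(u))

def g(theta):
    # hypothetical function of interest, g(x, u) = x + u
    return theta[0] + theta[1]

def profile(t):
    # P_g(t): maximize L(x, u) p(u) subject to g(x, u) = t
    con = NonlinearConstraint(g, t, t)
    res = minimize(neg_log_post, x0=[0.0, 0.0], method="SLSQP",
                   constraints=[con])
    return np.exp(-res.fun)

for t in (0.5, 1.0, 1.5):
    print(t, profile(t))
```

For this toy model the constrained maximum over $u$ (with $x+u=t$ fixed)
is attained at $u=0$, so the printed values should match
$\phi(y-t)\,\phi(0)$, with $\phi$ the standard normal density.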
Note that since $u$ is a parameter with a prior, it is exactly what is
referred to as a random effect. So why do many people think that
random effect parameters are somehow different? The difference, I think,
comes from the usual practice of parameter estimation for them.
What makes random effects ``different'' is that there are a lot of them
in many models. As a result, to get useful estimates for the fixed
effects (or other parameters)
it is necessary to treat the random effects in a different way.
What we do is to integrate them out of the model. In the above
model we would form the marginal likelihood $F(x)$, where
$$F(x) = \int L(x,u)p(u)\,du$$
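For a single Gaussian random effect this integral can be done by
Gauss-Hermite quadrature. A sketch under the same toy assumptions as
above ($y\sim N(x+u,1)$, $u\sim N(0,1)$); for this model $F(x)$ has the
closed form of a $N(x,2)$ density evaluated at $y$, which gives a check.

```python
import numpy as np
from scipy.stats import norm

y = 1.3
# probabilists' Gauss-Hermite rule: integrates f(u) exp(-u^2/2) du
nodes, weights = np.polynomial.hermite_e.hermegauss(31)

def F(x):
    # F(x) = integral of L(x, u) p(u) du with p the N(0, 1) density
    return np.sum(weights * norm.pdf(y, loc=x + nodes)) / np.sqrt(2.0 * np.pi)

print(F(0.7))                                  # quadrature
print(norm.pdf(y, loc=0.7, scale=np.sqrt(2)))  # closed form: y ~ N(x, 2)
```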
Now the $u$ are gone. So if all we have is $F(x)$, it seems to make no sense
to talk about the profile likelihood for some function $g(x,u)$.
So to get information about the function $g(x,u)$ we should not
integrate over the parameter $u$. But what happens in the case where there
are many random effect parameters? Then I claim that we should
integrate over ``most'' but not all of them, in a sense I will make
precise. To motivate the construction, let there be $n$ random effects
$u=(u_1,u_2,\ldots,u_{n-1},u_n)$.
Consider the special case where the function $g(x,u)$ depends only on $u_n$ and is, in fact, the simplest function imaginable: $g(x,u)=u_n$.
Integrate over the random effects $u_1,u_2,\ldots,u_{n-1}$ to get
$$F(x,u_n) = \int L(x,u_1,\ldots,u_n)p(u_1,\ldots,u_n)\,du_1\,du_2\ldots du_{n-1}\eqno(3)$$
so that, as before, we can form the profile likelihood
$$P_g(t)=\max_{x,u_n} \{F(x,u_n)\ |\ u_n=t \} \eqno(4)$$
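In this special case the recipe is directly computable. The sketch below
extends the toy model to two random effects ($y_i\sim N(x+u_i,1)$ with
independent $N(0,1)$ priors, all assumed for illustration), integrates
$u_1$ out by quadrature as in (3), and profiles over the fixed effect
$x$ at $u_2=t$ as in (4).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y1, y2 = 0.8, 1.9
nodes, weights = np.polynomial.hermite_e.hermegauss(31)

def F(x, u2):
    # integrate u1 out of L(x, u1, u2) p(u1) p(u2), as in (3)
    int_u1 = np.sum(weights * norm.pdf(y1, loc=x + nodes)) / np.sqrt(2.0 * np.pi)
    return int_u1 * norm.pdf(y2, loc=x + u2) * norm.pdf(u2)

def profile_u2(t):
    # (4): maximize F(x, u2) over x with u2 fixed at t
    res = minimize_scalar(lambda x: -F(x, t))
    return -res.fun

for t in (-0.5, 0.0, 0.5):
    print(t, profile_u2(t))
```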
How do we generalize $(4)$ so that it makes sense for an arbitrary
function $g(x,u)$? Well, notice that the definition of $F(x,u_n)$ in $(3)$
is the same as
$$F(x,s) = \lim_{\epsilon\rightarrow 0}{1\over\epsilon}
\int_{\{u\ |\ s-\epsilon/2<g(x,u)<s+\epsilon/2\}} L(x,u_1,\ldots,u_n)p(u_1,\ldots,u_n)\,du_1\,du_2\ldots du_n\eqno(5)$$
To see this, note that for the simple case $g(x,u)=u_n$,
$(5)$ is the same as
$$F(x,s)=\lim_{\epsilon\rightarrow 0}{1\over\epsilon}
\int_{\{u_n\ |\ s-\epsilon/2<u_n<s+\epsilon/2\}} F(x,u_n)\,du_n\eqno(6)$$
For a general function $g(x,u)$ we form the function $F(x,s)$
defined by $(5)$ and calculate the profile likelihood
$$P_g(s)=\max_{x} F(x,s) \eqno(7)$$
(the constraint $g(x,u)=s$ is already built into the definition of
$F(x,s)$, so no maximization over $u$ remains).
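The slab definition (5) can be checked numerically for small $\epsilon$.
The sketch below uses the two-random-effect toy model from above with
$x$ held fixed and $g(x,u)=u_1+u_2$ (an assumption for illustration);
for this $g$ the limit equals the line integral
$\int L(x,u_1,s-u_1)p(u_1)p(s-u_1)\,du_1$, which gives a check.

```python
import numpy as np
from scipy.stats import norm

y1, y2, x, s, eps = 0.8, 1.9, 0.5, 1.0, 0.05
u = np.linspace(-6.0, 6.0, 1201)
du = u[1] - u[0]
U1, U2 = np.meshgrid(u, u)

# L(x, u1, u2) p(u1) p(u2) on a grid
f = (norm.pdf(y1, loc=x + U1) * norm.pdf(y2, loc=x + U2)
     * norm.pdf(U1) * norm.pdf(U2))

slab = np.abs(U1 + U2 - s) < eps / 2.0  # the set in definition (5)
print(np.sum(f[slab]) * du * du / eps)  # crude (1/eps) * slab integral

# exact limit for g = u1 + u2: integrate along the line u2 = s - u1
line = (norm.pdf(y1, loc=x + u) * norm.pdf(y2, loc=x + s - u)
        * norm.pdf(u) * norm.pdf(s - u))
print(np.sum(line) * du)
```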
This profile likelihood is a well-defined concept and stands on its own.
However, to be useful in practice, one needs to be able to calculate
its value, at least approximately.
I believe that for many models the function $F(x,s)$ can be
approximated well enough using a variant of the Laplace approximation.
Define $\hat x(s),\hat u(s)$ by
$$ (\hat x(s),\hat u(s))= \mathop{\rm arg\,max}_{x,u} \{L(x,u)p(u)\ |\ g(x,u)=s\}$$
and let $H$ be the Hessian of $\log(L(x,u)p(u))$ with respect to the
parameters $x$ and $u$, evaluated at this point.
The level sets of $g$ are $(m+n-1)$-dimensional submanifolds of the $(m+n)$-dimensional
parameter space, where there are $m$ fixed effects and $n$ random effects.
We need to integrate the $n$-form $du_1\wedge du_2\wedge\ldots\wedge du_n$
over this manifold, where everything is linearized at
$(\hat x(s),\hat u(s))$. This involves a bit of elementary differential geometry.
Assume that $g_{x_m}(\hat x(s),\hat u(s))\ne 0$.
By reparameterizing we can assume that $\hat x(s)=0$ and $\hat u(s)=0$. Then consider the map
$$(x_1,x_2,\ldots,x_{m-1},u_1,u_2,\ldots,u_n) \rightarrow
(x_1,x_2,\ldots,x_{m-1},
{-\sum_{i=1}^{m-1}g_{x_i}x_i-\sum_{i=1}^ng_{u_i}u_i\over g_{x_m}},
u_1,u_2,\ldots,u_n)
$$
where $g_{x_i}$ denotes the
partial derivative of $g$ with respect to $x_i$,
evaluated at the maximum point.
This is a linear map of the $(m+n-1)$-dimensional space onto the tangent space of the
level set of $g$.
We can use it to compute the desired integral. First, the pullbacks of the
$1$-forms $du_i$ are simply themselves.
Restricted to the $u$ directions, the pullback of the Hessian is the quadratic form
$$T_{i,j} =H_{i+m,j+m}-{g_{u_i}\over g_{x_m}}H_{m,j+m}-{g_{u_j}\over g_{x_m}}H_{i+m,m}+{g_{u_i}g_{u_j}\over {g_{x_m}}^2}H_{m,m}\quad \hbox{\rm for}\ 1\le i,j\le n$$
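As a sketch, $T$ can be assembled in a few lines: the $u_i$ tangent
direction maps to the vector $e_{m+i}-(g_{u_i}/g_{x_m})e_m$, and $T$ is
just $H$ evaluated on those vectors. Dimensions and inputs below are
placeholders.

```python
import numpy as np

def pullback_T(H, grad_g, m, n):
    # H: (m+n) x (m+n) Hessian of log(L p) at the constrained maximum
    # grad_g: gradient (g_{x_1},...,g_{x_m},g_{u_1},...,g_{u_n})
    g_xm = grad_g[m - 1]         # partial of g wrt x_m, assumed nonzero
    g_u = grad_g[m:]
    V = np.zeros((m + n, n))     # columns: tangent vectors for u_1..u_n
    V[m:, :] = np.eye(n)
    V[m - 1, :] = -g_u / g_xm    # e_{m+i} - (g_{u_i}/g_{x_m}) e_m
    return V.T @ H @ V           # T_{ij} = v_i^T H v_j
```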
So the integral can be calculated (or approximated) via the Laplace approximation,
which is the usual formula involving the logarithm of the determinant of $-T$
(positive definite at the maximum), computed via the Cholesky decomposition.
The value of the Laplace approximation of the integral is
$$L(\hat x(s),\hat u(s))\,p(\hat u(s))\,(2\pi)^{n/2}\,|-T|^{-{1\over 2}}$$
where $|\cdot|$ denotes the determinant.
We still need to deal with the width of the level set of $g$ as
$\epsilon\rightarrow 0$.
To first order this has the value $\epsilon/\|\nabla g(\hat x(s),\hat u(s))\|$,
where $\nabla g(\hat x(s),\hat u(s))$ is the vector of partial derivatives of
$g$, namely $( g_{x_1}, g_{x_2}, \ldots, g_{x_m}, g_{u_1}, g_{u_2}, \ldots, g_{u_n})$,
so that the approximate value of $F$ at the maximum on the level set of $g$ is given by
$${L(\hat x(s),\hat u(s))\,p(\hat u(s))\,(2\pi)^{n/2}\,|-T|^{-{1\over 2}}\over \|\nabla g(\hat x(s),\hat u(s))\|}$$
This is the approximation to use when calculating the profile likelihood $(7)$.
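Putting the pieces together, here is a sketch of the whole computation
for the toy model used throughout (one fixed effect, two random effects,
now with $g(x,u)=x+u_1+u_2$ so that $g_{x_m}\ne 0$): the constrained
maximizer, a finite-difference Hessian of $\log(L\,p)$, the pullback $T$,
a Cholesky log-determinant, and the gradient correction. Everything here
is illustrative rather than a general implementation.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint
from scipy.stats import norm

y1, y2 = 0.8, 1.9
m, n = 1, 2  # one fixed effect x, two random effects u1, u2

def log_post(theta):
    # log( L(x, u) p(u) ) for y_i ~ N(x + u_i, 1), u_i ~ N(0, 1)
    x, u1, u2 = theta
    return (norm.logpdf(y1, loc=x + u1) + norm.logpdf(y2, loc=x + u2)
            + norm.logpdf(u1) + norm.logpdf(u2))

def g(theta):
    return theta[0] + theta[1] + theta[2]  # g(x, u) = x + u1 + u2

grad_g = np.array([1.0, 1.0, 1.0])         # constant for this linear g

def num_hessian(f, theta, h=1e-4):
    # central second differences; adequate for this smooth toy model
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * h * h)
    return H

def laplace_profile(s):
    # step 1: (x_hat(s), u_hat(s)) = argmax of L p subject to g = s
    con = NonlinearConstraint(g, s, s)
    res = minimize(lambda th: -log_post(th), x0=np.zeros(m + n),
                   method="SLSQP", constraints=[con])
    theta_hat = res.x
    # step 2: Hessian of log(L p) and its pullback T to the u-directions
    H = num_hessian(log_post, theta_hat)
    V = np.zeros((m + n, n))
    V[m:, :] = np.eye(n)
    V[m - 1, :] = -grad_g[m:] / grad_g[m - 1]
    T = V.T @ H @ V
    # step 3: log det(-T) via Cholesky, then assemble the final formula
    half_logdet = np.sum(np.log(np.diag(np.linalg.cholesky(-T))))
    log_val = (log_post(theta_hat) + 0.5 * n * np.log(2 * np.pi)
               - half_logdet - np.log(np.linalg.norm(grad_g)))
    return np.exp(log_val)

for s in (0.5, 1.0, 1.5):
    print(s, laplace_profile(s))
```

Since $g$ is linear and the toy model Gaussian, the level set is affine
and the integrand on it exactly Gaussian, so the Laplace step introduces
no error here beyond the finite differences; for genuinely nonlinear
mixed effects models it is, as stated above, an approximation.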