15

We are trying to figure out the intuitive reasoning behind integrating out the unobserved random effect. The specific formula is:

$$f\left(y_i|x_i;\beta, \sigma_c\right)=\int_{-\infty}^{+\infty}\left(\prod_{t=1}^{T}f_t(y_{it}|x_{it};c,\beta)\right)(1/\sigma_c)\phi(c/\sigma_c)~\mathrm dc. $$

I think we get most of the idea: you give each person a random effect to model unobserved characteristics such as IQ or health. But this random effect is unobserved, so you integrate it out.

But here comes my question which makes me doubt my understanding: do you calculate the likelihood for all the possible values the random effect can take?

Why do you then say that you have integrated it out? Does it mean that you let the random effect take all possible values and just pick the most probable $\beta$'s, at whatever value of the random effect?

User1865345
  • 8,202
Kasper
  • 3,399
  • 4
    In order to give a sensible answer to this question you are going to have to give us a bit more context on your model, etc. We can roughly guess its structure from the equation given, but it would be best to avoid guesswork here by seeing some of the initial definitions of the terms used. – Ben Nov 13 '22 at 08:36
  • 3
    Integrating out an effect is equivalent to obtaining the marginal distribution. Since you know the joint distribution, but do not have data to fit the joint-distribution, the best you can do is to fit the marginal distribution – user277126 Nov 20 '22 at 04:01
  • Possibly of interest: integrating out unobserved parameters can yield less biased, "better" estimators https://stats.stackexchange.com/a/151666/11646 – Paul Nov 26 '22 at 20:24

3 Answers

8

I'll offer an explanation based on marginal vs joint probability distributions. First, let's introduce some notation.

Let $Y_1,\ldots,Y_n$ be the random sample, with $Y_i$ the $i$th (random) observation vector, with $Y_i = (Y_{i1},\ldots,Y_{iT})$; suppose also that we have some observed covariates $x_i$. Thus the observable data are $(Y_i,x_i)$, for $i=1,\ldots,n$.

Typically, but not always, the values in $Y_i$ are repeated measures taken in time or space, pertaining to the same individual $i$. Thus, towards building the likelihood function, while we may assume independence between $Y_i$ and $Y_j$ for $i\neq j$, we cannot assume independence among the components of $Y_i$.

Here is where the random effects come in. We assume that for each individual there is an unobserved random effect $c_i$, where unobserved means that we cannot measure it, and random means that it is a random variable. Since all individuals are independent, we assume that the $c_i$ are also independent and that $c_i\sim N(0,\sigma_c^2)$; for simplicity, we can assume $\sigma_c^2$ is known.

Therefore the complete data for individual $i$ is $(Y_i, x_i,c_i)$ and thus the joint probability distribution is

$$f(y_i, c_i;x_i,\beta) = f(y_i|c_i;x_i,\beta)\phi(c_i).$$

If we were given a value of $c_i$, we could use this probability distribution to estimate $\beta$. It's clear that no one will ever give us such a $c_i$, so we have to come up with a better strategy.

The common$^1$ approach is to work unconditionally, or marginally, with respect to $c_i$. That is, the idea is to base the likelihood on

$$\bar{f}(y_i;x_i,\beta) = \int f(y_i|c_i;x_i,\beta)\phi(c_i)\,dc_{i}.$$

This guarantees that the estimator of $\beta$ does not depend on any particular $c_i$, although it does depend on the distribution of $c_i$; in this way we account for the dependence among the components of $Y_i$.
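To make this concrete, here is a minimal numerical sketch in Python, assuming (purely for illustration) a Gaussian random-intercept model $Y_{it}=\beta x_{it}+c_i+\varepsilon_{it}$ with unit error variance and known $\sigma_c$; all variable names and values are hypothetical. The marginal density is obtained by integrating the conditional density against the density of $c_i$, and no particular value of $c_i$ ever enters the result.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical random-intercept model (illustration only):
#   Y_it = beta * x_it + c_i + eps_it,  eps_it ~ N(0, 1),  c_i ~ N(0, sigma_c^2)

def conditional_density(y_i, x_i, c_i, beta):
    """f(y_i | c_i; x_i, beta): product of the T within-individual densities."""
    return np.prod(norm.pdf(y_i - beta * x_i - c_i))

def marginal_density(y_i, x_i, beta, sigma_c):
    """bar-f(y_i; x_i, beta): the random effect c_i is integrated out numerically."""
    integrand = lambda c: conditional_density(y_i, x_i, c, beta) * norm.pdf(c, scale=sigma_c)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

# Toy data for one individual with T = 4 repeated measures
rng = np.random.default_rng(0)
x_i = np.array([0.0, 1.0, 2.0, 3.0])
y_i = 0.5 * x_i + rng.normal(scale=1.5) + rng.normal(size=4)  # shared c_i induces dependence

# The likelihood contribution for beta no longer involves any particular c_i
print(marginal_density(y_i, x_i, beta=0.5, sigma_c=1.5))
```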

To sum up, the random effects are just uninteresting random quantities that we cannot measure. To deal with this missingness issue, the most reasonable choice is to eliminate them, i.e. to integrate them out of the joint distribution.

$^1$ There is another frequentist approach called Hierarchical Generalized Linear Models in which, essentially, $c_i$ is treated as a parameter and is estimated jointly with $\beta$, using the so-called $h$-likelihood.

utobi
  • 11,726
3

Do you calculate the likelihood for all the possible values the random effect can take?

Does it mean that you let the random effect take all possible values and just pick the most probable

It’s kind of both. This is a common technique in statistics that’s often not explained explicitly: you “integrate out” a random variable that you don’t know or don’t observe, so you can get the distribution of the random variable you’re interested in.

I’ll take a step back to basic probability theory to draw a parallel to the expected value. Remember the definition of expected value for discrete random variables is

$$\mathbb{E}(X) = \sum_{x \in \mathcal{X}} x P(X = x).$$

To get the average value (expectation) of a random variable, you take a weighted average of every possible value of $X$. The weight given to each $x$ is how likely that value is to occur, i.e. $P(X=x)$. Remembering that integrals act like sums over infinite sets, the “weighted average” of a continuous random variable is

$$\mathbb{E}(X) = \int_{-\infty}^{\infty} x f(x)dx.$$

Like $P(X=x)$ did in the discrete case, the $f(x)$ term tells us how much weight to assign to each value $x$ when we “add” everything together with the integral.
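As a quick sanity check, here is a small Python sketch (the distribution and sample size are arbitrary choices for illustration) showing that the integral really is this weighted average: computing the expectation by numerical integration and approximating it by brute-force sampling give the same number.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# E(X) for X ~ N(2, 1): a weighted average over every possible x,
# where the weight of each x is the density f(x).
mean_by_integral, _ = quad(lambda x: x * norm.pdf(x, loc=2), -np.inf, np.inf)

# The same weighted average approximated by brute force: draw many x's and
# let their relative frequencies play the role of the weights.
rng = np.random.default_rng(1)
mean_by_sampling = rng.normal(loc=2, size=200_000).mean()

print(mean_by_integral, mean_by_sampling)  # both are (approximately) 2
```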

It’s the same idea in your example. We want to know the distribution $f(y | x; \beta, \sigma_c)$, but we don’t observe $c$. So, we “integrate out” $y$’s dependence on $c$:

$$ \int_{-\infty}^{+\infty}f(y|x;c,\beta) (1/\sigma_c)\phi(c/\sigma_c)~\mathrm dc. $$

Notice that only $c$ varies in the integral; $y$, $x$, and $\beta$ are fixed. We're plugging in every possible value of $c$ (using an integral over the real numbers), weighting each value by its density ($(1/\sigma_c)\phi(c/\sigma_c)$), and averaging over everything. That gives us the distribution of $y \mid x; \beta$ after we've removed (“integrated out”) the dependence on the unobserved $c$.
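That averaging interpretation can be checked numerically. Below is a small Python sketch; the model and the values of $\beta$, $x$, $y$, and $\sigma_c$ are made up for illustration. The integral over $c$ agrees with a plain Monte Carlo average of $f(y\mid x;c,\beta)$ over random draws of $c$, which is exactly the “weight every value of $c$ and average” idea.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Made-up values, purely for illustration: y = beta*x + c + eps, eps ~ N(0, 1)
beta, x, y, sigma_c = 0.5, 1.0, 1.2, 2.0

def f_given_c(c):
    # f(y | x; c, beta): the density of y *if* we knew the random effect c
    return norm.pdf(y - beta * x - c)

# "Integrate out" c: weight every possible value of c by its density and sum
marginal, _ = quad(lambda c: f_given_c(c) * norm.pdf(c, scale=sigma_c), -np.inf, np.inf)

# Equivalently, average f(y | x; c, beta) over many random draws of c
rng = np.random.default_rng(2)
monte_carlo = f_given_c(rng.normal(scale=sigma_c, size=200_000)).mean()

print(marginal, monte_carlo)  # the two agree: the marginal is an average over c
```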

Eli
  • 2,652
0

Let us assume the following bivariate probability density function

$$f(x, y) = 4xy \hspace{50pt} 0<x\le1 \hspace{25pt} 0<y\le1$$

It has its maximum at $x^*=y^* = 1$.

We can now integrate out $y$ or $x$ and thus obtain the marginal probability density functions $f(x)$ and $f(y)$, respectively:

$$ f(x) = \int_{y=0}^{1} 4xy \hspace{3pt} dy = \left[ 2xy^2 \right]_0^1 = 2x$$

$$ f(y) = \int_{x=0}^{1} 4xy \hspace{3pt} dx = \left[ 2x^2y \right]_0^1 = 2y$$

Note that the values that maximize the marginal densities are still $x^*=y^* = 1$. It is not generally true that the values maximizing a joint probability density function also maximize the corresponding marginal density functions. It is true here, however, because

$$f(x, y) = 4xy = 2x \cdot 2y = f(x) \cdot f(y)$$

This shows us that the two random variables $X$ and $Y$ are independent of each other. It is intuitively clear that if two random variables are independent, their maxima also have to be independent of each other.
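If you want to verify the algebra, here is a short symbolic check in Python with SymPy (an optional aside, not part of the derivation above): it reproduces the two marginals and confirms that the joint density factorizes.

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)
f_xy = 4 * x * y  # joint density on (0, 1] x (0, 1]

# Integrate y (respectively x) out to get the marginal densities
f_x = sp.integrate(f_xy, (y, 0, 1))   # 2*x
f_y = sp.integrate(f_xy, (x, 0, 1))   # 2*y

print(f_x, f_y)                       # 2*x 2*y
print(sp.simplify(f_xy - f_x * f_y))  # 0: the joint factorizes, so X and Y are independent
```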

In a mixed-effects model we make the very strong assumption that the random effects are random variables that are completely independent of all fixed effects. While the joint likelihood of a mixed model is not a probability density function, the principle is the same here. Assuming independence, we can simply integrate out all random effects, and the parameter estimates that maximize the resulting marginal likelihood of the fixed effects will be identical to those that maximize the joint likelihood.

One big advantage is that maximizing the marginal functions separately is computationally less demanding. Of course, the independence assumption is an extremely strong one that will in practice never be completely met. This, however, is not surprising when we consider that a mixed model is a model, and that all models are wrong.
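For completeness, here is a minimal sketch of what this looks like in practice, assuming simulated data and the `MixedLM` implementation in statsmodels (any random-intercept data set would do): the fitted model reports an estimate of the random-intercept variance but no estimates of the individual $c_i$, because they have been integrated out of the likelihood that the software maximizes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a toy random-intercept data set: y = 2 + 0.5*x + c_i + eps
rng = np.random.default_rng(3)
n_groups, t = 50, 6
g = np.repeat(np.arange(n_groups), t)
c = rng.normal(scale=1.0, size=n_groups)[g]      # unobserved random intercepts
x = rng.normal(size=n_groups * t)
y = 2.0 + 0.5 * x + c + rng.normal(size=n_groups * t)
data = pd.DataFrame({"y": y, "x": x, "g": g})

# MixedLM maximizes the marginal (integrated) likelihood: the individual c_i
# never appear as parameters, only their variance does.
fit = smf.mixedlm("y ~ x", data, groups=data["g"]).fit()
print(fit.summary())
```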