
Say I have an independent variable, category, with 3 possible values (a, b, c), and this list is exhaustive (category can only take one of these 3 values). I want to build a model that uses category to predict a response variable.

One obvious way is to one-hot encode category by creating 2 dummy variables. It also seems I could use a random effects model (https://en.wikipedia.org/wiki/Random_effects_model). What is the difference between these 2 models in this case? I have read that if we know all of the levels then we should use a fixed effects model, but what happens if we still decide to use a random effects model?
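For concreteness, here is roughly what I mean by the dummy encoding (R's default treatment coding, with a as the reference level); the snippet below is just an illustration:

category = factor(c("a", "b", "b", "c"))
# the reference level a is absorbed into the intercept, leaving 2 dummy columns
model.matrix(~ category)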

What I care about most is getting a confidence interval for the predictions.

Tommy Do

1 Answer


It depends.

In the case where you have lots of observations of each category, the difference is negligible. If you create dummy variables, the calculations for a prediction or confidence interval are simple. You can also get confidence intervals for the random effects model, but the prediction interval might require a little more effort. Still doable though; I sketch both at the end of this answer.

Now, let's assume you have lots of observations of a and b, but only a few of c. Then the differences manifest. Random effects models partially pool estimates of the group effect towards the grand mean. This is a desirable property in some circumstances. The amount of pooling depends on how many observations are in the group; fewer means more pooling.
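To make "more pooling" concrete, the usual random-intercept shrinkage formula (a sketch that ignores the uncertainty in the fixed intercept) predicts the mean of group $j$ as roughly

$$\hat{\mu}_j \approx \lambda_j \bar{y}_j + (1 - \lambda_j)\bar{y}, \qquad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j},$$

where $\bar{y}_j$ is the sample mean of group $j$, $\bar{y}$ is the grand mean, $\tau^2$ is the between-group variance, $\sigma^2$ is the residual variance, and $n_j$ is the group size. As $n_j$ gets small, $\lambda_j$ moves towards 0 and the group's prediction is pulled harder towards the grand mean; as $n_j$ grows, $\lambda_j$ approaches 1 and you essentially recover the dummy-variable estimate.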

I've included an example in R below:

library(tidyverse)
library(lme4)

set.seed(0)

# Simulate data in which category c is much rarer than a and b
N = 50
category = sample(letters[1:3], size = N, prob = c(0.49, 0.49, 0.02), replace = TRUE)
X = model.matrix(~category)
y = X %*% c(10, 2, 5) + rnorm(N, 0, 2)

d = tibble(category, y)

# Random intercept model and the classic dummy-variable (OLS) model
ref_mod = lmer(y ~ 1 + (1 | category), data = d)
classic_mod = lm(y ~ category)

# Predicted mean outcome for each category under both models
predictions = tibble(category = letters[1:3])
predictions$classic = predict(classic_mod, newdata = predictions)
predictions$ref_mod = predict(ref_mod, newdata = predictions)

> predictions
# A tibble: 3 x 3
  category classic ref_mod
  <chr>      <dbl>   <dbl>
1 a           9.71    9.79
2 b          12.2    12.1
3 c          13.3    12.3

I sample 50 observations in which category c is far less prevalent than a and b. The classic model (OLS) makes more extreme predictions than the random effects model (category c is predicted to have a mean outcome of 13.3 by the OLS model and 12.3 by the random effects model). That is partial pooling at work.

Note that the other categories are affected by the partial pooling too, just not as strongly; they are also pulled towards the grand mean of 10.8. That is because they are more prevalent, so the estimates of their means are more precise.
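Since confidence intervals are what you care about most, here is a sketch of how to get them for both fitted models above. It reuses classic_mod, ref_mod and predictions from the code; predict_fn is just a helper name I made up, and the parametric bootstrap via lme4::bootMer is one option among several.

# OLS: confidence intervals for the category means come straight from predict()
predict(classic_mod, newdata = predictions, interval = "confidence")

# Random effects model: parametric bootstrap of the predicted category means.
# re.form = NULL includes the random intercepts; use.u = TRUE conditions on
# the estimated random effects rather than resampling them.
predict_fn = function(fit) predict(fit, newdata = predictions, re.form = NULL)
boot = bootMer(ref_mod, predict_fn, nsim = 500, use.u = TRUE)

# Percentile 95% intervals, one column per category (a, b, c)
apply(boot$t, 2, quantile, probs = c(0.025, 0.975))

You would typically expect the interval for category c to be the widest, since that group contributes the least data.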