
I have a dataframe that contains a different number of observations (trip chains) for each person. A trip chain consists of all the trips an individual takes between leaving home and returning home. The persons made different numbers of trip chains.

Trip_Chain_ID  Person_ID  Amount_Trip_Chains_Individual  Trip_Chain_Complexity  Gender    Employment  Age
            1          1                              1  "simple"               "male"    "No"         66
            2          2                              1  "complex"              "female"  "Yes"        17
            3          3                              3  "simple"               "male"    "Yes"        31
            4          3                              3  "simple"               "male"    "Yes"        31
            5          3                              3  "complex"              "male"    "Yes"        31
            6          4                              1  "simple"               "female"  "No"         44
            7          5                              1  "simple"               "male"    "No"         28
            8          6                              2  "simple"               "female"  "Yes"        52
            9          6                              2  "complex"              "female"  "Yes"        52
           10          7                              1  "complex"              "male"    "Yes"        37

I want to estimate a model with "Trip_Chain_Complexity" as the dependent variable to see the effect of the independent variables "Gender", "Employment" and "Age" on the probability of a trip chain being complex rather than simple.

Trip_Chain_Complexity ~ Gender + Employment + Age

Because some persons contribute multiple observations, the observations are not independent of each other. Hence, I cannot use a standard binary logistic regression model.

I was thinking about including dummy variables. Dummy Variable A would be 1 if a person has one trip chain and 0 otherwise. Dummy Variable B would be 1 if a person has two trip chains and 0 otherwise. And so on.
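For concreteness, such dummies could be built in R roughly like this (a sketch only, assuming the dataframe is called dat):

# 1 if the person made exactly one trip chain, 0 otherwise
dat$A <- as.integer(dat$Amount_Trip_Chains_Individual == 1)
# 1 if the person made exactly two trip chains, 0 otherwise
dat$B <- as.integer(dat$Amount_Trip_Chains_Individual == 2)
# ... and so on for higher counts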

But I think there would still be a problem, because the probability of a trip chain being simple rises with the number of trip chains an individual makes. So there is another relationship that this way of constructing the model does not account for.

Do you have any idea how to construct a model which fits this problem?

EDIT: This is only a snippet of my data set. My actual data set consists of more than 280,000 observations from almost 200,000 persons. 65% of the persons have only one observation, 27% have two observations, 7% have three observations, and some people have four, five or six observations. The number of independent variables is also much larger. I am using R.

  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Aug 17 '23 at 09:59
  • Mixed/hierarchical/multilevel models are the general approach for this, but this question needs more detail for a good answer. – mkt Aug 17 '23 at 11:21
  • Thanks! I just edited the question to give some more detail. I hope this makes the problem clearer. – Matherabauke Aug 17 '23 at 12:16

2 Answers


As I mentioned in the comment, mixed/hierarchical/multilevel models are the general way to address this form of non-independence. The model is modified to deal with the correlations between points measured on the same individuals by adding what is called a random effect.

The simplest form of this is a random intercept, in which each individual's predictions are offset from the population mean by a constant value, and those values are generally assumed to be drawn from a normal distribution. This is an excellent thread introducing the topic: What is the difference between fixed effect, random effect and mixed effect models? I recommend taking a look at it and our many other useful threads tagged [mixed-model], because there are some subtle issues to keep in mind when fitting and interpreting these models.
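In this setting, a random-intercept logistic model would take roughly the following form (a sketch; $i$ indexes persons, $j$ indexes trip chains within a person, and $u_i$ is the person-level random intercept):

$$\operatorname{logit} P(\text{complex}_{ij} = 1) = \beta_0 + \beta_1 \text{Gender}_i + \beta_2 \text{Employment}_i + \beta_3 \text{Age}_i + u_i, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2)$$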

In your case, because you have a binary response, you need a Generalised Linear Mixed Model (GLMM) with an appropriate link function (link functions allow you to predict responses with non-Gaussian distributions). Here's an example of how you can do this in R.

library(lme4)

# Random intercept for Person_ID accounts for repeated trip chains from the same individual;
# Trip_Chain_Complexity should be a factor (or coded 0/1) for family = binomial
glmer(Trip_Chain_Complexity ~ Gender + Employment + Age + (1 | Person_ID),
      data = dat, family = binomial)

The (1 | Person_ID) term at the end indicates the random intercept. This model can be extended to allow for random slopes, which are a topic worth visiting when you have a good grasp of the basics.
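As an illustration of the syntax only (a random slope is only meaningful for a predictor that varies within persons, which is not the case for the variables in the question's snippet):

# Hypothetical: lets the effect of Age vary across persons in addition to the intercept
glmer(Trip_Chain_Complexity ~ Gender + Employment + Age + (1 + Age | Person_ID),
      data = dat, family = binomial)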

This can also be addressed in a Bayesian framework, of course, and the brms package is a good place to start if you'd like to try that. rstan is the more complex and presumably more full-featured package for this. The [bayesian] tag has plenty of relevant threads on this.

EDIT: Peter Flom makes some good points below in his comment (and in his answer, +1). If your dataset is small and the snippet here is representative of it, then the model I coded above may not be identifiable. This is less of a concern in Bayesian statistics and so the brms / rstan approach should still work fine. This blog post by Andrew Gelman discusses identifiability from a Bayesian perspective: https://statmodeling.stat.columbia.edu/2014/02/12/think-identifiability-bayesian-inference/

The basic syntax for the model above is nearly identical in brms (using the bernoulli family, which is brms's family for a binary response):

library(brms)

# bernoulli() is brms's family for a binary (0/1) response
brm(Trip_Chain_Complexity ~ Gender + Employment + Age + (1 | Person_ID),
    data = dat, family = bernoulli())

brms chooses default priors that will work, but it is much better to set them yourself based on your knowledge. There are also other settings worth changing, such as the number of iterations to run the chains for and the number of cores. It has good vignettes online that you can use to understand this better: https://paul-buerkner.github.io/brms/articles/
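As a rough sketch of what that could look like (the prior values and sampler settings below are illustrative assumptions, not recommendations):

library(brms)

# Illustrative weakly informative priors on the log-odds scale; adjust to your domain knowledge
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),   # fixed-effect coefficients
  set_prior("exponential(1)", class = "sd")   # SD of the person-level random intercept
)

brm(
  Trip_Chain_Complexity ~ Gender + Employment + Age + (1 | Person_ID),
  data   = dat,
  family = bernoulli(),
  prior  = priors,
  iter   = 4000,   # iterations per chain
  chains = 4,
  cores  = 4       # run the chains in parallel
)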

mkt
  • In the sample data, most people have one observation, only one has more than two. And most of the information is identical across observations. I think that will cause problems here. – Peter Flom Aug 17 '23 at 17:57
  • @PeterFlom Fair point, if this data is representative there should be some identifiability issues; I will add a caveat. I'm not certain that it would make this impossible, though. Wouldn't a Bayesian model with strong priors be able to deal with this? – mkt Aug 17 '23 at 18:26
  • I tend to agree here that Bayesian inference is the way to go for mixed effects where some random effects are only estimated by a single point. Some options for Python users include PyMC, PyStan, Bambi, and tensorflow-probability. – Galen Aug 17 '23 at 18:49
  • Thanks, you two! Some remarks that I will also add in the original post: this is only a snippet of my data set. My actual data set consists of more than 280,000 observations from almost 200,000 persons. 65% of the persons have only one observation, 27% have two observations, 7% have three observations, and some people have four, five or six observations. The number of independent variables is also much larger. – Matherabauke Aug 17 '23 at 20:53
  • @mkt I am not a Bayesian (it's a character flaw). But I'm always leery of strong priors. I think my method avoids any such problems, but perhaps not. – Peter Flom Aug 17 '23 at 21:40
  • @Matherabauke That is good to know. I've now added a code snippet for the model in brms and a little more information about that. – mkt Aug 18 '23 at 06:32
  • @PeterFlom I understand. I come from a field where we often have pretty good theoretical constraints but noisy data, and so using strong priors seems quite reasonable to me. A principled frequentist approach (combined with a focus on better measurement) would also be fine, but it is often taught so poorly that it is easier to start anew with Bayesian training than to undo all the "p<0.05 => success" confusion. – mkt Aug 18 '23 at 10:47

If your sample data is representative of what your actual data will look like, then I don't think multilevel models are the way to go.

First, most of your people have one observation and only one has more than two.

Second, employment doesn't change across observations within a person (this is probably going to be mostly true for real data) and age doesn't change at all (but will change in a very orderly fashion with real data, and those changes probably won't matter). Sex will rarely change.

Finally, but most importantly, your interest is not (I think) in how the DV changes over time.

So, I think you should combine observations and make a new dependent variable, or, rather, the same one but with three levels instead of two. You could have "only simple", "only complex" and "both". Then you could do multinomial logistic.
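A sketch of that aggregation and model in R, using dplyr for the collapsing and nnet::multinom as one possible multinomial implementation (the column name complexity_mix is made up for illustration):

library(dplyr)
library(nnet)

# Collapse trip chains to one row per person with a three-level outcome
person_dat <- dat %>%
  group_by(Person_ID, Gender, Employment, Age) %>%
  summarise(
    complexity_mix = factor(case_when(
      all(Trip_Chain_Complexity == "simple")  ~ "only simple",
      all(Trip_Chain_Complexity == "complex") ~ "only complex",
      TRUE                                    ~ "both"
    )),
    .groups = "drop"
  )

# Multinomial logistic regression on the person-level outcome
fit <- multinom(complexity_mix ~ Gender + Employment + Age, data = person_dat)
summary(fit)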

A separate issue is the number of trip chains. You clearly can't have "both" if you only have one. I'm not sure how best to deal with this. You might want to divide your data into people with only one and people with more than one.
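That split is straightforward using the Amount_Trip_Chains_Individual column from the question:

persons_one   <- subset(dat, Amount_Trip_Chains_Individual == 1)
persons_multi <- subset(dat, Amount_Trip_Chains_Individual > 1)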

Another idea is to have two count models: one for the number of complex chains and one for the number of simple chains. These could be modeled with Poisson or negative binomial regression.
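A sketch of those two count models, assuming a person-level dataframe person_counts with hypothetical columns n_complex and n_simple holding the per-person counts:

library(MASS)  # for glm.nb

# Poisson model for the number of complex chains per person
glm(n_complex ~ Gender + Employment + Age, data = person_counts, family = poisson)

# Negative binomial alternative if the counts are overdispersed
glm.nb(n_complex ~ Gender + Employment + Age, data = person_counts)

An analogous pair of models would be fitted for n_simple.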

Peter Flom
  • Thanks! My actual data consist of a lot more observations and variables. I added some info to the original post. But it is true that I am not interested in how the DV changes over time. The observations from each person were made on the same day. I also already thought about dividing my data into persons with the same number of trip chains. – Matherabauke Aug 17 '23 at 21:01