(Using R) - this is my first time posting a stats question online, so please let me know if I'm on the wrong forum or haven't provided enough information and I'll do my best to fix it!
About the data and my goal: the best analogy I can think of is a language course whose final exam is a long conversation. Four times during the course I gather reports on student performance (for example, handwriting, speed of writing, reading ability). I want to know whether I can predict pass or fail for the course based on these four reports. I've created a demo dataset here:
set.seed(22)
reportsdata <- structure(list(Student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L,
9L, 9L, 9L, 9L),
TermReport = c("A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D"),
Handwriting = c(sample(x = 1:5, size = 36, replace = TRUE)),
Speedwriting = c(sample(x = 1:5, size = 36, replace = TRUE)),
Reading = c(sample(x = 1:5, size = 36, replace = TRUE)),
Loudness = c(sample(x = 1:5, size = 36, replace = TRUE)),
Enthusiasm = c("5", "5", "3", "5", "2", "4", "3", "NA", "1", "4",
"3", "3", "NA", "2", "1", "1", "1", "2", "2", "NA",
"3", "2", "4", "2", "4", "3","5", "2", "3", "1",
"2", "3", "5", "4", "NA", "5"),
EndCoursePassFail = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -36L))
Note that each student's end-of-course result has been retroactively applied to all of their rows, even though whether they would pass (1) or fail (0) was not known at the time of each report. My real dataset has the same structure but contains a little over 600 observations and 30 variables (filtered to keep only variables with fewer than 30% NA entries; e.g. sometimes a score for enthusiasm could not be obtained).
So far I've been trying a mixed-effects logistic regression with student and term report as random effects (bobyqa and Nelder_Mead are the only optimisers that don't fail; I need the ~ . syntax because there are too many variables to list out individually, and it keeps the example reproducible). E.g.:
library(lme4)

model <- glmer(EndCoursePassFail
               ~ . - Student - TermReport + (1 | Student) + (1 | TermReport),
               data = reportsdata,
               family = binomial,
               control = glmerControl(optimizer = "bobyqa",
                                      optCtrl = list(maxfun = 1e6)),
               nAGQ = 1,
               na.action = na.exclude)
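In case it helps with reproducing this, I believe the same model can also be specified by building the fixed-effects formula explicitly rather than with ~ . (a minimal sketch using the demo column names above):

# Sketch: construct the formula programmatically instead of using ~ .
predictors <- setdiff(names(reportsdata),
                      c("Student", "TermReport", "EndCoursePassFail"))

form <- reformulate(c(predictors, "(1 | Student)", "(1 | TermReport)"),
                    response = "EndCoursePassFail")

model2 <- glmer(form, data = reportsdata, family = binomial,
                control = glmerControl(optimizer = "bobyqa",
                                       optCtrl = list(maxfun = 1e6)))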
For both my original dataset and the sample data provided above (with the seed set to 22), the model produces convergence warnings:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0477113 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
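My understanding of the "Rescale variables?" hint is that the numeric predictors can be centred and scaled before fitting, roughly like this (a sketch using the demo column names; for the real data I would scale every numeric score column):

# Sketch: centre and scale the numeric score columns, then refit.
score_cols <- c("Handwriting", "Speedwriting", "Reading", "Loudness")
reports_scaled <- reportsdata
reports_scaled[score_cols] <- lapply(reports_scaled[score_cols],
                                     function(x) as.numeric(scale(x)))

model_scaled <- update(model, data = reports_scaled)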
But my sample dataset gives the following message if the seed is set to 1:
boundary (singular) fit: see help('isSingular')
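For the singular fit, this is the kind of check I understand help('isSingular') points to, to see which variance component has collapsed towards zero (using the model object from above):

# Sketch: inspect the fitted variance components after a singular fit.
isSingular(model, tol = 1e-4)
VarCorr(model)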
I think my issue could be due to perfect separation between Student and the course result, since the same end-of-course result has been retroactively applied to all of a student's rows.
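The pattern I mean can be seen by tabulating the outcome against the grouping variable (using the demo data): every student's four rows share the same outcome, so Student alone accounts for the result perfectly.

# Each student has a single, constant outcome across their four reports.
with(reportsdata, table(Student, EndCoursePassFail))

# Number of distinct outcomes per student (all 1s in the demo data)
tapply(reportsdata$EndCoursePassFail, reportsdata$Student,
       function(x) length(unique(x)))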
The question is - where to go from here?
Some thoughts:
I could average student scores across the term reports somehow (sketched below), so that I no longer have repeated measures and therefore don't get this separation. But this seems crude and feels like looking at only the tip of the iceberg.
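To make that concrete, something like this is what I have in mind (a sketch with base R aggregate on the demo data; Enthusiasm is left out here because it is stored as character):

# Sketch: average each student's numeric scores across the four term
# reports, then fit an ordinary logistic regression (one row per student).
avg_scores <- aggregate(
  cbind(Handwriting, Speedwriting, Reading, Loudness) ~ Student + EndCoursePassFail,
  data = reportsdata, FUN = mean
)

# With only nine students in the demo this may warn about fitted
# probabilities of 0 or 1, but it illustrates the idea.
fit_avg <- glm(EndCoursePassFail ~ . - Student,
               data = avg_scores, family = binomial)
summary(fit_avg)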
Looking at other answers to similar issues, I might need to switch to a penalised-likelihood fit from blme (R package), but I don't understand it well enough yet to know whether (and if so, how) this sort of perfect separation can be handled with blme.
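From what I can tell, the blme version would look roughly like this (a sketch only; I'm assuming a weakly informative normal prior on the fixed effects as in the GLMM FAQ examples, and the diag() dimension has to match the number of fixed-effect coefficients, which I think is 10 for the demo data: intercept + 4 numeric scores + 5 Enthusiasm dummies):

# Sketch: bglmer adds a prior on the fixed effects, which is one way a
# penalised likelihood can keep separated coefficients finite.
# diag(9, 10) assumes 10 fixed-effect coefficients (demo data only);
# this would need adjusting for the real data.
library(blme)

bmodel <- bglmer(EndCoursePassFail
                 ~ . - Student - TermReport + (1 | Student) + (1 | TermReport),
                 data = reportsdata,
                 family = binomial,
                 fixef.prior = normal(cov = diag(9, 10)))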
Or, I could ignore the repeated measures entirely and fit an ordinary (non-mixed) logistic regression - but of course this is also crude and throws away a lot of potentially useful information in the data.
Also, in case it is relevant: because there are so many scores to include in the full dataset, I want to later use stepAIC (or a loop, or equivalent) to roughly identify the 'best' model.
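For that later step, my current plan is a manual loop comparing AIC across candidate fixed-effect sets, since I'm not sure stepAIC handles glmer fits directly (a rough sketch on the demo column names; the real search would loop over larger subsets, not single predictors):

# Rough sketch: fit one model per candidate predictor and compare AIC.
candidates <- c("Handwriting", "Speedwriting", "Reading", "Loudness")

aics <- sapply(candidates, function(v) {
  f <- reformulate(c(v, "(1 | Student)", "(1 | TermReport)"),
                   response = "EndCoursePassFail")
  AIC(glmer(f, data = reportsdata, family = binomial))
})

sort(aics)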
Comments from EdM:

[…] Enthusiasm not only for levels 2 through 5 (versus the reference of 1) but also for EnthusiasmNA. Your data coding thus seems to have set up a separate level of Enthusiasm called "NA" for cases without data on Enthusiasm, rather than identifying those cases as having a true NA value (which usually leads to omission of the case from analysis). Some people use that type of coding deliberately as a way to deal with missing data, but it's not a good idea. See the link on multiple imputation in the answer. – EdM Apr 11 '23 at 13:01

[…] "NA". Thus the software assumed that those entries were normal character values that happened to be called "NA" rather than true NA values. Remove the quotes from them and you get what I think you intended. Cases with any true NA values in any of the variables in the model are omitted from analysis, which can lead to problems, particularly when there is a large fraction of NAs. Hence my suggestion to look into multiple imputation. – EdM Apr 14 '23 at 17:41