(Using R) - this is my first time posting a stats question online, so please let me know if I'm on the wrong forum or haven't provided enough information and I'll do my best to fix it!
About the data and my goal: the best analogy I can think of is a language course whose final exam is a long conversation. Four times during the course I gather reports on student performance (for example, handwriting, speed of writing, reading ability). I want to know whether I can predict pass or fail for the course based on these four reports. I've created a demo dataset here:
set.seed(22)
reportsdata <- structure(list(Student = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L,
9L, 9L, 9L, 9L),
TermReport = c("A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D"),
Handwriting = c(sample(x = 1:5, size = 36, replace = TRUE)),
Speedwriting = c(sample(x = 1:5, size = 36, replace = TRUE)),
Reading = c(sample(x = 1:5, size = 36, replace = TRUE)),
Loudness = c(sample(x = 1:5, size = 36, replace = TRUE)),
Enthusiasm = c("5", "5", "3", "5", "2", "4", "3", "NA", "1", "4",
"3", "3", "NA", "2", "1", "1", "1", "2", "2", "NA",
"3", "2", "4", "2", "4", "3","5", "2", "3", "1",
"2", "3", "5", "4", "NA", "5"),
EndCoursePassFail = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -36L))
Note that each student's end-of-course result has been retroactively applied to all of their rows, even though whether they would pass (1) or fail (0) was not known at the time of each report. My real dataset has the same structure but contains a little over 600 observations and 30 variables (filtered to keep only variables with fewer than 30% NA entries; e.g. sometimes a score for enthusiasm could not be obtained).
So far I've been trying a mixed-effects logistic regression with student and term report as random effects (bobyqa and Nelder_Mead are the only optimisers that don't fail; I need the ~ . syntax because there are too many variables to list out individually, and it keeps the example reproducible). E.g.:
library(lme4)

model <- glmer(EndCoursePassFail
               ~ . - Student - TermReport + (1 | Student) + (1 | TermReport),
               data = reportsdata,
               family = binomial,
               control = glmerControl(optimizer = "bobyqa",
                                      optCtrl = list(maxfun = 1e6)),
               nAGQ = 1,
               na.action = na.exclude)
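In case it helps with reproducing this, I believe the same model can also be specified by building the fixed-effects formula explicitly rather than with ~ . (a minimal sketch using the demo column names above):

# Sketch: construct the formula programmatically instead of using ~ .
predictors <- setdiff(names(reportsdata),
                      c("Student", "TermReport", "EndCoursePassFail"))

form <- reformulate(c(predictors, "(1 | Student)", "(1 | TermReport)"),
                    response = "EndCoursePassFail")

model2 <- glmer(form, data = reportsdata, family = binomial,
                control = glmerControl(optimizer = "bobyqa",
                                       optCtrl = list(maxfun = 1e6)))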
For both my original dataset and the sample data provided above (with the seed set to 22), the model produces convergence warnings:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0477113 (tol = 0.002, component 1)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
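My understanding of the "Rescale variables?" hint is that the numeric predictors can be centred and scaled before fitting, roughly like this (a sketch using the demo column names; for the real data I would scale every numeric score column):

# Sketch: centre and scale the numeric score columns, then refit.
score_cols <- c("Handwriting", "Speedwriting", "Reading", "Loudness")
reports_scaled <- reportsdata
reports_scaled[score_cols] <- lapply(reports_scaled[score_cols],
                                     function(x) as.numeric(scale(x)))

model_scaled <- update(model, data = reports_scaled)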
But my sample dataset gives the following message if the seed is set to 1:
boundary (singular) fit: see help('isSingular')
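For the singular fit, this is the kind of check I understand help('isSingular') points to, to see which variance component has collapsed towards zero (using the model object from above):

# Sketch: inspect the fitted variance components after a singular fit.
isSingular(model, tol = 1e-4)
VarCorr(model)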
I think my issue could be due to perfect separation between Student and the course result, since the same end-of-course result has been retroactively applied to all of a student's rows.
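The pattern I mean can be seen by tabulating the outcome against the grouping variable (using the demo data): every student's four rows share the same outcome, so Student alone accounts for the result perfectly.

# Each student has a single, constant outcome across their four reports.
with(reportsdata, table(Student, EndCoursePassFail))

# Number of distinct outcomes per student (all 1s in the demo data)
tapply(reportsdata$EndCoursePassFail, reportsdata$Student,
       function(x) length(unique(x)))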
The question is - where to go from here?
Some thoughts:
I could average student scores across the term reports somehow (sketched below), so that I no longer have repeated measures and therefore don't get this separation. But this seems crude and feels like looking at only the tip of the iceberg.
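To make that concrete, something like this is what I have in mind (a sketch with base R aggregate on the demo data; Enthusiasm is left out here because it is stored as character):

# Sketch: average each student's numeric scores across the four term
# reports, then fit an ordinary logistic regression (one row per student).
avg_scores <- aggregate(
  cbind(Handwriting, Speedwriting, Reading, Loudness) ~ Student + EndCoursePassFail,
  data = reportsdata, FUN = mean
)

# With only nine students in the demo this may warn about fitted
# probabilities of 0 or 1, but it illustrates the idea.
fit_avg <- glm(EndCoursePassFail ~ . - Student,
               data = avg_scores, family = binomial)
summary(fit_avg)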
Looking at other answers to similar issues, I might need to switch to a penalised-likelihood fit from blme (R package), but I don't understand it well enough yet to know whether (and if so, how) this sort of perfect separation can be handled with blme.
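From what I can tell, the blme version would look roughly like this (a sketch only; I'm assuming a weakly informative normal prior on the fixed effects as in the GLMM FAQ examples, and the diag() dimension has to match the number of fixed-effect coefficients, which I think is 10 for the demo data: intercept + 4 numeric scores + 5 Enthusiasm dummies):

# Sketch: bglmer adds a prior on the fixed effects, which is one way a
# penalised likelihood can keep separated coefficients finite.
# diag(9, 10) assumes 10 fixed-effect coefficients (demo data only);
# this would need adjusting for the real data.
library(blme)

bmodel <- bglmer(EndCoursePassFail
                 ~ . - Student - TermReport + (1 | Student) + (1 | TermReport),
                 data = reportsdata,
                 family = binomial,
                 fixef.prior = normal(cov = diag(9, 10)))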
Or, I could ignore the repeated measures entirely and fit an ordinary (non-mixed) logistic regression - but of course this is also crude and throws away a lot of potentially useful information in the data.
Also, in case it is relevant: because there are so many scores to include in the full dataset, I want to later use stepAIC (or a loop, or equivalent) to roughly identify the 'best' model.
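For that later step, my current plan is a manual loop comparing AIC across candidate fixed-effect sets, since I'm not sure stepAIC handles glmer fits directly (a rough sketch on the demo column names; the real search would loop over larger subsets, not single predictors):

# Rough sketch: fit one model per candidate predictor and compare AIC.
candidates <- c("Handwriting", "Speedwriting", "Reading", "Loudness")

aics <- sapply(candidates, function(v) {
  f <- reformulate(c(v, "(1 | Student)", "(1 | TermReport)"),
                   response = "EndCoursePassFail")
  AIC(glmer(f, data = reportsdata, family = binomial))
})

sort(aics)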
Comments from EdM:

[…] Enthusiasm not only for levels 2 through 5 (versus the reference of 1) but also for EnthusiasmNA. Your data coding thus seems to have set up a separate level of Enthusiasm called "NA" for cases without data on Enthusiasm, rather than identifying those cases as having a true NA value (which usually leads to omission of the case from analysis). Some people use that type of coding deliberately as a way to deal with missing data, but it's not a good idea. See the link on multiple imputation in the answer. – EdM Apr 11 '23 at 13:01

[…] "NA". Thus the software assumed that those entries were normal character values that happened to be called "NA" rather than true NA values. Remove the quotes from them and you get what I think you intended. Cases with any true NA values in any of the variables in the model are omitted from analysis, which can lead to problems, particularly when there is a large fraction of NAs. Hence my suggestion to look into multiple imputation. – EdM Apr 14 '23 at 17:41