Grouped/Nested data in logistic regression

Question

Update: Added more meaningful example data

Setting:

In my study, each of three randomly chosen readers (A, B, C) applies three different qualitative scores (score1, score2, score3, each on an ordinal scale) to the same set of 95 cases. For example, score1 is the reader's rating of the severity of a case (1 = not severe, 10 = extremely severe).

Example: Reader A rates case 1 with score1=1, score2=7 and score3=8, Reader B rates case 1 with score1=.., score2=.. and score3=.., and so on.

For some of the cases (cases 92-95 in the example data), they apply the scores multiple times (i.e., at different time points, with no special event between the time points).

Example data:

library("lme4")
library("tidyverse")

# example data
set.seed(1)
df <- data.frame(
  reader = rep(c("A", "B", "C"), each = 100),
  case = rep(c(1:91, 92, 92, 93, 93, 94, 94, 95, 95, 95), 3),
  class = sample(0:1, 300, replace = TRUE, prob = c(2/3, 1/3))
)

# approx. 66% are class 0 and 33% are class 1
set.seed(1)
df <- df %>%
  rowwise() %>%
  mutate(
    score1 = case_when(class == 1 ~ sample(5:10, 1),
                       TRUE ~ sample(1:6, 1)),
    score2 = case_when(class == 1 ~ sample(1:7, 1),
                       TRUE ~ sample(4:10, 1)),
    score3 = sample(1:10, 1)
  )
df <- data.frame(df)
str(df)
#> 'data.frame':    300 obs. of  6 variables:
#>  $ reader: chr  "A" "A" "A" "A" ...
#>  $ case  : num  1 2 3 4 5 6 7 8 9 10 ...
#>  $ class : int  0 0 0 1 0 1 1 0 0 0 ...
#>  $ score1: int  1 1 6 10 3 9 9 1 5 4 ...
#>  $ score2: int  7 9 4 7 4 1 4 8 6 9 ...
#>  $ score3: int  8 2 1 3 2 7 10 4 1 7 ...

Created on 2022-03-12 by the reprex package (v2.0.1)

Aim:

Now I would like to investigate the association between score1-3 (independent variables) and class (dependent variable, 1 or 0). Class can change for cases that are scored repeatedly (i.e., cases 92-95).

I could do this with a simple logistic regression in R:

    glm(class ~ score1 + score2 + score3, 
            family="binomial", data = df)

However, since the data are grouped/nested, I think the p-values I get for the independent variables are too low.

Question:

What analysis is most appropriate to account for the grouping/nesting in my data?

My solutions:

Averaging

Average each of score1-3 across the readers and across the repeated measurements of a case, then perform a simple logistic regression as mentioned above.
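A minimal sketch of the averaging approach, using base R only. It assumes that class is constant within a case, which does not hold in the random example data above, so the data frame `toy` and its score ranges are hypothetical. Collapsing repeated time points in the same way would hide any change in class between them.

```r
# Hypothetical toy data: 3 readers x 60 cases, class fixed per case (assumption)
set.seed(1)
n_case <- 60
case_class <- rbinom(n_case, 1, 1/3)
toy <- expand.grid(case = 1:n_case, reader = c("A", "B", "C"))
toy$class <- case_class[toy$case]
n <- nrow(toy)
toy$score1 <- ifelse(toy$class == 1,
                     sample(3:10, n, replace = TRUE),
                     sample(1:8, n, replace = TRUE))
toy$score2 <- ifelse(toy$class == 1,
                     sample(1:8, n, replace = TRUE),
                     sample(3:10, n, replace = TRUE))
toy$score3 <- sample(1:10, n, replace = TRUE)

# One row per case: mean of each score across the three readers.
# class is identical across readers here, so its mean is still 0 or 1.
avg <- aggregate(cbind(score1, score2, score3, class) ~ case,
                 data = toy, FUN = mean)

# Ordinary logistic regression on the averaged data
fit_avg <- glm(class ~ score1 + score2 + score3,
               family = "binomial", data = avg)
summary(fit_avg)
```

Averaging leaves one observation per case, so the reader-level correlation is removed, but at the price of discarding the between-reader variability.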

Use a mixed-effects model

I found some advice for nested data: "Mixed Effects Model with Nesting" and "What is the difference between fixed effect, random effect and mixed effect models?"

However, since I am new to mixed-effects models, I am not sure which variables should be treated as fixed and which as random:

Only reader as random effect

mod1 <- glmer(class ~ score1 + score2 + score3 + 
    (1|reader), family="binomial", data = df)
#> boundary (singular) fit: see help('isSingular')

Reader and case as random effect

mod2 <- glmer(class ~ score1 + score2 + score3 + 
    (1|reader/case), family="binomial", 
    data = df)
#> boundary (singular) fit: see help('isSingular')

I think I get the warning boundary (singular) fit: see help('isSingular') because the effects are very small in the test data.
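One way to see where such a warning comes from is to inspect the estimated variance components. The sketch below assumes lme4 is installed and builds a small hypothetical data set `toy` in which the reader has no effect on class, so the reader variance may well be estimated at or near zero. With only three readers, that variance is hard to estimate in any case.

```r
library(lme4)

# Hypothetical data: class does not depend on reader at all
set.seed(1)
toy <- data.frame(
  reader = rep(c("A", "B", "C"), each = 100),
  class  = rbinom(300, 1, 1/3),
  score3 = sample(1:10, 300, replace = TRUE)
)

mod <- glmer(class ~ score3 + (1 | reader),
             family = binomial, data = toy)

# TRUE if a variance component is estimated on the boundary (e.g. zero)
isSingular(mod)

# Estimated standard deviation of the reader-level random intercepts
VarCorr(mod)
```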

It would help if you decided on your "structure" first. It is quite evident that there is correlation within the same reader. However, it is unclear to me whether there is also correlation among the scores. For example, you could create a structure where the scores are clustered within readers to account for possible correlations between scores within a reader. If you have a simple "two-level" structure and wish to use logistic regression, consider using a GEE instead, as mixed-model logistic regression is known to exaggerate coefficients compared to GEE logistic regression. – Pashtun, Mar 14 '22 at 19:15
Thank you. The scores 1-3 share some features and may therefore be correlated. – ava, Mar 14 '22 at 23:40
Could you elaborate on what exactly class is and how it depends on the cases? – Pashtun, Mar 15 '22 at 06:43
class is a binary variable that was assigned to each case by means of a test. It was assigned before readers A-C rated the cases and it is not linked to score1-3. – ava, Mar 15 '22 at 11:17

Toby · Answer 1 · 2022-03-02T13:16:53.867

Your example data are not meaningful, so I can only give advice based on the description of your data.

I assume reader1 provides score1 etc. Therefore the scores are nested and you should use a GLMM. You probably can build a model like this:

mod <- glmer(class ~ score1 + score2 + score3 + 
    (1 + score1 + score2 + score3|reader), 
    family="binomial", data = df)

This treats the intercept, score1, score2 and score3 as both fixed and random effects in your model. I'm not sure about your variable case; there might be too few cases per reader.

If you don't know whether your variables should be treated as fixed or random, see this question on CV.

In general, there is no straightforward way to build your model; it depends on the given data. You need to check whether the variables in question are significant in your model.

Edit

Why I think you should not use reader/case:

Mixed models, as I understand them, assume dependence within groups and independence between groups. Think of students in schools: you can group students into classes. All classes in one school are subject to the same influences, but classes from another school are not. In your data I don't see this structure, since each reader creates one score for each case and each case is handled by every reader.

Therefore I'd say you have grouped, but not nested data within groups.
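In lme4 formula terms, the distinction can be sketched as follows. `(1 | reader/case)` is shorthand for `(1 | reader) + (1 | reader:case)`, which treats case 1 of reader A as unrelated to case 1 of reader B. Crossed random intercepts are written `(1 | reader) + (1 | case)` instead. The data frame `toy`, the coefficients and the case-level effect below are hypothetical, and whether the crossed structure suits the real data is an assumption to check.

```r
library(lme4)

# Hypothetical data: every reader rates every case once
set.seed(1)
toy <- expand.grid(case = factor(1:60),
                   reader = factor(c("A", "B", "C")))
case_effect <- rnorm(60, sd = 1.5)  # case-level effect shared by all readers
toy$score1 <- sample(1:10, nrow(toy), replace = TRUE)
eta <- -2 + 0.3 * toy$score1 + case_effect[as.integer(toy$case)]
toy$class <- rbinom(nrow(toy), 1, plogis(eta))

# Crossed: case is a grouping factor of its own, shared across readers
mod_crossed <- glmer(class ~ score1 + (1 | reader) + (1 | case),
                     family = binomial, data = toy)

# Nested shorthand: one random intercept per reader-by-case combination
mod_nested <- glmer(class ~ score1 + (1 | reader / case),
                    family = binomial, data = toy)

# Number of levels of each grouping factor used by the two models
ngrps(mod_crossed)
ngrps(mod_nested)
```

In the nested version each reader-by-case combination is its own group, so with one rating per combination the model has as many groups as observations and cannot use the fact that the same case was seen by all three readers.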

edited Mar 02 '22 at 13:16

answered Feb 23 '22 at 12:30


Thank you for your thoughts. Each reader provides all three scores (score1, score2, score3) for each case. This was ambiguous in the data description (pardon!) and I corrected it. – ava Feb 23 '22 at 12:44
Is your answer still valid with the information from my comment above? – ava Feb 24 '22 at 01:30
In this case you could treat your variables as fixed and random effects within cases: class ~ (1 + score1 + score2 + score3|reader) + (1 + score1 + score2 + score3|case). Depending on your data this might lead to bad estimation due to many parameters/ few data. – Toby Feb 25 '22 at 09:41
Thank you. Why is score1 + score2 + score3 + (1|reader/case) not appropriate? – ava Mar 01 '22 at 18:11
Please see Edits – Toby Mar 02 '22 at 13:17
Thank you for the explanations. class ~ (1 + score1 + score2 + score3|reader) + (1 + score1 + score2 + score3|case) treats only the intercept as a fixed effect, correct? Should the other variables not additionally be treated as fixed effects? – ava Mar 02 '22 at 16:14
