
I have a very straightforward question. THIS is my dataset:

  • I have NAs in both the outcome (Y = "SCORE") and in the continuous predictor (X1_c):
## Missing data for the outcome (Y = "SCORE")
### in each value of the categorical outcome (X2)

data %>% dplyr::select(PARTICIPANT, SCORE, X2) %>% group_by(X2) %>% count(is.na(SCORE))

Output:

  X2    `is.na(SCORE)`     n
  <fct> <lgl>          <int>
1 Test1 FALSE            106
2 Test1 TRUE              12
3 Test2 FALSE            100
4 Test2 TRUE              18

Missing data for continuous predictor (X1_c)

data %>% count(is.na(data$X1_c))

Output:

  `is.na(data$X1_c)`     n
  <lgl>              <int>
1 FALSE                160
2 TRUE                  76

mod1 <- lmer(SCORE ~ X1_c * X2 + (1|PARTICIPANT), data = data)
  • My question is:

1. I know that lmer does listwise deletion of NAs, but should I do it pairwise instead (i.e., delete all participants who lack a value in both Y and X1_c)?

If not, then:

2. Should I delete only the observations with NAs in the outcome Y (SCORE, n = 30 NAs)?
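
For reference, a quick way to compare how many rows each strategy would keep (a minimal sketch, assuming the data frame is named data as above):

## rows lmer keeps under listwise deletion (complete on all model variables)
sum(complete.cases(data[, c("SCORE", "X1_c", "X2")]))

## rows kept if only observations with a missing outcome are dropped
sum(!is.na(data$SCORE))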

I've seen a lot of posts here (like this, this, and this) and elsewhere concerning NAs and regression, but I got lost among the many different answers. Thanks in advance! Any thoughts would be much appreciated.

EDIT: What do the NAs mean here?

Y: scores on two tests/exams (X2). Each participant should have 2 different scores (1 for Test1 and 1 for Test2).

X1_c: score on another exam (which acts as a moderator). Each participant has a single score, repeated across their rows. I want to see if this score interacts with the two other tests (X2).

Then, NAs mean that the participant in question did not take one of the tests. So,

X2    `is.na(SCORE)`     n
  <fct> <lgl>          <int>
1 Test1 FALSE            106
2 Test1 TRUE              12  # did not take test 1 (no score on this)
3 Test2 FALSE            100
4 Test2 TRUE              18  # did not take test 2 (no score on this)

and:

  `is.na(data$X1_c)`     n
  <lgl>              <int>
1 FALSE                160
2 TRUE                  76  # did not take test X1 (the moderator)

The data look like this:

> head(data)
# A tibble: 6 x 5
  PARTICIPANT SCORE  X1_c X2    
        <int> <int> <dbl> <fct>          
1           1    21 -42.9 Test1           
2           1    NA -42.9 Test2 #didn't take Test2, took test 1 n' X1
3           2    21  18.9 Test1           
4           2    20  18.9 Test2           
5           3    15  NA   Test1 #didn't take X1_c, took test1 n' test2         
6           3    17  NA   Test2             

Bonus: I was having some normality issues and I thought that removing NAs could solve that, but it didn't. Any thoughts?
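
(As an aside, normality is usually assessed on the model residuals rather than the raw scores; a minimal check, assuming mod1 from above:)

## Q-Q plot of the conditional residuals of the fitted mixed model
qqnorm(resid(mod1))
qqline(resid(mod1))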

  • The question may be straightforward, but to be answerable you need to explain what the NA values mean. If their presence is related to values of other variables, your results will likely depend on how you treat the NAs. – whuber Sep 09 '22 at 22:56
  • @whuber oh, all right! I'll do that right now and edit the post. Basically, Y is composed of exam scores (2 per student), X1_c is a score on another exam (moderator), and X2 is the type of exam of Y. So missing values mean that the participant did not take that specific test – Larissa Cury Sep 09 '22 at 22:58
  • What we really need is some information about how the NAs might be related to the other variables. Why did some people not take a test? (Maybe they were overqualified? Underqualified? It snowed on the test day? Those lead to three different solutions!) – whuber Sep 09 '22 at 23:12
  • @whuber , ohh, got it. ok, thank you! So, as much as I would like to give you a fancy answer on this, it is pretty simple: they didn't show up on the day of the test, that's it (I don't have their justifications of why they didn't show up, tho) – Larissa Cury Sep 09 '22 at 23:14
  • @whuber , how does that change the solutions you've mentioned? – Larissa Cury Sep 10 '22 at 00:28
  • You might want to investigate material on multiple imputation, of which "MICE" is one popular approach. The introductions to that will explain the different kinds of missingness that are routinely recognized and the potential effects they can have on the results. – whuber Sep 10 '22 at 15:00
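
(For reference, a minimal sketch of the multiple-imputation workflow suggested above, using mice with the model from the question; the settings are illustrative, not from the thread:)

library(mice)
library(lme4)

## impute SCORE and X1_c from the other variables, 20 times
## (in practice, exclude the PARTICIPANT id from the imputation predictors)
imp <- mice(data, m = 20, seed = 1, printFlag = FALSE)

## fit the model on each completed dataset and pool with Rubin's rules
## (pooling mixed models requires the broom.mixed package)
fits <- with(imp, lmer(SCORE ~ X1_c * X2 + (1 | PARTICIPANT)))
summary(pool(fits))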

1 Answer


This is a follow-up question to your first question about this data, and I have a follow-up answer. I take advantage of your comments on the previous question to rename X1 as Proficiency score, Test1 as Written score, and Test2 as Oral score. This makes it easier to put the analysis in context.

The plan is to follow @whuber's advice to do multiple imputation, with two caveats:

  • I use aregImpute from the Hmisc package instead of the mice package. This choice is mostly determined by the fact that I use Gls from the rms package to analyze the data. The Hmisc and rms packages go hand in hand, as they are both written by Frank Harrell. Aside: I encourage you to read his Regression Modeling Strategies course notes.
  • I "pivot" the data to do multiple imputation, so that all three scores of a participant (proficiency, written, oral) are on the same row. I suspect this leads to better imputation.

First we look at the patterns of missingness in the data. Almost one third of the participants are missing their proficiency score.
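
The missingness pattern can be visualized with vis_miss from the naniar package (as in the full code at the end):

data %>% select(-PARTICIPANT) %>% vis_miss()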

I impute the test score data 20 times.

n.impute <- 20

xtrans <- aregImpute(
  ~ Proficiency + Written + Oral,
  data = data, n.impute = n.impute, nk = 4
)

Then I refit the generalized least squares (GLS) model on each of the 20 completed datasets. [The full R code is attached at the end.]
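
Each completed dataset is constructed by filling in one set of imputed values with impute.transcan (excerpted from the full code at the end):

## fill in the i-th set of imputed values
completed.data <- data
imputed.data <- impute.transcan(
  xtrans,
  imputation = i, data = data,
  list.out = TRUE, pr = FALSE, check = FALSE
)
completed.data[names(imputed.data)] <- imputed.data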

The mean structure of the model allows for a smooth nonlinear effect of proficiency on written and oral test scores. I include an interaction with test type, so that the proficiency effect can be different for written and oral scores. The variance structure of the model allows the two scores of each participant to be correlated and lets written and oral scores have different variances.

model <- Gls(
  Score ~ rcs(Proficiency, 4) * Test,
  data = completed.data,
  correlation = corSymm(form = ~ 1 | PARTICIPANT),
  weights = varIdent(form = ~ 1 | Test)
)

Finally I plot the 20 model fits. The result is a bit different from my first analysis. As before, proficiency is predictive of test score only at the two extremes: low proficiency below 40 and high proficiency above 80. But in this analysis we also see a stronger interaction between test type and low/high proficiency. You are the expert in education and you know much more about this particular data: Is there a "story" that can explain the different slopes of written and oral test scores for students with low proficiency below 40 and students with high proficiency above 80?

PS: Some care is probably warranted in interpreting the results, as oral scores are in the range [0, 60] while written scores are in the range [0, 100]. So there might be floor/ceiling effects.


The R code to reproduce the figures and the analysis:

library("naniar")
library("rms")
library("nlme")
library("tidyverse")
iccmlr::theme_set_amazon()  # my personal ggplot theme; not required for the analysis

data <- here::here("data", "588224.csv") %>%
  read_csv(col_types = cols(PARTICIPANT = col_character())) %>%
  rename(X1 = X1_notCentered) %>%
  select(PARTICIPANT, X1, X2, SCORE) %>%
  pivot_wider(
    id_cols = c(PARTICIPANT, X1),
    names_from = X2,
    values_from = SCORE
  ) %>%
  rename(Written = Test1, Oral = Test2, Proficiency = X1)

data %>% select(-PARTICIPANT) %>% vis_miss()

to_long <- function(data) {
  data %>%
    pivot_longer(
      c(Written, Oral),
      names_to = "Test",
      values_to = "Score"
    )
}

data_long <- to_long(data)

dd <- datadist(data_long, q.display = c(0, 1))
options(datadist = "dd")

fit.one.impute <- function(i, xtrans, data) {
  ## fill in the i-th set of imputed values
  completed.data <- data
  imputed.data <- impute.transcan(
    xtrans,
    imputation = i, data = data,
    list.out = TRUE, pr = FALSE, check = FALSE
  )
  completed.data[names(imputed.data)] <- imputed.data

  completed.data <- to_long(completed.data)

  ## refit the GLS model on the completed dataset
  model <- Gls(
    Score ~ rcs(Proficiency, 4) * Test,
    data = completed.data,
    correlation = corSymm(form = ~ 1 | PARTICIPANT),
    weights = varIdent(form = ~ 1 | Test)
  )

  model %>%
    Predict(Proficiency, Test) %>%
    as_tibble()
}

n.impute <- 20

xtrans <- aregImpute(
  ~ Proficiency + Written + Oral,
  data = data, n.impute = n.impute, nk = 4
)

mult.fits <- seq(n.impute) %>%
  map_dfr(~ fit.one.impute(., xtrans, data), .id = "i")

mult.fits %>%
  ggplot(
    aes(Proficiency, yhat, group = interaction(i, Test), color = Test)
  ) +
  geom_point(
    aes(Proficiency, Score, group = Test, color = Test),
    inherit.aes = FALSE,
    data = to_long(data),
    size = 2, shape = 1, stroke = 1
  ) +
  geom_line(alpha = 0.5, show.legend = FALSE) +
  labs(y = "Score")

– dipetkov