Please run this code in order to create a reproducible example:

set.seed(1)
n <- 10
dat <- data.frame(thesis = rep(c('Yes', 'No'), each = n / 2),
                  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2)))

dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)

(X <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)[, -3])

beta <- c(8, 7, 1)  # true values of beta_0, beta_1, beta_2
dat$y <- X %*% beta + rnorm(n)

The data are meant to represent a survey in which the competence of students, denoted y, is measured by a test. The following is recorded for each participant:

  • Has the student started a thesis project already? (Column thesis)
  • If the student has started a thesis project, how satisfied are they with the guidance from their supervisor? (Column satisfaction, NA when the student has not started yet)

I am considering the following model:

  • For people who have not started yet: $$ y = \beta_0 + \epsilon $$

  • For people who have already started: $$ y = \beta_0 + \beta_1 + \beta_2 \text{satisfaction} + \epsilon $$
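Assuming satisfaction is coded as 0 for students without a thesis, and writing $t$ for an indicator of having started one (a symbol introduced here for illustration, not used in the original post), the two cases above can be sketched as a single equation:

$$ y = \beta_0 + \beta_1 t + \beta_2 \, t \cdot \text{satisfaction} + \epsilon, \qquad t = \begin{cases} 1 & \text{thesis started} \\ 0 & \text{otherwise} \end{cases} $$

Setting $t = 0$ recovers the first case and $t = 1$ the second.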

As far as I can see, the offset command can only fix a coefficient at the value 1, and I did not manage to use it even for that purpose in this case.

The following workarounds seem to work:

option 1:

lm.fit(X, dat$y)$coefficients

option 2:

new_dat <- data.frame(X, y = dat$y)
lm(y ~ . - 1, data = new_dat)

option 3, most simple approach:

lm(y ~ thesis + satisfaction_without_NA, data = dat)  # main effects only
lm(y ~ thesis * satisfaction_without_NA, data = dat)  # with interaction
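A minimal sketch of why the two calls in option 3 describe the same model, assuming the zero-coding used above: since satisfaction_without_NA is 0 whenever thesis is 'No', the main-effect column and the interaction column of the design matrix coincide, so the interaction adds no information. The names `X_full`, `same_cols` and `design_rank` are illustrative only.

```r
# Rebuild the example data from the question (same seed and construction).
set.seed(1)
n <- 10
dat <- data.frame(
  thesis = rep(c("Yes", "No"), each = n / 2),
  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2))
)
dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)

# Full design matrix: intercept, thesisYes, satisfaction, and their interaction.
X_full <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)

# With zero-coding, the satisfaction column equals the interaction column.
same_cols <- all(
  X_full[, "satisfaction_without_NA"] ==
    X_full[, "thesisYes:satisfaction_without_NA"]
)
print(same_cols)

# Four columns, but only three are linearly independent.
design_rank <- qr(X_full)$rank
print(design_rank)
```

When a design matrix is rank-deficient in this way, lm() by default reports NA for the redundant coefficient, which is why the formula with `*` yields the same fitted values as the one with `+`.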

So here are my questions:

  • Does the proposed statistical model make sense? Are there any statistical problems / caveats associated with this approach?
  • Does anyone know a more elegant way to specify the model in R?
  • Can anyone point to resources with more information on how such a situation can be handled reasonably?

Many thanks in advance and best greetings,

Sebastian

PS: Here is a link to the Quarto file of this question.

  • See this answer for an approach to this type of situation, where values are by necessity not available for some individuals rather than "missing" in the sense usually implied by NA. The "no-loan/loan" situation there translates directly to "no-thesis/thesis" here. Code satisfaction=0 for "no-thesis" cases instead of NA, do the regression on all cases, and interpret the results as in the linked answer. Try that, then post the result as an answer to your own question (which is OK on this site). – EdM May 25 '23 at 14:42

1 Answer


After following the advice of the above comment, I came to the conclusion that the envisaged model can be fitted by replacing the NAs with zero.

It seems that the coefficients of the simulated model can be accurately estimated by the following model:

lm(y ~ thesis + satisfaction_without_NA, data = dat)

Generating the following output:

Call:
lm(formula = y ~ thesis + satisfaction_without_NA, data = dat)

Coefficients:
            (Intercept)                thesisYes  satisfaction_without_NA
                  8.310                    7.018                    1.053
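As a hedged cross-check, the sketch below rebuilds the simulated data from the question and compares the formula-based fit with the lm.fit workaround. It assumes the same seed and construction as the question. The names `fit_formula`, `fit_matrix` and `agree` are illustrative, and `drop()` is added here only to turn the one-column matrix into a plain vector.

```r
# Rebuild the simulated data from the question.
set.seed(1)
n <- 10
dat <- data.frame(
  thesis = rep(c("Yes", "No"), each = n / 2),
  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2))
)
dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)

# Design matrix without the redundant third column, as in the question.
X <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)[, -3]
beta <- c(8, 7, 1)
dat$y <- drop(X %*% beta + rnorm(n))

# Fit once through the formula interface and once through the raw matrix.
fit_formula <- lm(y ~ thesis + satisfaction_without_NA, data = dat)
fit_matrix <- lm.fit(X, dat$y)

# Coefficient names differ between the two fits, so compare values only.
agree <- isTRUE(all.equal(
  unname(coef(fit_formula)),
  unname(fit_matrix$coefficients)
))
print(agree)

# True values next to the estimates.
print(cbind(true = beta, estimate = unname(coef(fit_formula))))
```

With only ten observations the estimates are noisy, so closeness to the true values in a single simulated data set should not be over-interpreted.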

Thanks a lot for the input and best greetings :-)