Please run this code in order to create a reproducible example:
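set.seed(1)
n <- 10
# First half: thesis started, satisfaction drawn from 1 to 5; second half: not started, NA
dat <- data.frame(
  thesis = rep(c('Yes', 'No'), each = n / 2),
  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2))
)
# Recode NA as 0 so that the non-thesis rows stay in the regression
dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)
# Drop column 3 (main effect of satisfaction_without_NA); keep the intercept,
# thesisYes and the interaction
(X <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)[, -3])
beta <- c(8, 7, 1)
# as.numeric() stores y as a plain vector rather than a one-column matrix
dat$y <- as.numeric(X %*% beta) + rnorm(n)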
The data are meant to represent a survey in which the competence of students, denoted y, is measured by a test. The following is recorded for each participant:
- Has the student already started a thesis project? (Column thesis)
- If the student has started a thesis project, how satisfied is the student with the supervisor's guidance? (Column satisfaction; NA when the student has not started yet)
I am considering the following model:
For people who have not started yet: $$ y = \beta_0 + \epsilon $$
For people who have already started: $$ y = \beta_0 + \beta_1 + \beta_2 \text{satisfaction} + \epsilon $$
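The two cases can also be written as a single equation. This is only a restatement of the model above; the indicator $d$ is notation introduced here (1 if the student has started a thesis, 0 otherwise):

$$ y = \beta_0 + \beta_1 d + \beta_2 \, d \cdot \text{satisfaction} + \epsilon $$

With satisfaction coded as 0 whenever $d = 0$, the product $d \cdot \text{satisfaction}$ equals the recoded column satisfaction_without_NA, so the three columns kept in X above correspond to exactly these three terms.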
As far as I can see, the offset command can only fix a coefficient at the value 1, and I did not manage to apply it even for that purpose in this case.
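For reference, a minimal sketch of how offset() behaves inside an lm formula, using the example data from above. The constant k and the names fit_one, fit_k, fit_zero and fit_off0 are illustrative assumptions, not part of the original question. An offset term enters the linear predictor with a fixed coefficient of 1, multiplying inside offset() fixes the coefficient at another known value, and fixing it at 0 is the same as leaving the term out.

```r
# Rebuild the example data from the question (same seed, same steps).
set.seed(1)
n <- 10
dat <- data.frame(
  thesis = rep(c("Yes", "No"), each = n / 2),
  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2))
)
dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)
X <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)[, -3]
dat$y <- as.numeric(X %*% c(8, 7, 1)) + rnorm(n)

# offset(x) adds x to the linear predictor with its coefficient fixed at 1.
fit_one <- lm(y ~ thesis + offset(satisfaction_without_NA), data = dat)

# Scaling inside offset() fixes the coefficient at another known value.
# k = 0.5 is an arbitrary illustrative choice, not a value from the question.
k <- 0.5
fit_k <- lm(y ~ thesis + offset(k * satisfaction_without_NA), data = dat)

# Fixing a coefficient at 0 is the same as omitting the term.
fit_zero <- lm(y ~ thesis, data = dat)
fit_off0 <- lm(y ~ thesis + offset(0 * satisfaction_without_NA), data = dat)
all.equal(coef(fit_zero), coef(fit_off0))
```

Under this reading, no offset is needed to hold the satisfaction slope at 0 for students without a thesis: the zero coding already removes the term for those rows.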
The following workarounds seem to work:
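# Option 1: fit directly on the design matrix
lm.fit(X, dat$y)$coefficients
# Option 2: wrap the design matrix in a data frame; "- 1" drops lm's own
# intercept because X already contains an intercept column
new_dat <- data.frame(X, y = dat$y)
lm(y ~ . - 1, data = new_dat)
# Option 3, the simplest approach:
lm(y ~ thesis + satisfaction_without_NA, data = dat)
lm(y ~ thesis * satisfaction_without_NA, data = dat)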
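The options can be checked against each other. The sketch below rebuilds the example data and compares the estimates; the names b1, b3 and b_full are introduced here for illustration only. It relies on the zero coding: because satisfaction_without_NA is 0 for every student without a thesis, that column is identical to its interaction with thesisYes, so the full interaction model is rank-deficient and lm reports NA for the aliased term.

```r
# Rebuild the example data from the question (same seed, same steps).
set.seed(1)
n <- 10
dat <- data.frame(
  thesis = rep(c("Yes", "No"), each = n / 2),
  satisfaction = c(sample(5, replace = TRUE, size = n / 2), rep(NA, n / 2))
)
dat$satisfaction_without_NA <- ifelse(is.na(dat$satisfaction), 0, dat$satisfaction)
X <- model.matrix(~ thesis * satisfaction_without_NA, data = dat)[, -3]
dat$y <- as.numeric(X %*% c(8, 7, 1)) + rnorm(n)

# Option 1: explicit design matrix.
b1 <- lm.fit(X, dat$y)$coefficients

# Option 3, first formula: main effects only.
b3 <- coef(lm(y ~ thesis + satisfaction_without_NA, data = dat))

# Same estimates; only the coefficient names differ.
all.equal(unname(b1), unname(b3))

# Option 3, second formula: the interaction column duplicates the main effect,
# so its coefficient is reported as NA and the remaining three match b3.
b_full <- coef(lm(y ~ thesis * satisfaction_without_NA, data = dat))
is.na(b_full[["thesisYes:satisfaction_without_NA"]])
all.equal(unname(b_full[1:3]), unname(b3))
```

So the plain additive formula already fits the intended model, and the interaction formula adds nothing except an NA coefficient.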
So here are my questions:
- Does the proposed statistical model make sense? Are there any statistical problems / caveats associated with this approach?
- Does anyone know a more elegant way to specify the model in R?
- Can anyone point me to resources that explain how such a situation can be handled reasonably?
Many thanks in advance and best greetings,
Sebastian
PS: Here is a link to the Quarto file for this question.
NA. The "no-loan/loan" situation there translates directly to "no-thesis/thesis" here. Code satisfaction=0 for "no-thesis" cases instead of NA, do the regression on all cases, and interpret the results as in the linked answer. Try that, then post the result as an answer to your own question (which is OK on this site). – EdM May 25 '23 at 14:42