I am currently analysing a dataset from a survey that routed participants to different sections depending on their response to a previous question (e.g. people who said they were not farmers skipped the farming-related questions). There is also missing data that simply reflects respondents choosing to skip a question (there is much less of this type of missing data). The structurally missing data accounts for a large share of missing values in each variable (about 20%), and I assume it has led the data to be MNAR.

I am wondering how to appropriately manage these two sets of missing data so I can progress to an ANOVA and then regression.

For context, out of a sample of 22,076, each variable has around 4,500 structurally missing values (about 20%) and around 80 (0.4%) missing values due to respondents choosing not to respond.

Any help would be hugely appreciated!

utobi
Mark D
    Does this answer your question? https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model – J-J-J Jul 27 '23 at 21:42

1 Answer

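IIUC, for both of your types of missing data, the fact that a feature is missing can be deduced, at least partially and with some error, from the remaining features. To the extent that the observed features explain the missingness, this is closer to MAR (missingness depends on observed variables) than to MNAR, which would require the missingness to depend on the unobserved values themselves.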

If you are reluctant to discard the rows with missing features, you might want to think about imputation, i.e. replacing each missing value with a value computed as a function of the non-missing features, where that function has been learned from the data by some ML algorithm. There are various libraries available that can help you with that. This would be an appropriate procedure for the non-structurally missing part of your data.
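As a rough illustration of what "learning the fill-in function from the non-missing features" can look like, here is a minimal sketch using plain NumPy and a least-squares fit. The variable names (`age`, `income`, `score`), the simulated data, and the helper `regression_impute` are all hypothetical and not taken from the question; ready-made imputers (for example scikit-learn's `KNNImputer`) would normally be used instead of hand-rolled code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two fully observed features and one feature with gaps.
n = 200
age = rng.normal(45, 12, n)
income = rng.normal(30, 8, n)
score = 2.0 + 0.05 * age + 0.1 * income + rng.normal(0, 0.5, n)

# Knock out a small fraction of `score` to mimic respondents skipping the item.
missing = rng.random(n) < 0.05
score_obs = score.copy()
score_obs[missing] = np.nan


def regression_impute(target, predictors):
    """Fill NaNs in `target` with predictions from a least-squares fit
    of `target` on `predictors`, estimated on the complete rows only."""
    X = np.column_stack([np.ones(len(target)), predictors])
    is_missing = np.isnan(target)
    coef, *_ = np.linalg.lstsq(X[~is_missing], target[~is_missing], rcond=None)
    filled = target.copy()
    filled[is_missing] = X[is_missing] @ coef
    return filled


score_filled = regression_impute(score_obs, np.column_stack([age, income]))
```

Note that a single deterministic fill like this treats the imputed values as if they had been observed, so downstream standard errors will tend to be too optimistic; with only about 0.4% of values affected, that effect is likely to be small.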

But in the case of the 20% structurally missing values, imputation doesn't seem appropriate. For example, filling in farming-related features for someone who is, say, a teacher doesn't really make sense.

I don't really know much about your goals except that you eventually want to run a regression, but the following might still help.

If you plan to use a data-driven approach for your regression, with more complex models such as random forests, gradient boosting machines, or deep neural networks, it is often sufficient to fill in the same constant value for all the missing instances of a feature; the model will hopefully figure out by itself that this value carries no information.

But if you are planning a model-based approach, using standard models such as linear or generalized linear (mixed-effects) models, you should set those values to zero. In the design matrix, the columns of e.g. the farming-related features will then contain zeros for all rows belonging to non-farmers, so those features make no contribution for non-farmers. Thus, zero-filling structurally missing values is equivalent to removing the effect of the feature for those rows.
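The zero-filling idea can be checked numerically with a small sketch. Everything here is an assumption for illustration: the data are simulated, `acres` is a hypothetical farming-only feature, and the extra `is_farmer` indicator column is an addition that the answer does not mention explicitly; it is included so that farmers and non-farmers are not forced to share one intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical filter question: about 80% farmers, the rest skip the farming items.
is_farmer = rng.random(n) < 0.8

# `acres` is only asked of farmers; NaN marks structurally missing values.
acres = np.where(is_farmer, rng.gamma(2.0, 10.0, n), np.nan)

# Simulated outcome: `acres` only matters for farmers.
y = 1.0 + 0.5 * is_farmer + 0.03 * np.nan_to_num(acres, nan=0.0) + rng.normal(0, 1, n)

# Design matrix with zero-filled `acres` plus the farmer indicator (an assumption).
acres_zero = np.nan_to_num(acres, nan=0.0)
X = np.column_stack([np.ones(n), is_farmer.astype(float), acres_zero])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# For comparison: fit the `acres` slope on farmers only.
X_farmers = np.column_stack([np.ones(is_farmer.sum()), acres[is_farmer]])
beta_farmers, *_ = np.linalg.lstsq(X_farmers, y[is_farmer], rcond=None)

print("slope from zero-filled full model:", beta[2])
print("slope from farmers-only model:   ", beta_farmers[1])
```

With the indicator included, the two slopes agree up to floating-point error: the non-farmer rows have zeros in the `acres` column and therefore only inform the intercept, which is the sense in which zero-filling removes the feature's effect for those rows.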

frank