How to handle missing information simultaneously in training and testing set?

Question

I would like to know how to handle missing data in predictive analysis:

In my case, it has been decided not to omit records with missing information. However, certain predictive models, such as logistic regression and random forest, cannot handle missing data, so I have decided to impute the data before modelling.

As in any predictive analysis, I have a training set and a test set. My confusion is this: once I have imputed the training data, how should I handle test data that may also contain missing information?

I don't understand what you are saying. It is confusing. Are you using an imputation method or not? Please give this some clarification. — Michael R. Chernick, Mar 28 '17 at 14:38
@MichaelChernick Thanks for the reply. The imputation method does not matter I think. My question is: how do we handle the missing information simultaneously in a training and testing set? — user95902, Mar 28 '17 at 14:46
Tree-based models can handle missing values! If you choose to impute missing values, then impute the training data and the testing data separately. — wolfe, Mar 28 '17 at 14:56
@7-th Thank you! That's what I want to ask: is the model from an imputed training set still valid for a separately imputed test set? — user95902, Mar 28 '17 at 14:59
Yes, it's valid, if the imputation method is the same. A tree-based model does not impute the missing values; it uses the proportion of missing values when calculating the feature's information gain. — wolfe, Mar 28 '17 at 15:05
Imputed values always introduce some additional error, so I recommend the tree-based method. — wolfe, Mar 28 '17 at 15:14
Whichever imputation method you use, do you have reservations about using it on both training and test sets? — rolando2, Mar 28 '17 at 15:21
@7-th Yeah, true, but besides tree models I would also like to have linear models, and possibly random forest, the tree that cannot handle missing... — user95902, Mar 28 '17 at 15:28
@rolando2 Thanks! My concern was to keep the test set intact, and I didn't know whether imputing the data would lead to over-fitting on the test set. — user95902, Mar 28 '17 at 15:30

score -1 · Answer 1 · answered Mar 28 '17 at 15:00

From my experience, there are two options:

Discard instances that have missing values (do this for both the training and the test set).
Impute the missing values with the mean value (taking the mean of the consolidated data, i.e. both training and test data).
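
The two options can be sketched in a few lines. This is only a minimal illustration in plain Python, not a recommended pipeline: the toy rows, the use of None to mark a missing value, and the helper names drop_incomplete, column_means and impute are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical toy data: each row is [age, income]; None marks a missing value.
train = [[25.0, 50.0], [None, 60.0], [35.0, None], [45.0, 70.0]]
test = [[30.0, None], [None, 80.0], [40.0, 65.0]]


def drop_incomplete(rows):
    """Option 1: keep only the rows that have no missing values."""
    return [row for row in rows if all(v is not None for v in row)]


def column_means(rows):
    """Mean of the observed (non-missing) values in each column."""
    n_cols = len(rows[0])
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]


def impute(rows, fill_values):
    """Option 2: replace each missing value with the fill value of its column."""
    return [
        [fill_values[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Option 1, applied to both sets.
train_complete = drop_incomplete(train)
test_complete = drop_incomplete(test)

# Option 2 as described above: means taken from the pooled training and test data.
pooled_means = column_means(train + test)
train_imputed = impute(train, pooled_means)
test_imputed = impute(test, pooled_means)
```

A commonly used variant computes the fill values from the training set only, `column_means(train)`, and reuses them for the test set, so that the test data has no influence on the preprocessing.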

There are more methods which you can find in this paper

Also, for both options above, you may consider excluding from your model any data fields with a high frequency of missing values.
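
Dropping sparsely populated fields might look like the sketch below. The 0.5 cutoff, the toy rows, and the helper names are assumptions for illustration; the answer does not say how high "high" is. Choosing the columns on the training set and applying the same selection to the test set is also an assumption, made so that both sets end up with the same fields.

```python
# Hypothetical toy data with three fields; None marks a missing value.
train = [
    [1.0, 2.0, None],
    [None, 3.0, None],
    [4.0, 5.0, None],
    [6.0, None, 9.0],
]
test = [[7.0, None, 1.0], [8.0, 2.0, None]]


def missing_fraction(rows):
    """Fraction of missing values in each column."""
    n_cols = len(rows[0])
    return [sum(r[j] is None for r in rows) / len(rows) for j in range(n_cols)]


def columns_to_keep(rows, max_missing=0.5):
    """Indices of columns whose missing fraction does not exceed max_missing."""
    return [j for j, frac in enumerate(missing_fraction(rows)) if frac <= max_missing]


def select_columns(rows, keep):
    """Keep only the listed columns in every row."""
    return [[row[j] for j in keep] for row in rows]


# Decide which fields to keep from the training set, then apply to both sets.
keep = columns_to_keep(train)
train_reduced = select_columns(train, keep)
test_reduced = select_columns(test, keep)
```

The remaining missing values in the reduced sets would still need one of the two options above.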

Duy Bui

Imputation using the mean has long been supplanted by other more nimble methods. I like your last point though. – rolando2 Mar 28 '17 at 15:23
@rolando2: Can you explain or list the "more nimble methods"? – stackoverflowuser2010 Aug 01 '17 at 22:16
