I have a very specific situation involving missing data in a regression tree (actually part of a random forest) which is not covered by the most popular related questions:
How do decision tree learning algorithms deal with missing values (under the hood)
Why doesn't Random Forest handle missing values in predictors?
Here is a contrived example where the response Y is the expenditure of a customer at their next transaction. The average_spend variable, which measures the average spent by the customer over their previous transactions, is missing values when the customer has never shopped with us before. An example of the data would be:
    Y    prev_customer  average_spend  sale_method  gender
1   10   FALSE          NA             offline      ...
2   100  TRUE           100            online       ...
3   10   FALSE          NA             offline      ...
4   100  TRUE           100            online       ...
I would like a splitting rule on average_spend that allocates all rows with missing values to one of the two child nodes. Intuitively, I feel the tree should be able to handle the distinction between first-time and returning customers: if it does not split first on whether a customer is a first-time customer, that split can still be useful later. Given this, would it be OK to impute an arbitrary value for average_spend?
However, I have not been able to find such a rule in any of the references I have seen.
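To make the rule being asked about concrete, here is a minimal sketch in plain Python, not taken from any particular library. It searches thresholds on one feature and, at each threshold, tries sending every missing row to the left child and then to the right child, keeping whichever gives the lowest sum of squared errors. The function names `sse` and `best_split_with_missing` are hypothetical, missing values are represented as `None`, and the data are the four rows from the example above.

```python
import math


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_missing(x, y):
    """Find the split `x < threshold` with the lowest total SSE.

    For every threshold, all missing rows (None) are tried in the left
    child and then in the right child. Returns a tuple
    (threshold, missing_goes_left, total_sse), or None if no valid split exists.
    """
    observed = sorted({v for v in x if v is not None})
    # Candidate thresholds: midpoints between consecutive observed values,
    # plus +inf so that "observed vs. missing" is itself a candidate split.
    thresholds = [(a + b) / 2 for a, b in zip(observed, observed[1:])]
    thresholds.append(math.inf)

    best = None
    for t in thresholds:
        for missing_left in (True, False):
            left, right = [], []
            for xi, yi in zip(x, y):
                goes_left = missing_left if xi is None else xi < t
                (left if goes_left else right).append(yi)
            if not left or not right:
                continue  # skip candidates that leave one child empty
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (t, missing_left, total)
    return best


# The four rows from the example: average_spend is missing for new customers.
average_spend = [None, 100, None, 100]
Y = [10, 100, 10, 100]
print(best_split_with_missing(average_spend, Y))  # (inf, False, 0.0)
```

On this data the only valid candidate separates the missing rows from the observed ones, which is the same partition as splitting on prev_customer. With more varied data, the missing rows can instead be grouped with whichever side of an ordinary threshold they resemble most.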
Sex: it may still be useful to split on Sex, without dropping NA values, and then split on previous spend. – Alex May 10 '18 at 02:21
"So when the tree decides to do some split then if you impute by -Inf, one of the sides (average_spend < splitValue) will contain all NA rows." Correct, and if this is not the best split, it can use the +Inf variable to put NA rows whenever average_spend > splitvalue. I am not just trying to separate out NA's from no NA's. I want NA's to be included in one branch, instead of being discarded. – Alex May 10 '18 at 23:29
average_spend on, say, the condition < 1 or >= 1, then if a row has a nan value it satisfies both conditions vacuously, so that row should "survive" in both branches. Its other columns will have to decide its fate. Maybe a naive way to test how this works is to replace every row with a nan in that column with two rows, both identical, but the first substituting the nan for +inf and the second for -inf. – travelingbones Mar 16 '22 at 20:13
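The two ideas raised in the comments can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names `impute_both_ends` and `duplicate_missing_rows` are hypothetical, missing values are represented as `None`, and the threshold of 50 is made up for the demonstration. Some tree libraries reject infinite inputs, in which case a finite value outside the observed range would play the same role.

```python
import math


def impute_both_ends(values):
    """Return two copies of a column: missing -> -inf in one, +inf in the other."""
    low = [-math.inf if v is None else v for v in values]
    high = [math.inf if v is None else v for v in values]
    return low, high


def duplicate_missing_rows(rows, column):
    """Replace each row missing `column` with two copies: one -inf, one +inf."""
    out = []
    for row in rows:
        if row[column] is None:
            out.append({**row, column: -math.inf})
            out.append({**row, column: math.inf})
        else:
            out.append(dict(row))
    return out


average_spend = [None, 100, None, 100]
low, high = impute_both_ends(average_spend)

split_value = 50  # hypothetical threshold chosen only for illustration
# Row indices that satisfy `value < split_value` in each imputed copy.
left_low = [i for i, v in enumerate(low) if v < split_value]    # missing rows go left
left_high = [i for i, v in enumerate(high) if v < split_value]  # missing rows go right
print(left_low, left_high)  # [0, 2] []

rows = [
    {"Y": 10, "average_spend": None},
    {"Y": 100, "average_spend": 100},
]
print(len(duplicate_missing_rows(rows, "average_spend")))  # 3
```

With both imputed columns available, a tree can place the missing rows on either side of a threshold by choosing which copy to split on. The row-duplication variant instead lets each missing row appear in both branches, at the cost of giving those rows double weight.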