I have a dataset in which one variable has NAs in many observations (almost a third of them). Actually, it is a numerical variable in which some values are zero (these zeroes mean 'no data').
My options:
- Try to impute: Due to the characteristics of the data, I suspect imputation would be inaccurate and would create too much noise, even with mice or missForest.
- Remove observation: A third of my dataset? No way!
- Remove variable: It's a very important variable. I think I can get a lot of information from it.
- Convert zeros to NA and let the algorithm handle them.
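
For the last option, the conversion itself would be something like the sketch below. The data frame `df` and the column `x` are placeholder names for illustration, not my real data.

```r
# Placeholder data: zeros in x stand for 'no data'
df <- data.frame(
  x = c(3.2, 0, 5.1, 0, 7.4),
  y = c(1, 0, 1, 1, 0)
)

# Recode the zeros as NA so they are not treated as real measurements
df$x[df$x == 0] <- NA

# Count how many values are now missing
sum(is.na(df$x))
```
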
So I'm looking for an algorithm that can handle those NAs, so that I don't have to remove that variable, lose a lot of observations, or introduce noise. My dataset has 80000 observations, and some categorical variables have 1000 or even more distinct values.
After reading several different sources, I'm still not sure whether these two families of algorithms (Random Forest, Gradient Boosting) can handle NAs without problems. My understanding was that they can, so I could leave the NAs in: the algorithm would skip them where necessary but still use the non-NA values of that variable for training the model.
With caret, I have read that I only need to set na.action = na.pass and apply no preprocessing (do not specify preProcess; leave it at its default value, NULL).
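
To make that concrete, this is roughly the call I have in mind. It is only a sketch on synthetic data: the data frame, the column names and the single-row tuning grid are all made up for illustration, and it assumes the caret and xgboost packages are installed.

```r
library(caret)  # assumes caret and xgboost are installed

set.seed(42)
n <- 200

# Synthetic stand-in for my data; every name here is hypothetical
df <- data.frame(
  x1 = rnorm(n),
  x2 = runif(n, 1, 10)
)
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(n)

# Make roughly a third of x2 missing, as in my real variable
df$x2[sample(n, n %/% 3)] <- NA

fit <- train(
  y ~ ., data = df,
  method = "xgbTree",
  na.action = na.pass,  # pass rows with NA through instead of failing or dropping them
  # preProcess is not specified, so it stays at its default (NULL)
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = expand.grid(
    nrounds = 50, max_depth = 3, eta = 0.1, gamma = 0,
    colsample_bytree = 1, min_child_weight = 1, subsample = 1
  )
)
```

For the random forest case I would change method to "ranger" and adjust the tuning grid accordingly.
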
Would Random Forest (or ranger) work fine this way? And Gradient Boosting (or XGBoost)?
Thanks :)