Let's start with a description of the website-visit data I am analysing:
- 6M rows
- The dependent variable `quotation` is binary and takes values `0` and `1`, with 1% of rows equal to 1
- The other 3 variables are `temperature`, `humidity` and `minute` of the day
The objective is to identify quotation trends based on the weather in order to optimize communication campaigns, not to determine whether a given visit will lead to a quotation.
To avoid overfitting problems due to the large dataset, I decided to cross-validate my tree models to choose the right one.
My questions:
Because `quotation = 1` is so rare, even the best leaf node only reaches a 5% quotation rate on the training sample. Every leaf therefore has 0 as its majority class, so if I call `predict()` on my testing sample I get only 0 for all nodes.
- Is there a way with the `party` package to attribute the corresponding node to each value of the testing sample?
- Is that the right method to evaluate my different models, since `predict()` doesn't seem to work for me (0 for all observations)?
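
To make the first question concrete, here is a minimal sketch of what assigning test rows to nodes could look like. It assumes the tree is fitted with `ctree()` from `party` and uses the package's `where()` method with `newdata`. The data frame `visits`, its simulated values, and the 50/50 split are hypothetical stand-ins for illustration, not the real 6M-row data.

```r
library(party)  # assumes the party package is installed

set.seed(1)
n <- 20000

# Hypothetical synthetic stand-in for the real visit data
visits <- data.frame(
  temperature = runif(n, -5, 35),
  humidity    = runif(n, 20, 100),
  minute      = sample(0:1439, n, replace = TRUE)
)
# Rare positive class: quotation probability rises slightly with temperature
p <- plogis(-5 + 0.05 * visits$temperature)
visits$quotation <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Simple train/test split (assumption: half and half)
train_idx <- sample(n, n / 2)
train <- visits[train_idx, ]
test  <- visits[-train_idx, ]

fit <- ctree(quotation ~ temperature + humidity + minute, data = train)

# Terminal node id for every row of the testing sample
test_node <- where(fit, newdata = test)

# Observed quotation rate per terminal node, training vs testing sample
train_rate <- tapply(train$quotation == "1", where(fit), mean)
test_rate  <- tapply(test$quotation == "1", test_node, mean)

print(train_rate)
print(test_rate)
```

Comparing `train_rate` with `test_rate` node by node shows whether the per-node quotation rates hold up out of sample, which does not depend on the 0/1 class that `predict()` returns.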
I went there, but every suggestion is based on `predict()`, which I feel is of no help in my case...
party package, that would be very helpful. In the meantime, I'll try to undersample, a method found here: http://stats.stackexchange.com/questions/28029/training-a-decision-tree-against-unbalanced-data – Yohan Obadia Sep 11 '15 at 15:29