I have training data in which each feature is a probability from a different source. All of the features are probabilities (between 0 and 1, obviously). This is a binary classification problem.
Note that I can use just one of the features on its own, simply by selecting a cutoff. When I use just one of the features, it sometimes performs better than when I use all of the features (all of the probabilities). So, I decided to add a feature: I compute the performance of each feature alone, and then create a new feature that is a weighted sum of the probabilities in each row, weighting each feature by its standalone performance.
Example of original data:
col A    col B    col C
0.3      0.2      0.13
0.4      0.1      0.5
...with col A alone having an AUC score of 0.6, col B 0.5, and col C 0.55. Note that the sum of these is 0.6 + 0.5 + 0.55 = 1.65. So, I add this feature:
col A    col B    col C    new_feature
0.3      0.2      0.13     (0.3*0.6 + 0.2*0.5 + 0.13*0.55)/1.65
0.4      0.1      0.5      (0.4*0.6 + 0.1*0.5 + 0.5*0.55)/1.65
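In case it helps to see it concretely, here is a minimal sketch of that computation in NumPy, using the two example rows above (the variable names are just for illustration):

    import numpy as np

    # The two example rows: each column is a probability from a different source.
    X = np.array([
        [0.3, 0.2, 0.13],
        [0.4, 0.1, 0.5],
    ])

    # Standalone AUC of each column (col A, col B, col C); these sum to 1.65.
    aucs = np.array([0.6, 0.5, 0.55])

    # Row-wise weighted average of the probabilities, weighted by each column's
    # standalone AUC and normalized by the sum of the AUCs.
    new_feature = X @ aucs / aucs.sum()

    # Append the new column to the original feature matrix.
    X_augmented = np.column_stack([X, new_feature])
    print(X_augmented)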
When I add this feature, the performance degrades instead of improving. I tried Logistic Regression, Random Forest, and other similar classifiers. I cannot think of a good reason why. How could adding this feature decrease the performance of a model trained on all of the features (including the one that I created)?
For an unregularized GLM, I guess it does not change the predictions at all if you append a linear combination. For ridge regression (squared penalty on coefficient size), I guess you allow the high-AUC variables to be regularized less, as the coefficient is distributed across the collinear columns.
– Soren Havelund Welling Jan 05 '16 at 19:54

As for "interaction" in the formal definition, I think that only applies if the operation is multiplicative. Here, I'm computing a weighted sum, which is additive, so these aren't "interactions" per se. – makansij Jan 06 '16 at 15:48

I don't see why it would be "dominant", since it is just a weighted average of probabilities. Is my intuition wrong, or is there a formal definition of "dominant" that I'm missing? – makansij Jan 06 '16 at 15:54
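To illustrate the point from the first comment (appending an exact linear combination of existing columns should leave an unregularized GLM's fitted probabilities essentially unchanged), here is a rough sketch using synthetic data and scikit-learn's LogisticRegression with a very large C to approximate no regularization; the data-generating weights and the 0.85 threshold are made up purely for the illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic data: three probability-like features and a binary target.
    X = rng.uniform(0, 1, size=(500, 3))
    y = (X @ np.array([0.6, 0.5, 0.55]) + rng.normal(0, 0.2, 500) > 0.85).astype(int)

    # Append a fixed linear combination of the existing columns (the "new feature").
    w = np.array([0.6, 0.5, 0.55]) / 1.65
    X_aug = np.column_stack([X, X @ w])

    # Effectively unregularized logistic regression: C is very large, so the penalty
    # is negligible and the exactly collinear extra column cannot change the fit.
    clf_base = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    clf_aug = LogisticRegression(C=1e6, max_iter=5000).fit(X_aug, y)

    p_base = clf_base.predict_proba(X)[:, 1]
    p_aug = clf_aug.predict_proba(X_aug)[:, 1]
    print(np.max(np.abs(p_base - p_aug)))  # expected to be essentially zero

With a regularized or tree-based model, by contrast, the extra collinear column can shift how the penalty or the feature sampling is distributed, which is one plausible route to the degradation described in the question.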