I have training data in which each feature is a probability from a different source. All of the features are probabilities (between 0 and 1, obviously). This is a binary classification problem.
Note that I can use just one of the features on its own, simply by selecting a cutoff. When I use just one of the features, it sometimes performs better than when I use all of the features (all of the probabilities). So, I decided to add a feature: I compute the performance of each feature alone, and then create a new feature that is a weighted sum of the probabilities in each row, weighting each feature by its standalone performance.
Example of original data:
col A    col B    col C
0.3      0.2      0.13
0.4      0.1      0.5
...with col A alone having an AUC score of 0.6, col B 0.5, and col C 0.55. Note that the sum of these is 0.6 + 0.5 + 0.55 = 1.65. So, I add this feature:
col A    col B    col C    new_feature
0.3      0.2      0.13     (0.3*0.6 + 0.2*0.5 + 0.13*0.55)/1.65
0.4      0.1      0.5      (0.4*0.6 + 0.1*0.5 + 0.5*0.55)/1.65
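In case it helps to see it concretely, here is a minimal sketch of that computation in NumPy, using the two example rows above (the variable names are just for illustration):

    import numpy as np

    # The two example rows: each column is a probability from a different source.
    X = np.array([
        [0.3, 0.2, 0.13],
        [0.4, 0.1, 0.5],
    ])

    # Standalone AUC of each column (col A, col B, col C); these sum to 1.65.
    aucs = np.array([0.6, 0.5, 0.55])

    # Row-wise weighted average of the probabilities, weighted by each column's
    # standalone AUC and normalized by the sum of the AUCs.
    new_feature = X @ aucs / aucs.sum()

    # Append the new column to the original feature matrix.
    X_augmented = np.column_stack([X, new_feature])
    print(X_augmented)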
When I add this feature, the performance degrades instead of improving. I tried Logistic Regression, Random Forest, and other similar classifiers. I cannot think of a good reason why. How could adding this feature decrease the performance of a model trained on all of the features (including the one that I created)?
For an unregularized GLM, I guess it does not change the predictions at all if you append a linear combination. For ridge regression (squared penalty on coefficient size), I guess you allow the high-AUC variables to be regularized less, as the coefficient is distributed across the collinear columns.
– Soren Havelund Welling Jan 05 '16 at 19:54

As for "interaction" in the formal definition, I think that only applies if the operation is multiplicative. Here, I'm computing a weighted sum, which is additive, so these aren't "interactions" per se. – makansij Jan 06 '16 at 15:48

I don't see why it would be "dominant", since it is just a weighted average of probabilities. Is my intuition wrong, or is there a formal definition of "dominant" that I'm missing? – makansij Jan 06 '16 at 15:54
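To illustrate the point from the first comment (appending an exact linear combination of existing columns should leave an unregularized GLM's fitted probabilities essentially unchanged), here is a rough sketch using synthetic data and scikit-learn's LogisticRegression with a very large C to approximate no regularization; the data-generating weights and the 0.85 threshold are made up purely for the illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic data: three probability-like features and a binary target.
    X = rng.uniform(0, 1, size=(500, 3))
    y = (X @ np.array([0.6, 0.5, 0.55]) + rng.normal(0, 0.2, 500) > 0.85).astype(int)

    # Append a fixed linear combination of the existing columns (the "new feature").
    w = np.array([0.6, 0.5, 0.55]) / 1.65
    X_aug = np.column_stack([X, X @ w])

    # Effectively unregularized logistic regression: C is very large, so the penalty
    # is negligible and the exactly collinear extra column cannot change the fit.
    clf_base = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    clf_aug = LogisticRegression(C=1e6, max_iter=5000).fit(X_aug, y)

    p_base = clf_base.predict_proba(X)[:, 1]
    p_aug = clf_aug.predict_proba(X_aug)[:, 1]
    print(np.max(np.abs(p_base - p_aug)))  # expected to be essentially zero

With a regularized or tree-based model, by contrast, the extra collinear column can shift how the penalty or the feature sampling is distributed, which is one plausible route to the degradation described in the question.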