3

If I were building a document classifier with scikit-learn, I could do so easily with a straightforward Naive Bayes (NB) classifier such as MultinomialNB.

However, I also have numeric features related to these documents, which may be as important as, or more important than, the text itself. Let's suppose that a Random Forest (RF) classifier performs well on these numeric features.

The straightforward ensembling approach would be to train each classifier on the partial information available to it and average their predictions, weighted by confidence. However, I believe it would be better to give the RF all of the information, as follows:

I could create a pipeline that first trains NB, then feeds its class probability predictions as a feature into the RF classifier, along with all the numeric features that NB never saw. However, the RF classifier would then overfit to the NB prediction feature, given that NB has already seen the class labels at training time.

What is the canonical approach to avoiding overfitting in this two-classifier pipeline? My hunch is that I should add some amount of noise to the NB classifier's predicted-probability feature before it is fed into the RF at training time.
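For concreteness, here is roughly the pipeline I have in mind. This is only a sketch; `docs`, `X_num`, and `y` are placeholders for my documents, numeric features, and labels:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier

    # docs (raw documents), X_num (numeric feature array), y (labels) are placeholders
    X_text = CountVectorizer().fit_transform(docs)

    # Step 1: fit NB on the text and get its class-probability predictions
    nb = MultinomialNB().fit(X_text, y)
    nb_proba = nb.predict_proba(X_text)

    # Step 2: feed those probabilities, plus the numeric features, to the RF
    X_rf = np.hstack([nb_proba, X_num])
    rf = RandomForestClassifier(n_estimators=500).fit(X_rf, y)
    # Concern: nb_proba was computed on the same data NB was trained on,
    # so the RF is likely to overfit to that feature.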

2 Answers

4

As you correctly identified, this pipeline introduces overfitting; the overfitting is due to data leakage, which renders this particular modelling pipeline, in its current form, unusable.

Let's see why the data leakage happens: when you feed the class probability predictions from the NB model as a feature into the RF model, you pass information about the predicted labels to the RF model, even though the NB probabilities were fitted without using features available to the RF. This information should be unavailable to the RF at training and prediction time. The pipeline is therefore flawed, and no amount of noise will fix it. Unfortunately, references on data leakage are sparse; "Leakage in Data Mining: Formulation, Detection, and Avoidance" by Kaufman et al. and "Ten Supplementary Analyses to Improve E-commerce Web Sites" by Kohavi and Parekh are nice reads. In general, the works of S. Rosset and R. Kohavi touch upon this matter more seriously.

I think you have essentially worked out a basic model-stacking procedure, but unfortunately you have mixed up the dataset split. Model stacking is a staple technique for most data science competitors because it allows combining different models; Kagglers have some very good tutorials, e.g. here. A canonical reference on stacking is Wolpert's "Stacked generalization"; "Combining Estimates in Regression and Classification" by LeBlanc and Tibshirani is pretty important too. In short, the class probability predictions from a base classifier (the NB model here) are treated as meta-features, allowing the top classifier (the RF model here) to use them directly as highly relevant features. Importantly, these meta-features have to be derived without using information (in terms of predictor variables $X$ as well as target values/labels $y$) that will be used in the fitting procedure of the top classifier. The simplest way of avoiding data leakage is to split the original training dataset into two training sets: one used by the base classifier and one by the top classifier. The overall performance is estimated, as always, on a separate test dataset.
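A minimal sketch of that split-based scheme (variable names such as `docs`, `X_num`, and `y` are placeholders, and in practice you would tune both models):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier

    # docs (raw documents), X_num (numeric feature array), y (label array) are placeholders
    idx_base, idx_top = train_test_split(np.arange(len(y)), test_size=0.5, random_state=0)

    # Base classifier: NB fitted only on the "base" half of the training data
    vec = CountVectorizer()
    nb = MultinomialNB().fit(vec.fit_transform([docs[i] for i in idx_base]), y[idx_base])

    # Meta-features for the "top" half: NB probabilities for documents NB never saw
    meta = nb.predict_proba(vec.transform([docs[i] for i in idx_top]))

    # Top classifier: RF fitted on the meta-features plus the numeric features
    rf = RandomForestClassifier(n_estimators=500)
    rf.fit(np.hstack([meta, X_num[idx_top]]), y[idx_top])

    # Prediction for new data: text -> NB probabilities -> stacked with numerics -> RF
    def predict(new_docs, new_X_num):
        new_meta = nb.predict_proba(vec.transform(new_docs))
        return rf.predict(np.hstack([new_meta, new_X_num]))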

usεr11852
  • I disagree. In particular, you write "you pass information about the predicted labels to the RF model. This information should be unavailable to the RF at the time of prediction or training." If you set up the cross-validation correctly, then at train-time, the RF will only be seeing the naive Bayes model's predictions for the same cases that it's training on. At test-time, it will only see predictions for cases that the Bayes model didn't see the true value of, either. – Kodiologist Apr 23 '17 at 05:13
  • @Kodiologist: Apologies, I probably explained this inadequately. What I propose is the same thing you comment should be done (set up the CV correctly in the first instance by completely separating the training sets of the two classifiers). Please check what data leakage in stacking is. This question is a typical example; a similar but less severe example is discussed here (about 2/3 down). – usεr11852 Apr 23 '17 at 12:14
  • That isn't what I said should be done. I still don't see how my proposal would cause the sort of leakage effect described in the article you've linked, either. It can't be that "the meta features in {fold2, fold3, fold4, fold5} are dependent on the target values in fold1" if you're computing the "meta features" without using fold1. – Kodiologist Apr 23 '17 at 14:32
  • You haven't proposed anything specific to folds, so I cannot fully evaluate your proposal in terms of {fold1, fold2, etc.}; you said it (the RF) will only see predictions for cases that the Bayes model didn't see the true value of, which is exactly what I meant too, so I don't understand why you disagree (and therefore I think you have misinterpreted my post). – usεr11852 Apr 23 '17 at 14:53
  • A few more details: if we train the base classifier with, say, 3-fold CV using all the data and produce the meta-features/predictions $G_A$ for fold $A$ using folds $B, C$ only, we indeed do not use fold-$A$ information for the meta-features $G_A$, so there is no data leakage yet. Nevertheless, if we now pass these $G = \{G_A, G_B, G_C\}$ to the top classifier, we leak information from the original targets. The top classifier is trained with information that should have been unavailable to it, because $G$ was constructed knowing the true labels that the top classifier tries to predict. – usεr11852 Apr 23 '17 at 15:15
  • Right, I definitely see your argument there. Maybe I don't understand something else. What I propose, to be clear, is: when testing on fold1, train the naive Bayes model on fold2 and fold3, then use its outputs (along with the other features) to train the RF on fold2 and fold3. Then use the naive-Bayes and RF parameters to make predictions for fold1 (with neither model getting to peek at the real response values for fold1). I think that's different from the approach you're advocating, but maybe it's the same? – Kodiologist Apr 23 '17 at 15:48
  • 2
    @Kodiologist Let D be a dumb model that simply memorizes the training data completely in a single, deep decision tree. Suppose that instead of stacking RF on NB, we are stacking RF on D. Since D perfectly learns all targets in the training data, RF will find D to be a perfectly informative feature; RF has now overfit to the output of D, and the stacked model will perform miserably. – Brian Bien Apr 23 '17 at 17:13
  • @Kodio: For simplicity I advocated separating the two training sets at the first instance; clearly it is the most data-consuming option, but it is the most straightforward, so it is not the same as what you propose. (Essentially we have folds $A, \dots, F$ and use folds $A, B, C$ for the base classifier and the rest for the top classifier; folds $D, E, F$ are augmented by meta-features predicted for $D, E, F$.) OK! I see what you describe: I think there is still leakage, though. In addition, we would be using the predictions of an NB trained on folds 2 & 3 as meta-features for those same folds; that would be overfitting too. – usεr11852 Apr 23 '17 at 17:19
  • 1
    @BrianBien Ah, now I understand. That's a good point. The CV will still reflect the performance of the model accurately, but the performance will likely be dragged down by this internal overfitting. – Kodiologist Apr 23 '17 at 18:54
  • @Kodiologist yep! – Brian Bien Apr 23 '17 at 20:05
1

Why use an ensemble at all in this case? A random forest can cope fine with dichotomous features, so just give the random forest all the features.
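For instance (just a sketch, with `docs`, `X_num`, and `y` as placeholders for the raw documents, numeric features, and labels), you could binarize the word counts and concatenate them with the numeric features before fitting the forest:

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # docs, X_num, and y are placeholders
    X_text = CountVectorizer(binary=True).fit_transform(docs)  # dichotomous word indicators
    X_all = hstack([X_text, csr_matrix(X_num)]).tocsr()        # text and numeric features together
    rf = RandomForestClassifier(n_estimators=500).fit(X_all, y)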

Kodiologist
  • I'd like to evaluate the performance of chaining models vs simply using one or ensembling them. Also, I'm curious if there's a name for this technique (chaining?) and how others address the overfitting concern I raised. – Brian Bien Apr 22 '17 at 19:57
  • @BrianBien I'm not familiar with any established term for this kind of ensemble method. To avoid overfitting, use cross-validation, only fitting both models (the naive Bayes classifier and the random forest) with the data outside the current fold, then evaluating the ensemble with the data in the fold. – Kodiologist Apr 22 '17 at 20:43
  • I don't see how cross validation would solve the problem of the RF model overfitting to the output feature from the NB model. Can you elaborate on how you'd apply CV here? – Brian Bien Apr 22 '17 at 21:54
  • @BrianBien The same way as you'd apply CV for any other model—where's the new danger of overfitting? You're not treating the outputs of the naive Bayes model as correct values of the DVs; you're using them as a feature. – Kodiologist Apr 22 '17 at 22:43
  • CV is not the answer in this case because there is data leakage between the two procedures. It should certainly be used when training each individual model, but the problem remains that information that will not be available at prediction time is used during training. – usεr11852 Apr 23 '17 at 02:08
  • 2
    This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? You can also turn it into a comment. – gung - Reinstate Monica May 06 '17 at 02:20
  • I think it does answer the question as written, although it may not be the sort of answer the asker was hoping for, and I don't know what else needs to be said about it. – Kodiologist May 06 '17 at 14:14