Hide implicit information in input data from classifier

Question

I have a classification task in which the input data (text) implicitly contains information that I don't want the classifier to use, so I have to "hide" it from the classifier. The data includes something like a label for this implicit information, but I can't simply mask the information in the input itself.

One practical example would be building a predictive-policing-style classifier that predicts whether a person should be checked. Sadly, our training data has a known racial bias. We have a feature describing the race of each checked person. I’m looking for a way to train a classifier that does not pick up this racial bias, even though it is present in the training data.

Do you know of a method or research direction I could investigate for this issue?

score 2 · Accepted Answer · answered May 29 '18 at 00:47

I would regress each input feature on the variables whose effects you want to remove. The residuals should then be (linearly) uncorrelated with the variables to be hidden. You could use those residuals in place of the original features in your classifier.
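A minimal sketch of this residual approach, assuming the text has already been turned into a numeric feature matrix (for example TF-IDF or embeddings) and that the variable to hide is one-hot encoded. The function name `residualize` and the synthetic data are illustrative assumptions, not something from the question.

```python
import numpy as np

def residualize(X, Z):
    """Return the part of each column of X not linearly explained by Z.

    X: (n_samples, n_features) numeric feature matrix.
    Z: (n_samples, n_protected) design matrix for the variables to hide.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Add an intercept so the residuals are also mean-centred.
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    # One least-squares fit covers every feature column at once.
    coef, *_ = np.linalg.lstsq(Z1, X, rcond=None)
    return X - Z1 @ coef

# Synthetic data (hypothetical): three groups, four features.
rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 3, size=n)       # stand-in for the protected attribute
Z = np.eye(3)[group][:, 1:]              # one-hot, first level dropped
X = rng.normal(size=(n, 4))
X[:, 0] += 2.0 * (group == 1)            # feature 0 leaks the group

X_res = residualize(X, Z)

# Largest remaining cross-product between protected columns and residuals.
max_leak = np.abs(Z.T @ X_res).max()
```

`X_res` would then be passed to the classifier instead of `X`. Note that this removes only linear dependence on `Z`; a nonlinear classifier may still recover some of the hidden information from the residuals.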

Another, similar approach would be to regress the final predictions on the variables whose effects you want to remove and use the residuals from that fit. Since the predictions are presumably categories, you could use logistic regression with 'race' (or whatever it is) as the predictor.
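A rough sketch of this second variant, assuming binary predictions and a single categorical variable to hide. The helper `fit_logistic` (plain Newton's method) and the synthetic scores are hypothetical and only there to keep the example self-contained.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(Z1, y, n_iter=25):
    """Fit unregularised logistic regression of y on Z1 by Newton's method."""
    w = np.zeros(Z1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Z1 @ w)
        grad = Z1.T @ (y - p)                           # score vector
        hess = (Z1 * (p * (1.0 - p))[:, None]).T @ Z1   # observed information
        w += np.linalg.solve(hess, grad)
    return w

# Synthetic classifier output (hypothetical): group 1 is flagged more often.
rng = np.random.default_rng(0)
n = 600
group = rng.integers(0, 3, size=n)
scores = sigmoid(rng.normal(size=n) + 1.0 * (group == 1))
y_hat = (scores > 0.5).astype(float)     # final binary predictions

# Intercept plus one-hot group indicators (first level dropped).
Z1 = np.column_stack([np.ones(n), np.eye(3)[group][:, 1:]])
w = fit_logistic(Z1, y_hat)
fitted = sigmoid(Z1 @ w)                 # P(prediction = 1 | group)

# Residuals: what is left of each prediction after removing the group effect.
residuals = y_hat - fitted
```

With a single categorical predictor the fitted probabilities are just the per-group rates of positive predictions, so the residuals average to zero within every group. How to turn them back into decisions (for example, which threshold to apply) is a separate design choice.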
