I have a classification task where the input data (text) implicitly contains information that I don't want the classifier to use, so I have to "hide" it from the classifier. I have something like a label for this information in the data, but because it is only implicit in the texts, I can't simply mask it in the input.
A practical example would be a predictive-policing-style classifier that predicts whether a person should be checked. Unfortunately, our training data has a known racial bias. We have a feature describing the race of each checked person. I'm looking for a method to train a classifier that does not reproduce this racial bias, even though it is present in the training data.
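To make the problem concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the names `group`, `proxy`, and `checked`, the probabilities, and the threshold are made up, and the single binary `proxy` feature only stands in for whatever the text implicitly reveals. The point is that leaving the race feature out of the input does not help, because a correlated feature carries the same signal.

```python
import random

random.seed(0)

def make_person():
    # Hypothetical protected attribute (0 or 1), never shown to the classifier.
    group = random.randint(0, 1)
    # Hypothetical proxy feature that agrees with the group 90% of the time,
    # standing in for information that is implicit in the text.
    proxy = group if random.random() < 0.9 else 1 - group
    # Biased historical label: group 1 was checked far more often.
    p_checked = 0.6 if group == 1 else 0.2
    checked = 1 if random.random() < p_checked else 0
    return group, proxy, checked

data = [make_person() for _ in range(10_000)]

def rate(values):
    return sum(values) / len(values)

# "Train" without the protected attribute: estimate the historical
# check rate for each value of the proxy feature only.
check_rate = {v: rate([c for _, p, c in data if p == v]) for v in (0, 1)}

def predict(proxy):
    # Predict "check" when the historical rate exceeds an arbitrary threshold.
    return 1 if check_rate[proxy] > 0.4 else 0

# Share of positive predictions per group, and the gap between the groups
# (a demographic-parity-style difference).
pos_rate = {g: rate([predict(p) for grp, p, _ in data if grp == g]) for g in (0, 1)}
parity_gap = abs(pos_rate[1] - pos_rate[0])

print("check rate per proxy value:", check_rate)
print("positive prediction rate per group:", pos_rate)
print("parity gap:", parity_gap)
```

With these made-up numbers the predictions still differ strongly between the two groups, although the classifier never sees the group itself. This is the effect I would like to remove.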
Do you know of a method or research direction I could look into for this problem?