
I have texts from the organisations A, B, C and D. For A, B and C I also have the outcomes (0/1). My goal is to create a classification model from the texts of A, B and C that is able to predict the outcomes of D.

In general, the texts are very similar in structure and the wording is also related. Nevertheless, I cannot rule out that each organisation has its own peculiarities in wording.

Now I am considering whether I should apply a specific cross-validation strategy to prevent my model from using organisation-specific characteristics of the texts for classification. Does it make sense to adapt the CV so that, for example, I form the folds in such a way that an organisation is always a block? Within such a scheme, you would use A and B for training and evaluate on C, then A and C for training, and so on. This way, only models that do NOT use the specific elements of A and B should be "successful".
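For illustration, this is roughly the fold structure I have in mind (a minimal sketch using scikit-learn's LeaveOneGroupOut; the texts and labels below are made up):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Made-up data: one entry per text, with its 0/1 outcome and organisation.
texts = np.array(["text 1", "text 2", "text 3", "text 4", "text 5", "text 6"])
y     = np.array([0, 1, 0, 1, 0, 1])
orgs  = np.array(["A", "A", "B", "B", "C", "C"])

# Each fold holds out one complete organisation as the test set.
for train_idx, test_idx in LeaveOneGroupOut().split(texts, y, groups=orgs):
    print("train on:", sorted(set(orgs[train_idx])),
          "-> evaluate on:", sorted(set(orgs[test_idx])))
# train on: ['B', 'C'] -> evaluate on: ['A']
# train on: ['A', 'C'] -> evaluate on: ['B']
# train on: ['A', 'B'] -> evaluate on: ['C']
```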

Is there a name for this strategy and can anyone recommend literature? Are there any potential risks with this approach?

Martin
  • By casting this as a classification task instead of a prediction task you are probably not meeting the study's goals. But validation is a poor way of understanding organizational characteristics. It is better to model the effect of those characteristics and then to understand their estimated effects with a whole-model fit. – Frank Harrell Nov 14 '20 at 12:43
  • Thank you very much for this thought-provoking comment. Of course, it is a fundamental question whether an absolute binary decision makes sense. I am thinking about it myself and am still in the planning phase. Nevertheless my problem remains: how do I minimize the error probability of a model built from A, B, C for a prediction of the outcomes of D (the outcomes of D are not known and not easily observable)? Especially if I have a lot of possible explanatory variables? – Martin Nov 14 '20 at 13:43
  • The fact that a binary decision may ultimately be needed is not what drives the choice of analysis. Given a utility/cost/loss function you convert probabilities to optimum decisions (maximizing expected utility) and can also capitalize on gray-zone probabilities (say 0.4-0.6) by using them to indicate "no decision". Classifiers know nothing about optimum decisions. – Frank Harrell Nov 14 '20 at 17:11
  • I have understood that point. I am interested in how I ensure that my model generates valid probabilities out of sample. – Martin Nov 14 '20 at 18:43
  • Depending on the effective sample size and the number of candidate features, one good approach is to use the Efron-Gong optimism bootstrap to validate the whole-sample model, focusing on a proper probability accuracy scoring rule such as the Brier score. – Frank Harrell Nov 14 '20 at 22:18
  • @Frank Without knowing anything of their study's goals, how can you assume they aren't meeting them? – astel Nov 15 '20 at 19:46
  • Because forced-choice classification without admitting that there is a gray zone seldom meets the needs of any study. – Frank Harrell Nov 15 '20 at 22:22
  • @Martin I'm facing a somewhat similar problem and I think what you're referring to is sometimes called "block cross-validation": https://doi.org/10.1111/ecog.02881 (If you work with R, this question might also be of interest: https://stats.stackexchange.com/q/137000/203941). – Fanfoué Aug 30 '22 at 13:18

1 Answer


Whatever kind of resampling scheme (CV or bootstrap) you use for verification/validation, when you know or suspect that your data is clustered, such as by organization in your case, you need to split at the cluster level; otherwise the clusters create data leaks between training and test sets. More generally speaking, you need to split so that even the highest-level confounding factors are independent across the splits.

In your case this basically means leave-one-organization-out cross-validation (or a similar grouping of the bootstrapped models), since there are only 3 organizations.
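As a minimal sketch of what that can look like in practice (assuming scikit-learn and a simple bag-of-words pipeline; `texts`, `y` and `orgs` stand for your documents, 0/1 outcomes and organization labels, one entry per text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.pipeline import make_pipeline

# Any classifier that outputs probabilities would do here.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Each fold trains on two organizations and evaluates on the third, so
# organization-specific wording cannot leak from training into testing.
res = cross_validate(pipe, texts, y, groups=orgs, cv=LeaveOneGroupOut())
print(res["test_score"])  # one score per held-out organization
```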

So the splitting strategy and the model setup both follow from the data structure (clustering) you have or set up.

Everything Frank Harrell comments about predicting probabilities and using a proper scoring rule applies; these are modeling decisions/choices which are completely independent of the resampling scheme / data splitting plan for verification.
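For completeness, a sketch of how predicted probabilities from this leave-one-organization-out scheme can be scored with the Brier score (reusing the placeholder names `pipe`, `texts`, `y`, `orgs` from above):

```python
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Out-of-organization predicted probabilities for the positive class:
# every text is predicted by a model that never saw its organization.
proba = cross_val_predict(pipe, texts, y, groups=orgs,
                          cv=LeaveOneGroupOut(),
                          method="predict_proba")[:, 1]

# Brier score (a proper scoring rule) per held-out organization.
for org in np.unique(orgs):
    mask = orgs == org
    print(org, brier_score_loss(y[mask], proba[mask]))
```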