I ran into a problem where the predicted probabilities of treatment/control from a machine learning model (an XGBoost model) match the actual treatment assignment almost exactly, with an area under the curve (AUC) of 0.9999, while a logistic regression achieves an AUC of only 0.5. However, if I use the XGBoost propensity scores, the greedy matching algorithm cannot find any matches for the treated units. An earlier post here mentioned this problem: How to use propensity scores that exhibit separation. So my question is: in order for the matching algorithm to work, do we have to give up the better machine learning prediction model for an inferior logistic regression? This seems a little counterintuitive.
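For concreteness, here is a minimal sketch of the setup; the covariate matrix `X`, treatment indicator `t`, and all hyperparameters are illustrative stand-ins, not my actual data:

```python
# Illustrative sketch only: X (covariates) and t (treatment) are synthetic
# stand-ins, not the real data from the question.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # covariate matrix
t = rng.integers(0, 2, size=1000)        # binary treatment indicator

# Logistic regression propensity scores
ps_lr = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

# XGBoost propensity scores; a flexible model like this can fit the
# in-sample treatment assignment almost perfectly
xgb = XGBClassifier(n_estimators=500, max_depth=6, eval_metric="logloss")
ps_xgb = xgb.fit(X, t).predict_proba(X)[:, 1]

print("LR  AUC:", roc_auc_score(t, ps_lr))   # near 0.5 here
print("XGB AUC:", roc_auc_score(t, ps_xgb))  # far higher, from overfitting
```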
1 Answer
You are judging your propensity score models on the wrong criteria. The only criteria you should use to assess propensity scores are how well they balance the covariates and how large the effective sample size remains after conditioning on them (i.e., by matching or weighting). I explain this in my answer here. You should not use AUC to assess propensity score models; the fact that XGB has a higher AUC than LR doesn't mean anything. It is easy to overfit a propensity score model so that it predicts treatment almost perfectly; that says nothing about positivity and does not make the overfit model preferable.
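As a rough sketch of what that assessment can look like, continuing the synthetic `X`, `t`, and `ps_lr` from the question's sketch (the greedy matcher and caliper below are toy illustrations, not a specific package's API):

```python
# Judge the propensity score by covariate balance and remaining sample
# size after matching, not by AUC. Everything below is an illustrative
# sketch; greedy_match is a toy matcher, not a specific package's API.
import numpy as np

def smd(x_treat, x_ctrl):
    """Standardized mean difference for one covariate."""
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_ctrl.var(ddof=1)) / 2)
    return (x_treat.mean() - x_ctrl.mean()) / pooled_sd

def greedy_match(ps, t, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on the score within a caliper."""
    treated = np.flatnonzero(t == 1)
    controls = list(np.flatnonzero(t == 0))
    pairs = []
    for i in treated:
        if not controls:
            break
        dists = np.abs(ps[np.array(controls)] - ps[i])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            pairs.append((i, controls.pop(j)))
    return pairs

ps = ps_lr  # e.g., the logistic regression scores from the question's sketch
pairs = greedy_match(ps, t)
print(f"matched {len(pairs)} of {int(t.sum())} treated units")  # effective sample size

if pairs:
    ti = np.array([p[0] for p in pairs])
    ci = np.array([p[1] for p in pairs])
    for k in range(X.shape[1]):
        # SMDs near 0 (commonly, below 0.1) indicate good balance
        print(f"covariate {k}: SMD = {smd(X[ti, k], X[ci, k]):.3f}")
```

With a heavily overfit score, the treated units' scores pile up near 1 and the controls' near 0, so nothing falls within the caliper, which is exactly the matching failure described in the question.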
Noah
- Thanks for the very helpful answer. But aren't they related at all? I mean, a good model that achieves balance should at least yield an AUC better than 0.5, right? If a model is completely unable to distinguish between treatment and control (say everyone gets a predicted probability of 0.5), then I would think the resulting match would be poor, right? – Stat Novice Oct 07 '22 at 15:43
- They might be related, but there's no point in even computing accuracy metrics like AUC if your goal is balance. Such a low AUC suggests there is already good balance in the dataset. – Noah Oct 07 '22 at 16:07
- Actually, they are not balanced at all in the dataset... that's why I am so puzzled. – Stat Novice Oct 07 '22 at 16:25