I have a classification problem. The response is whether a player will be banned (Yes = 1, No = 0). I am considering a feature that indicates whether a player cheats. Intuitively, a player who cheats should be more likely to be banned, so this feature should be included in the model.

When I look at the ban rate for the two groups (cheat vs. don't cheat), I see that the rates are not very different.

Ban rate by group:
Cheat      Don't cheat
37%        35%

When I fit a logistic regression with that feature (cheat or don't cheat) and some other features, the p-value for that feature is very small (< 0.05) and its coefficient is positive. The sign makes intuitive sense: if a player cheats, their chance of being banned should be higher.

My question is: given such a minor difference in the ban rate between the two groups (cheat vs. don't cheat), why is that feature so significant in the logistic regression output, with the expected coefficient sign? The model has no interaction terms, in case that makes a difference.

  • Welcome to Cross Validated! What is your sample size? – Dave Jan 04 '23 at 18:35
  • around 1 million – Nancy Hall Jan 05 '23 at 01:40
  • With one million observations and only one predictor, almost anything will be "significant". Focus on the difference in the probability of being banned: is it meaningful? Construct a confidence interval for that difference. See https://stats.stackexchange.com/questions/323862/given-big-enough-sample-size-a-test-will-always-show-significant-result-unless – kjetil b halvorsen Jan 15 '23 at 20:26
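
The confidence interval suggested in the last comment can be sketched roughly as follows. This is a minimal illustration, not the asker's actual analysis: the thread gives only the two ban rates (37% and 35%) and a total of about one million players, so the group sizes below (`n_cheat`, `n_clean`) are hypothetical, and the helper `diff_ci` is an illustrative name. It uses the plain Wald interval for a difference of two proportions and ignores the other features in the model.

```python
import math

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Wald confidence interval for the difference of two proportions (p1 - p2)."""
    diff = p1 - p2
    # Standard error of the difference, treating the two groups as independent.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical split of the ~1 million players; the thread does not give group sizes.
n_cheat, n_clean = 100_000, 900_000

diff, se, (lo, hi) = diff_ci(0.37, n_cheat, 0.35, n_clean)
z_stat = diff / se  # z statistic for the null hypothesis of equal ban rates

print(f"difference = {diff:.4f}, 95% CI = ({lo:.4f}, {hi:.4f}), z = {z_stat:.1f}")
```

Under these assumed group sizes the interval is narrow and excludes zero, which is why the test is "significant", yet it still describes a gap of only about two percentage points. Whether a gap of that size matters is a practical question that the p-value does not answer.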
