Logistic regression : incoherent coefficient values from z tests?

Question

I am working on a customer purchase problem. I have 150 campaigns sent by email, which I denote C0, C1, ..., C149. For each customer i:

Cj= 0 if campaign j is NOT received by customer i,
Cj= 1 if campaign j is received by customer i
nb_campaigns = the number of campaigns received by client i
sucess = 1 if customer i ordered something thanks to the campaigns he received, 0 otherwise

I performed a logistic regression to explain the variable sucess with nb_campaigns, and I got the following results (with statsmodels):

When I perform a logistic regression to explain sucess with ALL campaigns AND nb_campaigns, I get different coefficients for the intercept and for nb_campaigns. On top of that, I get the following graph:

It seems abnormal to me that the probability decreases as the number of campaigns increases. Moreover, it is not consistent with the first graph. Is it a mistake in my code? I thought it might be because I increase only the variable nb_campaigns in the loop, whereas in fact, if this variable increases by one, then one of the campaign variables needs to increase by 1 too.

Do you know what tests are performed in the previous table and why the coefficients are not the same?

If I want to describe the effect of nb_campaigns on sucess, is it a mistake to include all the other variables? Is it better to model sucess vs nb_campaigns and sucess vs all campaigns separately?

Thank you for your help !

William

You seem to have nb_campaigns as $\sum C_j$. So if you include each of the $C_j$ then also including nb_campaigns can lead to spurious results — Henry, Sep 17 '21 at 12:37
I don't think running this regression is possible since nb_campaigns is a linear function of the cjs. I am surprised that Python did not issue an error message. — dimitriy, Sep 17 '21 at 17:12
Thank you very much for your feedback ! I'll run the logistic regression without nb_campaigns — Anthony G, Sep 19 '21 at 11:42
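
The point raised in the comments can be checked numerically. The following is a minimal sketch in plain Python, assuming nb_campaigns is exactly the sum of the Cj as the comments suggest; the coefficient values, the shift delta and the helper linear_predictor are hypothetical and chosen only for illustration. It shows that two different coefficient sets give the same log-odds for every customer, so the data cannot tell them apart.

```python
import random

random.seed(0)

N_CAMPAIGNS = 5  # small illustrative number; the question has 150


def linear_predictor(c, intercept, betas, beta_nb):
    """Log-odds for one customer, with nb_campaigns computed as sum of the Cj."""
    nb_campaigns = sum(c)
    return intercept + sum(b * cj for b, cj in zip(betas, c)) + beta_nb * nb_campaigns


# Hypothetical coefficients (not taken from the question's output).
intercept = -2.0
betas = [0.8, -0.3, 0.1, 0.5, -0.6]
beta_nb = 0.4

# Move an arbitrary amount from the nb_campaigns coefficient onto every Cj.
delta = 1.7
betas_shifted = [b + delta for b in betas]
beta_nb_shifted = beta_nb - delta  # now negative

# Random 0/1 reception patterns for 1000 simulated customers.
customers = [[random.randint(0, 1) for _ in range(N_CAMPAIGNS)] for _ in range(1000)]

# Largest difference in log-odds between the two coefficient sets.
max_gap = max(
    abs(
        linear_predictor(c, intercept, betas, beta_nb)
        - linear_predictor(c, intercept, betas_shifted, beta_nb_shifted)
    )
    for c in customers
)
print(max_gap)  # zero up to floating-point rounding
```

Because any such shift leaves the fitted probabilities unchanged, the sign and size of the nb_campaigns coefficient in the second model are arbitrary when all the Cj are also included.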

score 1 · Answer 1 · answered Sep 17 '21 at 12:55

I don't see anything particularly strange about the results.

Notice the scale of the $y$-axis on the second plot: the predicted probabilities are nearly zero. So once you add the campaign dummies to the model, the campaign count does not add much to the results.
There is also nothing strange about the probability of success decreasing with the number of campaigns. If you send me 150 marketing e-mails, I guarantee that I'll mark them as spam and never look at your e-mails again. I see no reason to expect that flooding your customers with marketing campaigns would lead to success.
If you have a dummy variable for each campaign in the second model, it can be the case that receiving a particularly successful marketing e-mail predicts success, rather than the number of campaigns. In the first model, the count may include the successful e-mail (or e-mails), so the count serves as a proxy for having received those particular campaigns. That's why the relation may reverse between the models.
As correctly noted by @Henry in the comments: "You seem to have nb_campaigns as $\sum C_j$. So if you include each of the $C_j$ then also including nb_campaigns can lead to spurious results".
You may be interested in going through related threads, such as "How can adding a 2nd IV make the 1st IV significant?".
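
The proxy argument above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: 20 campaigns instead of 150, each received independently with probability 0.5, and success driven solely by campaign C0 (probability 0.30 if received, 0.05 otherwise). None of these numbers come from the question. Even though the count has no effect of its own here, customers with a high count succeed more often, simply because they are more likely to have received C0.

```python
import random

random.seed(42)

N_CUSTOMERS = 20000
N_CAMPAIGNS = 20  # illustrative; the question has 150

high_count = []  # outcomes for customers with nb_campaigns >= 11
low_count = []   # outcomes for customers with nb_campaigns <= 9

for _ in range(N_CUSTOMERS):
    # Each campaign is received independently with probability 0.5 (assumption).
    c = [random.randint(0, 1) for _ in range(N_CAMPAIGNS)]
    nb_campaigns = sum(c)

    # Only campaign C0 changes the success probability (assumption).
    p_success = 0.30 if c[0] == 1 else 0.05
    success = 1 if random.random() < p_success else 0

    if nb_campaigns >= 11:
        high_count.append(success)
    elif nb_campaigns <= 9:
        low_count.append(success)

rate_high = sum(high_count) / len(high_count)
rate_low = sum(low_count) / len(low_count)

print("success rate, high count:", round(rate_high, 3))
print("success rate, low count: ", round(rate_low, 3))
```

A model with only nb_campaigns would pick up this association, while a model that also contains the C0 dummy would attribute it to C0, which is one way the two models can tell different stories.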

Thank you Tim ! I'm gonna get rid of nb_campaigns. — Anthony G, Sep 19 '21 at 11:44
