I have the following dataframe in Python:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'val': [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]})
print(df)
   id group  val
0   1     a    1
1   2     a    0
2   3     a    1
3   4     b    0
4   5     b    1
5   6     b    1
6   7     b    0
7   8     c    0
8   9     c    1
9  10     c    1
I want to see if ANY of the levels of group has a significant effect on val.
So I do a logistic regression as follows:
import statsmodels.api as sm
dummy_variables = pd.get_dummies(df['group'], drop_first=False)
logit_model = sm.Logit(df['val'], dummy_variables)
result = logit_model.fit()
print(result.summary())
Notice the drop_first=False; I also do not add a constant to the model, because an intercept together with all three dummies would be perfectly multicollinear.
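For comparison, this is the other parameterization I am aware of, where one level is dropped and an intercept is kept instead (just a sketch on the toy df above; dtype=float is only there to avoid the boolean dummies newer pandas versions return):

import pandas as pd
import statsmodels.api as sm

# Reference-level coding: drop 'a' and keep an intercept instead.
dummies_ref = pd.get_dummies(df['group'], drop_first=True, dtype=float)
X_ref = sm.add_constant(dummies_ref)
result_ref = sm.Logit(df['val'], X_ref).fit()
print(result_ref.summary())

In that version the coefficients for b and c are contrasts against level a, whereas in my no-intercept model each coefficient is the log-odds of val=1 within its own level.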
On my real data (not this toy example) I get a p-value < 0.05 for dummy_a and p-values > 0.05 for dummy_b and dummy_c, using the no-intercept model above.
Is it right to say that a has a significant effect on val?
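Also, since what I really want is to test whether group as a whole matters, would a likelihood-ratio test against an intercept-only model be the right approach? A sketch of what I have in mind (scipy.stats.chi2 assumed for the p-value):

import numpy as np
from scipy import stats

# Full model: the three group dummies (cast to float in case get_dummies returned booleans).
full = sm.Logit(df['val'], dummy_variables.astype(float)).fit(disp=0)
# Null model: intercept only.
null = sm.Logit(df['val'], np.ones((len(df), 1))).fit(disp=0)

lr_stat = 2 * (full.llf - null.llf)
df_diff = dummy_variables.shape[1] - 1   # 3 dummy coefficients vs. 1 intercept
p_value = stats.chi2.sf(lr_stat, df_diff)
print(lr_stat, p_value)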