Long story short - I am lost among the ways to check whether two binary variables are related.
Context - I am working on a side project where I analyze marketplace data to see which variables are related to items being sold (the goal is to find out what can be done to sell more items). So I am trying to see which variables are correlated with the dependent variable, which is "sold" (True/False). I am having some difficulties with binary data. For example, there's an independent variable "sale" (True/False) and I want to know whether putting an item on sale increases the probability of selling it.
My first instinct was to use the chi-squared test for it:
import pandas as pd
from scipy.stats import chi2_contingency
# Cross-tabulate sold vs. sale into a 2x2 table of counts
contingency = pd.crosstab(df.sold, df.sale)
c, p, dof, expected = chi2_contingency(contingency)
I got p = 2.5e-142. At this point, it seemed right to conclude that sale definitely (since the p-value is very small) has an impact on whether an item was sold or not.
But then I started reading more and found this - https://journals.sagepub.com/doi/pdf/10.1177/8756479308317006?fbclid=IwAR2OEdL0WwghG8gyCp-ee3zYhgVhAhJVGynTBWVUw3JeJfgaWhytMylPbrY. Here it says that
One common way of measuring association in such a table is to use the phi coefficient, ϕ. Values of ϕ lie between 0 and 1.
And according to Wiki (https://en.wikipedia.org/wiki/Phi_coefficient), if I understood it correctly, for binary variables Matthews correlation coefficient should be used. But MCC shows that there's no dependence between sold and sale:
from sklearn import metrics
mc = metrics.matthews_corrcoef(df[b], df.sold_in_30_days)
returns mc=0.03.
The same happened with Pearson's correlation coefficient - I got the same value as for MCC (0.03).
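To make the apparent contradiction concrete, here is a minimal sketch with made-up counts (not my real data). As far as I understand, for a 2x2 table MCC and Pearson's coefficient both reduce to the phi coefficient, and the chi-squared statistic without Yates' correction equals n * phi^2, so a small phi and a tiny p-value can coexist when n is large. Note that scipy's chi2_contingency applies Yates' correction to 2x2 tables by default, so its statistic would differ slightly from the one computed here.

```python
import math

# Hypothetical 2x2 counts, invented purely for illustration
n11, n10 = 6600, 13400    # on sale:     sold, not sold
n01, n00 = 24000, 56000   # not on sale: sold, not sold
n = n11 + n10 + n01 + n00

# Marginal totals
sale, no_sale = n11 + n10, n01 + n00
sold, not_sold = n11 + n01, n10 + n00

# Phi coefficient (equals MCC and Pearson's r for two binary variables)
phi = (n11 * n00 - n10 * n01) / math.sqrt(sale * no_sale * sold * not_sold)

# Pearson chi-squared statistic from observed vs. expected counts,
# without Yates' continuity correction
observed = [n11, n10, n01, n00]
expected = [sale * sold / n, sale * not_sold / n,
            no_sale * sold / n, no_sale * not_sold / n]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"phi = {phi:.4f}, chi2 = {chi2:.2f}, n * phi^2 = {n * phi ** 2:.2f}, p = {p_value:.2e}")
```

In this made-up table the sell rate is 33% on sale versus 30% off sale: the association is weak (small phi), yet with 100,000 rows the test rejects independence very confidently.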
My thoughts - both MCC and Pearson's coefficient measure correlation, i.e. linear dependence, which makes the most sense for numerical values, so it does not make sense to compute them for binary variables.
Question - So in cases where we need to know whether one binary variable impacts another, is the p-value of the chi-squared test the ultimate measure?