
Long story short - I am lost among the ways to check whether there is any relationship between binary variables.

Context - I am working on a side project where I analyze marketplace data to see which variables are related to items being sold (the goal is to find out what can be done to sell more items). So I am trying to see which variables are associated with the dependent variable, which is "sold" (True/False). I am having some difficulties with binary data. For example, there's an independent variable "sale" (True/False), and I want to know whether putting an item on sale increases the probability of selling it.

My first instinct was to use the chi-squared test for it:

import pandas as pd
from scipy.stats import chi2_contingency

# 2x2 contingency table of sold vs. sale
contingency = pd.crosstab(df.sold, df.sale)
c, p, dof, expected = chi2_contingency(contingency)

I got p = 2.5e-142. At this point, it seemed right to conclude that sale definitely (since the p-value is very small) has an impact on whether an item was sold or not.
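A minimal sketch of why a tiny p-value on its own says little about how much sale matters. It uses synthetic data, not the actual marketplace data: the sample size, the 30% share of items on sale, and the 10% vs. 12% sell rates are all made-up assumptions. With many rows, even a two-point difference in sell rate produces an extremely small p-value, so it helps to look at the sell rates per group next to it.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the real data (all rates below are assumptions).
rng = np.random.default_rng(1)
n = 200_000
sale = rng.random(n) < 0.3                          # ~30% of items on sale
sold = rng.random(n) < np.where(sale, 0.12, 0.10)   # 12% vs. 10% sell rate
df = pd.DataFrame({"sale": sale, "sold": sold})

# Chi-squared test of independence on the 2x2 table.
chi2, p, dof, expected = chi2_contingency(pd.crosstab(df.sold, df.sale))

# Effect size in plain terms: share of items sold within each sale group.
rate_on_sale = df.loc[df.sale, "sold"].mean()
rate_no_sale = df.loc[~df.sale, "sold"].mean()

print(f"p-value:              {p:.3g}")
print(f"sell rate on sale:    {rate_on_sale:.3f}")
print(f"sell rate not on sale: {rate_no_sale:.3f}")
print(f"difference:           {rate_on_sale - rate_no_sale:.3f}")
```

The p-value answers "is there any association at all?", while the difference in sell rates answers "how large is it?".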

But then I started reading more and found this - https://journals.sagepub.com/doi/pdf/10.1177/8756479308317006?fbclid=IwAR2OEdL0WwghG8gyCp-ee3zYhgVhAhJVGynTBWVUw3JeJfgaWhytMylPbrY. Here it says that

One common way of measuring association in such a table is to use the phi coefficient, ϕ. Values of ϕ lie between 0 and 1.

And according to Wikipedia (https://en.wikipedia.org/wiki/Phi_coefficient), if I understood it correctly, the Matthews correlation coefficient (MCC) should be used for binary variables. But MCC shows that there's no dependence between sold and sale:

from sklearn import metrics
mc = metrics.matthews_corrcoef(df[b], df.sold_in_30_days)

returns mc=0.03.

The same happened with Pearson's correlation coefficient - I got the same value as for MCC (0.03).

My thoughts - both MCC and Pearson's coefficient measure correlation, which makes the most sense for numerical values, and they measure linear dependency, so it does not make sense to compute them for binary variables.
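A minimal sketch, again on synthetic data with made-up rates, of how these numbers relate for a 2x2 table. MCC, the phi coefficient, and Pearson's correlation computed on 0/1 values coincide, and the chi-squared statistic without Yates' continuity correction equals n times phi squared. So a coefficient near 0.03 and a tiny p-value can describe the same table when n is large.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import matthews_corrcoef

# Synthetic stand-in for the real data (all rates below are assumptions).
rng = np.random.default_rng(0)
n = 200_000
sale = rng.random(n) < 0.3
sold = rng.random(n) < np.where(sale, 0.12, 0.10)
df = pd.DataFrame({"sale": sale, "sold": sold})

# MCC and Pearson's r on the 0/1-coded variables.
mcc = matthews_corrcoef(df.sold, df.sale)
pearson = np.corrcoef(df.sold.astype(int), df.sale.astype(int))[0, 1]

# correction=False disables Yates' continuity correction, so that
# chi2 / n equals phi squared exactly for a 2x2 table.
chi2, p, dof, expected = chi2_contingency(
    pd.crosstab(df.sold, df.sale), correction=False
)
phi = np.sqrt(chi2 / n)  # magnitude only; the sign is lost in the square

print(f"MCC:     {mcc:.4f}")
print(f"Pearson: {pearson:.4f}")
print(f"phi:     {phi:.4f}")
print(f"p-value: {p:.3g}")
```

Because chi2 = n * phi^2, the p-value shrinks as n grows even when phi stays fixed, which is why the two results are not in conflict.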

Question - So when we need to know whether one binary variable affects another, is the p-value of the chi-squared test the ultimate measure?

  • Welcome to Cross Validated! How would you do it for continuous variables? – Dave Apr 17 '22 at 19:00
  • Thanks! So for the continuous vs continuous, I'd first inspect it graphically (using scatterplot) and then use either Pearson correlation or maybe something like this https://stats.stackexchange.com/questions/393903/quadratic-polynomial-how-to-test-correlation-between-x-and-y if graphically it would look like e.g. quadratic. And for continuous vs binary I am actually not sure. – cinnamon Apr 17 '22 at 19:36
  • You can visualize the relationship between two categorical variables as well, though admittedly two binary variables don't make for a gripping plot. And some ideas for plotting a continuous variable against a categorical one. – dipetkov Apr 17 '22 at 19:46
  • Matthews correlation, Phi coefficient and Pearson correlation on binary variables are all the same thing. https://en.wikipedia.org/wiki/Phi_coefficient – Glen_b Apr 18 '22 at 09:44
  • 1
    hi @cinnamon at the end what did you end up using to see the correlation b/w 2 binary variables? – Scope Jan 10 '23 at 17:57
