
I am working on a big data project, iterating over 30k features.

There is a part where I compare several clusters (after k-means); part of the code:

from scipy.stats import chi2_contingency

# per-cluster counts of the two features being compared
e21 = df[['ping1', 'ping2']]
stat, p, dof, expected = chi2_contingency(e21)
t = {'ping1': str(p), 'name': '{}'.format(ping)}

Problem: the scipy documentation (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) says:

This test should only be used if the observed and expected frequencies in each cell are at least 5.

The thing is, I often get results like the following:

Cluster  | ping1    | ping2
0        |  56      |  14
1        |  9       |  89
2        |  111     |  78
3        |  3       |  0

When it's "0" versus "0" in the same cluster, I just ignore it. There is no error; I am just wondering whether it's right to continue in such situations.
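To see which cells actually violate the rule of thumb, one can inspect the `expected` array that chi2_contingency already returns. A minimal sketch using the counts from the table above (the DataFrame construction here is mine, standing in for the original df):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# per-cluster counts copied from the table in the question
e21 = pd.DataFrame({'ping1': [56, 9, 111, 3],
                    'ping2': [14, 89, 78, 0]})

stat, p, dof, expected = chi2_contingency(e21)

# the rule of thumb concerns expected counts; count how many fall below 5
low_cells = (expected < 5).sum()
print(f"p = {p:.3g}, dof = {dof}, cells with expected count < 5: {low_cells}")
# -> only cluster 3's row (expected ~1.49 and ~1.51) is below 5, so low_cells is 2
```

Here the test can still run (only row and column totals must be nonzero), but the two low-expected cells are what the warning in the documentation is about.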

  • Such rules of thumb tend to be too severe: they tend to cause people to avoid using tests in some cases where they give good results. In this example, there's no need to run a test: the difference in distributions among the columns is blatant. – whuber Feb 23 '21 at 13:05
  • So in this case, when I am iterating over the data and meet this kind of 0 versus a number, it is fine as long as there are other clusters of numbers? – TheUndecided Feb 23 '21 at 13:12
  • Not necessarily: it depends on the other counts and on how many rows and columns the table has. BTW, the scipy documentation is decidedly wrong about one thing: the observed counts don't matter; only the expected values matter. – whuber Feb 23 '21 at 13:15
  • Understood, thank you! – TheUndecided Feb 23 '21 at 13:44
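The point in the last comments, that only the expected counts matter, can be seen in a small made-up table: the observed cells contain zeros, yet every expected count is well above 5, so the rule of thumb is satisfied (a sketch with invented numbers, not data from the question):

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed zeros on the diagonal, yet every expected count is 10
obs = np.array([[0, 20],
                [20, 0]])
stat, p, dof, expected = chi2_contingency(obs)
print(expected)        # -> all four expected cells are 10.0
print(f"p = {p:.3g}")  # the test is fine despite the observed zeros
```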

0 Answers