I am working on a big data project, iterating over 30k features.
There is a part where I compare several clusters (obtained with k-means). Part of the code:
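from scipy.stats import chi2_contingency
# contingency table: one row per cluster, one column per ping count
e21 = df[['ping1', 'ping2']]
stat, p, dof, expected = chi2_contingency(e21)
# str({}).format(ping) is equivalent to str(ping)
t = {'ping1': str(p), 'name': str(ping)}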
Problem: the documentation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html says that the
test should be used only if the observed and expected frequencies in each cell are at least 5.
However, I often get tables like the following, where some cells are below 5:
Cluster | ping1 | ping2
0 | 56 | 14
1 | 9 | 89
2 | 111 | 78
3 | 3 | 0
When both counts in the same cluster are 0, I just skip that cluster. No error is raised; I am just wondering whether it is valid to go ahead with the test in situations like the one above.
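
To show which cells fall below the guideline, here is a minimal sketch that runs the test on the example table and inspects the `expected` array it returns. The names `observed`, `min_count`, `rule_ok`, and `low_cells` are illustrative and not part of the original code, and the threshold of 5 is only the rule of thumb quoted in the docs.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the example table
# (rows: clusters 0-3; columns: ping1, ping2)
observed = np.array([[56, 14],
                     [9, 89],
                     [111, 78],
                     [3, 0]])

# Drop clusters where every count is 0; they would give zero expected
# frequencies, which chi2_contingency cannot handle
observed = observed[observed.sum(axis=1) > 0]

stat, p, dof, expected = chi2_contingency(observed)

# Rule-of-thumb check: every observed and expected cell should be >= 5
min_count = 5  # assumed threshold, taken from the guideline in the docs
rule_ok = bool((observed >= min_count).all() and (expected >= min_count).all())

# (row, column) positions of the cells that violate the guideline
low_cells = np.argwhere((observed < min_count) | (expected < min_count))

print("p-value:", p)
print("guideline satisfied:", rule_ok)
print("cells below threshold:", low_cells.tolist())
```

For this table, both cells of cluster 3 fall below the threshold, so the guideline is not met even though the call itself completes without an error.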