Relationship between two nominal variables with many categories

Question

I have two nominal variables, one with 10 categories and one with 12 categories (n = ~800). I hypothesise that these variables aren't related, but have been searching for tests that would show a relationship does exist.

I cannot use a Chi-Squared test, because the expected counts are too low, and a Cramér's V statistic won't work because there are too many categories.

I see this question has been asked a couple of times: here and here.

The first only has dead links and the second seems to focus on binarizing the variables, which I would rather avoid doing.

I have been looking at Goodman and Kruskal’s lambda, since I could hypothesise that one variable is predictive of the other, but a) it is unclear to me whether that is appropriate, and b) I can't find anything about how to interpret the output of this test.

Any advice would be greatly appreciated.

score 2 · Accepted Answer · answered Apr 03 '20 at 15:59

On the contrary, you can use a $\chi^2$ test. After all, it tests the null hypothesis that there is no structure in your contingency table beyond the row and column marginals, which seems to be precisely what you want.

The problem is that the test statistic is only asymptotically $\chi^2$ distributed, and your low expected cell counts interfere with these asymptotics.

However, you can address this by simulating many contingency tables with the row and column totals you see in your data, under the null hypothesis, calculating your test statistic for each simulated table, and finally comparing the test statistic of your actually observed contingency table against this simulated null distribution. No need for asymptotics.

You can do this using the r2dtable() function in base R, which implements an algorithm by Patefield (1981). See also here. In addition, this function may be of interest; its description refers to pp. 62-64 in Correspondence Analysis: Theory, Practice and New Strategies by Beh & Lombardo (2014).
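As a rough illustration of the simulation described above, here is a minimal sketch in base R. The data are simulated placeholders rather than the asker's data, and the names `tab`, `chisq_stat`, `B`, and `null_stats` are chosen only for this example.

```r
# Monte Carlo chi-squared test for a sparse r x c contingency table.
# Placeholder data: two unrelated nominal variables, n = 800 (hypothetical).
set.seed(1)
x <- sample(letters[1:10], 800, replace = TRUE)  # 10 categories
y <- sample(LETTERS[1:12], 800, replace = TRUE)  # 12 categories
tab <- table(x, y)

# Pearson chi-squared statistic, computed from the table's own margins
chisq_stat <- function(m) {
  expected <- outer(rowSums(m), colSums(m)) / sum(m)
  sum((m - expected)^2 / expected)
}

observed <- chisq_stat(tab)

# Draw B random tables that share the observed row and column totals
B <- 2000
sims <- r2dtable(B, rowSums(tab), colSums(tab))
null_stats <- vapply(sims, chisq_stat, numeric(1))

# Monte Carlo p-value; the +1 terms count the observed table itself
p_value <- (1 + sum(null_stats >= observed)) / (B + 1)
p_value
```

To test with a different statistic, replace `chisq_stat` with any function that maps a table to a single number. As far as its documentation describes, `chisq.test(tab, simulate.p.value = TRUE, B = 2000)` also samples tables with the given marginals, so it should give a comparable p-value for the $\chi^2$ statistic.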

Thank you Stephan! Not sure how I feel about selling R code, so I appreciate the other implementations. Would this idea also apply to Cramér's V or Goodman & Kruskal's lambda? I.e., could I use the simulated tables to calculate these statistics? — SamPassmore, Apr 06 '20 at 08:11
A follow-up question: how does this approach differ from just using the simulate.p.value argument of the base R chisq.test() function? The book you mention seems to just be showing how to obtain a simulated p-value by using r2dtable, rather than an alternative approach. — SamPassmore, Apr 06 '20 at 08:23
The approach should work for any test statistic, and thus for any test on contingency tables. I'm not sure how this approach differs from the simulation approach in chisq.test(), sorry. That might be a good separate question. — Stephan Kolassa, Apr 06 '20 at 08:27
