The goal is to find out whether the difference in the frequency of a word "X" between two texts/corpora is statistically significant. The data form a $2\times 2$ contingency table (word X vs. all other words, corpus A vs. corpus B) of observed frequencies, from which the expected frequencies are derived.
To test the difference, I use a chi-square test. But which one exactly: the chi-square test of independence?
- $\alpha = 0.05$
- one degree of freedom
- critical value of the chi-square distribution -> 3.84
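As a quick check on the critical value in the list above: with one degree of freedom, a chi-square variable is the square of a standard normal variable, so the critical value at $\alpha = 0.05$ is the squared 97.5% normal quantile. A minimal sketch using only the Python standard library (the variable names are my own):

```python
from statistics import NormalDist

alpha = 0.05

# For df = 1, chi-square = Z**2, so P(chi2 > c) = alpha
# is equivalent to P(|Z| > sqrt(c)) = alpha.
z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile, about 1.96
critical_value = z ** 2

print(round(critical_value, 2))  # 3.84
```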
The formulated hypotheses:
- null hypothesis: The frequency of the word "X" in text/corpus A does not differ significantly from the frequency of the word "X" in text/corpus B
- alternative hypothesis: The frequency of the word "X" in text/corpus A is significantly higher than the frequency of the word "X" in text/corpus B
- Observed Frequency of word X in Corpus A: 2901
- Observed Frequency of word X in Corpus B: 3019
- Observed Frequency of all other words in Corpus A: 90381
- Observed Frequency of all other words in Corpus B: 80281
- Total words of Corpus A: 93282
- Total words of Corpus B: 83300
Result: The chi-square statistic is 35.9258. This value is greater than the critical value of 3.84, so the result is significant at $\alpha = 0.05$. Is it necessary to calculate the p-value, or does it suffice to say that the test statistic of 35.9258 exceeds the critical value of 3.84 of the chi-square distribution?
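For reference, here is a minimal sketch of how the statistic above can be reproduced from the observed counts. It assumes the uncorrected Pearson statistic (no Yates continuity correction), which is the variant that matches the quoted value, and it uses only the Python standard library. The p-value line relies on the fact that, for one degree of freedom, the upper-tail probability equals `erfc(sqrt(chi2 / 2))`. The variable names are my own.

```python
from math import erfc, sqrt

# Observed counts from the lists above.
# Rows: corpus A, corpus B. Columns: word X, all other words.
observed = [[2901, 90381],
            [3019, 80281]]

row_totals = [sum(row) for row in observed]        # 93282, 83300
col_totals = [sum(col) for col in zip(*observed)]  # 5920, 170662
n = sum(row_totals)                                # 176582

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic, without continuity correction.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Upper-tail probability of the chi-square distribution with df = 1.
p_value = erfc(sqrt(chi2 / 2))

print(round(chi2, 2))   # 35.93
print(p_value < 0.05)   # True
```

Comparing `chi2` with the critical value and comparing `p_value` with $\alpha$ lead to the same decision; the p-value only adds information about how far beyond the threshold the result lies.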
Back to our result: The word "X" occurs significantly more often in text/corpus B and not, as assumed in the alternative hypothesis, in text/corpus A. How would you formulate the evaluation of the hypotheses? We clearly reject the null hypothesis because the frequency difference is significant, but what do we do with the alternative hypothesis, whose direction is contradicted by the data?
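The direction of the difference is not visible in the chi-square statistic itself, because it sums squared deviations; the statement that "X" is more frequent in B comes from comparing the relative frequencies. A minimal sketch of that comparison, using the counts from the lists above (the per-1,000-words normalisation and the variable names are my own choices):

```python
# Counts from the lists above.
count_a, total_a = 2901, 93282
count_b, total_b = 3019, 83300

# Relative frequency of word X in each corpus.
rel_a = count_a / total_a
rel_b = count_b / total_b

# Rates per 1,000 words make corpora of different sizes comparable.
print(f"A: {1000 * rel_a:.1f} per 1,000 words")     # A: 31.1 per 1,000 words
print(f"B: {1000 * rel_b:.1f} per 1,000 words")     # B: 36.2 per 1,000 words
print("higher in:", "A" if rel_a > rel_b else "B")  # higher in: B
```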