
Naturally the answer to the former would be 0 in case of a perfect fit. But how would one go about finding out the value of the worst possible fit?

Background: I'm doing a simple experiment to crack a Caesar cipher, which is in no way more efficient than just brute-forcing the 26 possible shifts, but that's beside the point. Anyway, the strategy is to generate a histogram of the characters in an encoded text and, from that, a probability vector containing the relative frequencies of all letters in the alphabet. So the observation count is the number of letters in said text, and there are 26 character frequencies to check against.

After doing the same with a longer example text, and thus generating a vector of expected occurrences, I compute the chi-squared statistic between both vectors once for each of the 26 possible shifts of the ciphered text, to arrive at the most probable shift, i.e. the one with the minimum chi-squared value.

So on every iteration I check against the current minimum chi-squared value and update it if appropriate. I'd like to initialize that value with its theoretical maximum, which brings me to the question of how to determine that.
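A minimal sketch of the loop described above, assuming lowercase a..z only, a reference text that contains every letter at least once (otherwise an expected count would be zero), and a ciphertext with at least one letter. All function names are hypothetical. Instead of the theoretical maximum, it starts the running minimum at `float("inf")`, which is one way to sidestep the question rather than answer it. Rotating the 26 counts is used in place of re-encoding the text; both give the same histogram.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase  # 26 letters, a..z


def letter_counts(text):
    """Count a..z in `text` (case-insensitive); other characters are ignored."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    return [counts[ch] for ch in ALPHABET]


def caesar_encrypt(text, shift):
    """Shift each letter forward by `shift` places; output is lowercase."""
    out = []
    for ch in text.lower():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def chi_squared(observed, probs):
    """Pearson goodness-of-fit statistic; assumes every entry of `probs` is > 0."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def best_shift(ciphertext, reference_text):
    """Return (shift, chi2) for the shift whose decoded histogram fits best."""
    ref = letter_counts(reference_text)
    total = sum(ref)
    probs = [c / total for c in ref]  # assumption: no letter has zero count
    cipher = letter_counts(ciphertext)

    # Infinity stands in for the unknown theoretical maximum.
    best, best_chi2 = None, float("inf")
    for shift in range(26):
        # Undoing a forward shift moves the count at index i + shift to index i.
        decoded = [cipher[(i + shift) % 26] for i in range(26)]
        chi2 = chi_squared(decoded, probs)
        if chi2 < best_chi2:
            best, best_chi2 = shift, chi2
    return best, best_chi2
```

Calling `best_shift(ciphertext, reference_text)` returns the shift that was used to encode, together with its chi-squared value; `caesar_encrypt(ciphertext, 26 - shift)` would then recover the (lowercased) plaintext.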

From what I've gathered it's supposed to be $N(k-1)$, with $N$ being the number of observations (so in this case the total number of characters in the encoded text?) and $k$ being the size of the lesser of both vectors (both 26 in this case?). But the value I arrive at with said formula seems far too high. Am I getting variables mixed up? Stumped.

I reckon this is a rather noobish question, and I apologize. Thanks a lot in advance to anyone caring to answer!

  • I believe that $N(k-1)$ value is for a null of uniform probabilities over the categories, which (if I understand your question right) you don't have. Consider the simple case of two categories with null probabilities $1/100$ and $99/100$ and observed counts of $100$ and $0$ respectively. $N(k-1)=100$ but the chi-squared value here would be $9900$. Clearly, then, the maximum possible chi-squared statistic is a function of the set of null probabilities not just the sample size and the number of categories. – Glen_b Jul 30 '22 at 08:16
  • I haven't checked but I think you'd maximize it by putting all the count into the category with lowest probability. If that's so, I think the maximum value is $N(1/p_\text{min}-1)$ – Glen_b Jul 30 '22 at 08:24
  • Further, the minimum isn't always $0$; consider again two categories, with $N=20$ and probabilities $1/3$ and $2/3$; in that case, the smallest possible chi-squared statistic is $0.025$, rather than $0$. It looks like several of the premises in your question aren't quite correct. – Glen_b Jul 30 '22 at 08:33
  • @Glen_b hey, thanks for your comments. I was referring to this question: https://stats.stackexchange.com/questions/13211/what-is-the-maximum-for-pearsons-chi-square-statistic – jaderpansen Jul 30 '22 at 10:34
  • Also thanks for pointing out the fact that it's not always possible to arrive at zero since we're dealing with discrete observations, fair point. – jaderpansen Jul 30 '22 at 10:50
  • The question and answer there are dealing with the chi-squared test of independence. You're asking about the chi-squared test for (multinomial) goodness of fit. – Glen_b Jul 30 '22 at 18:31
  • I see... thanks a lot for the heads-up and your patience. – jaderpansen Jul 31 '22 at 13:31
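The numbers in the comments above can be checked with a short sketch (function names are hypothetical). The bound $N(1/p_\text{min}-1)$ is consistent with writing the statistic as $\sum_i O_i^2/(N p_i) - N$: this is convex in the counts, so over all count vectors summing to $N$ it should peak when every observation falls into a single category, and the lowest-probability category gives the largest value. The brute-force helper below only covers the two-category case.

```python
def chi_squared(observed, probs):
    """Pearson goodness-of-fit statistic; assumes every entry of `probs` is > 0."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def max_chi_squared(n, probs):
    """Largest attainable statistic: all n observations in the rarest category."""
    return n * (1 / min(probs) - 1)


def two_category_extremes(n, p_first):
    """Brute-force (min, max) of the statistic over every split of n into two counts."""
    probs = [p_first, 1 - p_first]
    values = [chi_squared([o, n - o], probs) for o in range(n + 1)]
    return min(values), max(values)


# Example from the comments: null probabilities 1/100 and 99/100, counts 100 and 0.
example = chi_squared([100, 0], [0.01, 0.99])
bound = max_chi_squared(100, [0.01, 0.99])

# Example from the comments: N = 20 with probabilities 1/3 and 2/3.
smallest, largest = two_category_extremes(20, 1 / 3)
```

Up to floating-point rounding, `example` and `bound` both come out at 9900, `smallest` at 0.025 (counts of 7 and 13), and `largest` at 40, which matches `max_chi_squared(20, [1/3, 2/3])`.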
