Is there a location I can download character distributions for frequency analysis used in decryption attempt validation? I am specifically interested in ASCII value [32, 126] frequency distribution for plain English language text. This would imply case-sensitivity and include punctuation. I'm not concerned with data formats.
6
You might have to take a larger body of text (the standard term is 'corpus'), such as BYU's Corpus of Contemporary American English, and reduce it yourself. Most corpora focus on words and phrases, so they might not have all of the punctuation and such that you're asking for. – Joe Sep 27 '13 at 14:45
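As a rough illustration of what "reducing it yourself" could look like, here is a minimal Python sketch that tallies the printable ASCII characters [32, 126] in a piece of text and converts the tallies to relative frequencies. The function name `printable_ascii_frequencies` and the sample string are made up for this example; a real run would read a large corpus file instead.

```python
from collections import Counter

def printable_ascii_frequencies(text):
    """Return relative frequencies of characters with code points 32..126.

    Hypothetical helper for illustration; characters outside the
    printable ASCII range are ignored.
    """
    counts = Counter(ch for ch in text if 32 <= ord(ch) <= 126)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}

# Tiny made-up sample; substitute the contents of a real corpus.
sample = "Hello, World!"
freqs = printable_ascii_frequencies(sample)
```

Because the count is taken on raw characters, case and punctuation are preserved, but the result is only as representative as the corpus you feed it.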
2 Answers
6
Second try: Google Ngram Viewer provides raw counts of 1-, 2-, ...-grams of text, retrieved from its book-scanning endeavor. The 1-grams section contains counts of the occurrence of letters, numbers and even punctuation. They are provided as tab-separated value files, so the frequencies should be derivable with modest scripting effort.
Found via Wikipedia article Text corpus.
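The "modest scripting" could look roughly like the Python sketch below. It assumes each line has the tab-separated layout `ngram, year, match_count, ...`; that column order is an assumption here, so check it against the files you actually download. The function name and the sample lines are invented for illustration and are not real dataset values.

```python
from collections import Counter

def char_counts_from_ngram_lines(lines):
    """Sum per-character counts over 1-gram lines.

    Assumed (unverified) layout: ngram <TAB> year <TAB> match_count <TAB> ...
    Each character of a token is weighted by the token's match_count.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip lines that do not fit the assumed layout
        token, n = fields[0], int(fields[2])
        for ch in token:
            if 32 <= ord(ch) <= 126:  # printable ASCII only
                counts[ch] += n
    return counts

# Made-up sample lines, not actual dataset values.
sample_lines = ["The\t2000\t10\t5\n", "the\t2000\t30\t9\n", ",\t2000\t7\t3\n"]
counts = char_counts_from_ngram_lines(sample_lines)
total = sum(counts.values())
freqs = {ch: n / total for ch, n in counts.items()}
```

One caveat: since 1-grams are individual tokens, the space character never appears in them, so its frequency would have to be estimated separately.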
ojdo
1
What about the (frequency analysis) article's first link to Letter frequency (Wikipedia)? It lists letter distribution for English and other languages, all properly sourced:
Letter Relfreq
----------------
e 12.702%
t 9.056%
a 8.167%
o 7.507%
i 6.966%
n 6.749%
s 6.327%
...
ojdo
I was looking for all ASCII values [0,127] or at least printable ASCII values [32,126]. – recursion.ninja Sep 26 '13 at 13:52
@FreshPrinceOfSO These tables only deal with a case-insensitive alphabet, while the question asks for ASCII frequencies, so at least lowercase and uppercase letters, numbers and punctuation should be included. This is what I tried to address in my second answer. – ojdo Sep 26 '13 at 19:51