Data Request
I am looking for data that lists English words along with the probability that a person will misspell each one.
The probability depends on various factors. However, a simple measure would be frequency of wrong spellings divided by the total frequency of the word.
Alternatively, the data can also be of a word with a class denoting its likelihood to be mistaken.
Context
I want to create a basic NLP project that makes a prediction: given a word, what is the probability of misspelling it?
Region
A US or UK English variant would be preferred, although other variants are also fine.
License
Open data.
Format
A CSV-based format would be appreciated. Some examples are given below.
Word,Frequency,FrequencyOfErrors
apple,100,2
asphyxiated,54,14
...
OR
Word,ProbabilityOfError
apple,0.02
asphyxiated,0.26
...
However, CSV is not required; any machine-readable format would suffice.
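The two layouts above carry the same information: the second follows from the first by dividing the error count by the total count. A minimal Python sketch of that conversion, assuming the column names from the first example and rounding to two decimal places (the function name counts_to_probabilities is only illustrative):

```python
import csv
import io

def counts_to_probabilities(csv_text):
    """Turn Word,Frequency,FrequencyOfErrors rows into (word, probability) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        total = int(row["Frequency"])
        errors = int(row["FrequencyOfErrors"])
        # Skip words with no observations to avoid dividing by zero.
        if total == 0:
            continue
        rows.append((row["Word"], round(errors / total, 2)))
    return rows

# The two example rows from the first layout.
sample = "Word,Frequency,FrequencyOfErrors\napple,100,2\nasphyxiated,54,14\n"
print(counts_to_probabilities(sample))
# [('apple', 0.02), ('asphyxiated', 0.26)]
```

So a dataset in either layout would work for me, as long as the counts behind the probabilities are large enough to be meaningful.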
Authority
Data from a reliable source, for example an academic body, would be ideal. Crowdsourced data would also be acceptable.
Requirements
There should be a large number of words - 20,000 or more.
The individual words should have moderately high frequencies of occurrence - 50 or more.
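To make the two requirements concrete, here is a rough Python sketch of how I would check a candidate dataset. It assumes the data has already been loaded into a dict mapping each word to its total frequency, and it reads the requirements as "at least 20,000 words remain after dropping words seen fewer than 50 times"; the names below are hypothetical.

```python
MIN_WORDS = 20_000      # requirement: 20,000 or more words
MIN_FREQUENCY = 50      # requirement: each word occurs 50 or more times

def meets_requirements(frequencies, min_words=MIN_WORDS, min_frequency=MIN_FREQUENCY):
    """Return (ok, kept): kept holds the words that are frequent enough,
    and ok says whether enough of them remain."""
    kept = {word: freq for word, freq in frequencies.items() if freq >= min_frequency}
    return len(kept) >= min_words, kept

# Toy input with a lowered word-count threshold, just to show the behaviour.
ok, kept = meets_requirements(
    {"apple": 100, "asphyxiated": 54, "zyzzyva": 3}, min_words=2
)
print(ok, sorted(kept))
```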
Non-answers
I found this topic, which provides the age of acquisition of various words. However, I believe there is very little correlation between knowing a word and spelling it correctly.