I am essentially looking for a parallel corpus, with the dirty original user-generated input on one side and a cleaned-up version on the other - with corrected spelling, casing, punctuation and formatting.
So the dirty side should have been composed by human users, and the clean side should have been edited by human linguists:
'how are you', 'How are you?'
'is their bad waether?', 'Is there bad weather?'
'As of 2011, it was live.', 'As of 2011, it was live.'
Autocorrect data would also be useful. It would need counts or proportions that imply the relative probabilities:
'how', 'How', 0.5
'waether', 'weather', 0.7
'waether', 'whether', 0.1
(This will then be used to train machines to generate more of the dirty content, given cleanish content, thereby augmenting a text data set.)
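To make the intended use concrete, here is a minimal sketch of the augmentation step: a clean sentence is "dirtied" by sampling substitutions according to relative probabilities. Note the table is inverted relative to the autocorrect statistics above (clean token mapped to candidate corruptions), and all tokens and weights below are made-up placeholders, not real data:

```python
import random

# Hypothetical clean -> dirty substitution table. Each clean token maps to
# candidate corruptions with probabilities; these values are illustrative
# placeholders, not real autocorrect statistics.
CORRUPTIONS = {
    "How": [("how", 0.5), ("How", 0.5)],
    "weather": [("waether", 0.7), ("whether", 0.1), ("weather", 0.2)],
}

def dirty(sentence: str, rng: random.Random) -> str:
    """Return a noisy variant of `sentence` by sampling per-token corruptions."""
    out = []
    for token in sentence.split():
        candidates = CORRUPTIONS.get(token)
        if candidates:
            words, weights = zip(*candidates)
            token = rng.choices(words, weights=weights, k=1)[0]
        out.append(token)
    return " ".join(out)

print(dirty("How is the weather", random.Random(0)))
```

A model trained on such (clean, dirty) pairs could then replace this table-driven sampler with learned corruptions.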