
I am essentially looking for a parallel corpus that has the dirty original user-generated input on one side and a cleaned-up version on the other, with corrected spelling, casing, punctuation, and formatting.

So the dirty side should have been composed by human users, and the clean side should have been edited by human linguists:

'how are you', 'How are you?'
'is their bad waether?', 'Is there bad weather?'
'As of 2011, it was live.', 'As of 2011, it was live.'
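
For illustration, here is a minimal Python sketch of loading such pairs, assuming they were collected into a hypothetical tab-separated file `pairs.tsv`:

```python
# Minimal sketch of consuming such a corpus, assuming one
# (dirty, clean) pair per line in a hypothetical TSV file.
import csv

def load_pairs(path):
    """Yield (dirty, clean) sentence pairs from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        for dirty, clean in csv.reader(f, delimiter="\t"):
            yield dirty, clean

# e.g. ('is their bad waether?', 'Is there bad weather?')
pairs = list(load_pairs("pairs.tsv"))
```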

Autocorrect data would also be useful. It would need counts or proportional statistics that imply the relative probabilities:

'how', 'How', 0.5
'waether', 'weather', 0.7
'waether', 'whether', 0.1

(This will then be used to train machines to generate more of the dirty content, given cleanish content, thereby augmenting a text data set.)
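
A minimal sketch of that augmentation step, assuming the statistics are inverted to run clean-to-dirty (the reverse of the table above); all entries are the made-up examples from above, not real data:

```python
# Illustrative sketch of the augmentation step: the statistics are read
# in the clean -> dirty direction, and a clean token is replaced by a
# dirty variant with the listed probability.
import random

CORRUPTIONS = {
    "How": [("how", 0.5)],
    "weather": [("waether", 0.7)],
    "whether": [("waether", 0.1)],
}

def corrupt(token, table=CORRUPTIONS, rng=random):
    """Sample a dirty variant of a clean token; keep the token
    unchanged with the leftover probability mass."""
    variants = table.get(token, [])
    choices = [dirty for dirty, _ in variants] + [token]
    weights = [p for _, p in variants] + [1.0 - sum(p for _, p in variants)]
    return rng.choices(choices, weights=weights, k=1)[0]

print(" ".join(corrupt(t) for t in "How is the weather ?".split()))
```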

1 Answer


OK, this one is a stretch... but instead of comparing text with errors to the corrected text, you can compare original text against incorrect transcriptions.

Reuters Transcribed Subset data set

This data was created by selecting 20 files each from the 10 largest classes in the Reuters-21578 collection. The files were read out by 3 Indian speakers and an Automatic Speech Recognition (ASR) system was used to generate the transcripts.

Original text - Reuters-21578

You would just have to match the transcribed text with the original.
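
For example, a rough similarity search with Python's difflib could pair each transcript with its closest original; the `transcripts/` and `originals/` directory names here are assumptions about how the two data sets were unpacked:

```python
# Rough sketch of the matching step: pair each ASR transcript with
# the most similar Reuters original by overall sequence similarity.
import difflib
import pathlib

def best_match(transcript, originals):
    """Return the original text most similar to the transcript."""
    return max(originals,
               key=lambda orig: difflib.SequenceMatcher(None, transcript, orig).ratio())

originals = [p.read_text(encoding="utf-8")
             for p in pathlib.Path("originals").glob("*.txt")]
for path in pathlib.Path("transcripts").glob("*.txt"):
    match = best_match(path.read_text(encoding="utf-8"), originals)
    print(path.name, "->", match[:60])
```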

philshem