
I am essentially looking for a parallel corpus that has the dirty original user-generated input on one side and a cleaned-up version on the other, with corrected spelling, casing, punctuation, and formatting.

So the dirty side should have been composed by human users, and the clean side should have been edited by human linguists:

'how are you', 'How are you?'
'is their bad waether?', 'Is there bad weather?'
'As of 2011, it was live.', 'As of 2011, it was live.'
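
For illustration, here is a minimal Python sketch of loading such pairs, assuming they were collected into a hypothetical tab-separated file `pairs.tsv`:

```python
# Minimal sketch of consuming such a corpus, assuming one
# (dirty, clean) pair per line in a hypothetical TSV file.
import csv

def load_pairs(path):
    """Yield (dirty, clean) sentence pairs from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        for dirty, clean in csv.reader(f, delimiter="\t"):
            yield dirty, clean

# e.g. ('is their bad waether?', 'Is there bad weather?')
pairs = list(load_pairs("pairs.tsv"))
```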

Autocorrect data would also be useful. It would need counts or proportional statistics that imply the relative probabilities:

'how', 'How', 0.5
'waether', 'weather', 0.7
'waether', 'whether', 0.1

(This will then be used to train machines to generate more of the dirty content, given cleanish content, thereby augmenting a text data set.)
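
A minimal sketch of that augmentation step, assuming the statistics are inverted to run clean-to-dirty (the reverse of the table above); all entries are the made-up examples from above, not real data:

```python
# Illustrative sketch of the augmentation step: the statistics are read
# in the clean -> dirty direction, and a clean token is replaced by a
# dirty variant with the listed probability.
import random

CORRUPTIONS = {
    "How": [("how", 0.5)],
    "weather": [("waether", 0.7)],
    "whether": [("waether", 0.1)],
}

def corrupt(token, table=CORRUPTIONS, rng=random):
    """Sample a dirty variant of a clean token; keep the token
    unchanged with the leftover probability mass."""
    variants = table.get(token, [])
    choices = [dirty for dirty, _ in variants] + [token]
    weights = [p for _, p in variants] + [1.0 - sum(p for _, p in variants)]
    return rng.choices(choices, weights=weights, k=1)[0]

print(" ".join(corrupt(t) for t in "How is the weather ?".split()))
```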

1 Answer


OK, this one is a stretch... but instead of comparing text with errors to the corrected text, you can compare original text against incorrect transcriptions.

Reuters Transcribed Subset data set

This data was created by selecting 20 files each from the 10 largest classes in the Reuters-21578 collection. The files were read out by 3 Indian speakers and an Automatic Speech Recognition (ASR) system was used to generate the transcripts.

Original text - Reuters-21578

You would just have to match the transcribed text with the original.
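
For example, a rough similarity search with Python's difflib could pair each transcript with its closest original; the `transcripts/` and `originals/` directory names here are assumptions about how the two data sets were unpacked:

```python
# Rough sketch of the matching step: pair each ASR transcript with
# the most similar Reuters original by overall sequence similarity.
import difflib
import pathlib

def best_match(transcript, originals):
    """Return the original text most similar to the transcript."""
    return max(originals,
               key=lambda orig: difflib.SequenceMatcher(None, transcript, orig).ratio())

originals = [p.read_text(encoding="utf-8")
             for p in pathlib.Path("originals").glob("*.txt")]
for path in pathlib.Path("transcripts").glob("*.txt"):
    match = best_match(path.read_text(encoding="utf-8"), originals)
    print(path.name, "->", match[:60])
```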

philshem