This may be an odd question for this site, but tonight I've been enjoying myself by creating a small script that produces (is supposed to produce) sample sentences that resemble English, while being total gibberish.
The idea came from reading a question on StackOverflow.com that involved word wrapping of text. Some people use the Lorem Ipsum text to generate sample text for demonstration purposes, and I thought this would be a nice use for a random text generator.
The very intriguing Wug test was also at the back of my mind, as was the fact that it is relatively easy to read a sentence with scrambled words, as long as the first and last letters of each word remain the same. For example:
Ocne uopn a mndihgit derary, whlie I peonredd waek and warey oevr mnay a quinat and ciruous vumole of fgtorteon lero,
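For reference, the scrambling itself is easy to reproduce. Here is a minimal Python sketch, assuming a "word" is any run of ASCII letters and that words of three letters or fewer are left alone; the function names are my own and purely illustrative:

```python
import random
import re

def scramble_word(word, rng=random):
    """Shuffle the interior letters of a word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle in very short words
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text, rng=random):
    # Only runs of letters are scrambled; punctuation and spaces stay put.
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)

print(scramble_text("Once upon a midnight dreary, while I pondered weak and weary"))
```

The output differs on every run unless a seeded `random.Random` instance is passed as `rng`.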
I have done some research on the following (for English):
- Word length distribution (Using an approximation of Zipf's Law I found online)
- Letter distribution and first letter distribution
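To make the two ingredients above concrete, here is a rough Python sketch of drawing a word length and then letters by weight. The letter table holds rounded, commonly quoted approximate frequencies for English text; the word-length weights are made-up placeholders, not measured data, and stand in for whatever Zipf-style approximation is actually used. The names `LENGTH_WEIGHTS`, `weighted_choice` and `random_word` are hypothetical.

```python
import random

# Approximate relative letter frequencies in English text (percent, rounded).
LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}

# PLACEHOLDER weights for word lengths 1..10. These are invented for the
# example; substitute a real word-length distribution.
LENGTH_WEIGHTS = {1: 3, 2: 17, 3: 21, 4: 16, 5: 11, 6: 9, 7: 8, 8: 6, 9: 4, 10: 3}

def weighted_choice(weights, rng):
    """Pick one key from a {key: weight} mapping."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

def random_word(rng, first_letter_freq=None):
    """Draw a length, then a first letter, then the remaining letters."""
    length = weighted_choice(LENGTH_WEIGHTS, rng)
    # Fall back to the overall table if no first-letter table is supplied.
    first = weighted_choice(first_letter_freq or LETTER_FREQ, rng)
    rest = rng.choices(list(LETTER_FREQ), weights=list(LETTER_FREQ.values()),
                       k=length - 1)
    return first + "".join(rest)

rng = random.Random()
print(" ".join(random_word(rng) for _ in range(8)))
```

Each letter is drawn independently here, which is exactly why the result contains unpronounceable clusters.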
With some random punctuation and capitalization added, it is looking pretty good, but I need some simple algorithms to make the words look more realistic. Here's a sample text:
Ynssdto lcianche ttlkise aaricod oawsepje. Hast tvnvcfaiesont eteoy prae wwecofuothenroo nmtnhglw lmhwefc etlugloe. Ywio odhw, chlt dhpei tiaqirter, sorrdstg aontli kayhut, tnust, berv dosp wrhhys sblfm. Nkttrbfoeret thpit atea aoecwb ctwrhfae oneeot selm teihug ttolgktrwwmc, wwrleil sga, isdeedeo adnrsi, aydhd asroino dhddonn, lrctp gckort ikhcvo. Tvte hzmdosnd wsad a cwfndoac drnsrtsaths
Obviously, words should contain at least one vowel. It might in fact be an idea to make vowel insertion a distinct part of the process. Some consonants should not follow each other (e.g. tvnvcf), and there should not be too many consonants in a row.
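These constraints can be expressed as a simple filter that rejects implausible candidates. The Python sketch below is only an illustration: treating "y" as a vowel, capping consonant runs at three, and the small set of forbidden pairs (lifted from the "tvnvcf" example) are all assumptions of mine, not established rules of English.

```python
import re

VOWELS = "aeiouy"            # assumption: "y" is treated as a vowel
MAX_CONSONANT_RUN = 3        # assumption: arbitrary cap, tune to taste
# Illustrative only: pairs taken from the "tvnvcf" example in the text.
BAD_PAIRS = {"tv", "vn", "nv", "vc", "cf"}

def looks_plausible(word):
    """Return True if a letters-only word passes the three checks."""
    w = word.lower()
    # 1. At least one vowel.
    if not any(ch in VOWELS for ch in w):
        return False
    # 2. No overly long run of consonants.
    runs = re.findall(f"[^{VOWELS}]+", w)
    if any(len(run) > MAX_CONSONANT_RUN for run in runs):
        return False
    # 3. No forbidden consonant pair anywhere in the word.
    if any(w[i:i + 2] in BAD_PAIRS for i in range(len(w) - 1)):
        return False
    return True

sample = ["hast", "prae", "tvnvcfaiesont", "lmhwefc", "sblfm", "berv"]
print([w for w in sample if looks_plausible(w)])
```

A generator could keep drawing candidates until one passes, although rejection alone says nothing about where the vowels should go.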
I was looking for a distribution of the last letters in English words, but that may not be applicable, since word endings can be fairly similar (ing, ane, tion, able, etc), and that might add some familiarity to the sentences.
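If familiar endings turn out to be desirable, one way to experiment is to append one occasionally rather than on every word. A small Python sketch, where the list contains only the endings mentioned above and the 20% probability is an arbitrary assumption:

```python
import random

ENDINGS = ("ing", "ane", "tion", "able")  # the endings mentioned in the text

def add_ending(stem, rng, probability=0.2):
    """Append a familiar ending to a generated stem with the given probability."""
    if rng.random() < probability:
        return stem + rng.choice(ENDINGS)
    return stem

rng = random.Random()
print(" ".join(add_ending(s, rng) for s in ["hast", "prae", "berv", "selm"]))
```

Raising or lowering `probability` is a quick way to judge how much familiarity the endings add.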
I'm looking for ideas. Links to resources. Rules of thumb. What can I do to make my script spout more legible gibberish?
In short, what are the general rules for building an English-looking word?